bluenote-1577 / sylph

ultrafast taxonomic profiling and genome querying for metagenomic samples by abundance-corrected minhash.
MIT License
183 stars 7 forks source link

Unexpected behaviour on the ZymoBIOMICS Fecal Reference #5

Closed fplazaonate closed 10 months ago

fplazaonate commented 10 months ago

Hi @bluenote-1577,

I have downloaded Illumina sequencing data of the ZymoBIOMICS Fecal Reference. Then, I have performed taxonomic profiling with sylph (database: gtdb_r207, parameters: --estimate-unknown --read-length 150) and with meteor, the tool we are currently developping based on species specific marker genes.

Then, I have compared species coverage of both tool in this log-scale scatterplot: image

As you can see, both tools have good agreement for high coverage (lambda) species but sylph starts overestimating coverage when it reestimates lambda with its statistical model. In addition, syplh fails to detect species below ~ 0.2x coverage which is one order of magnitude above the detection limit with simulated data.

There are strain mixtures of the same species in this sample that may explain this strange behaviour.

Your feedback is welcomed, Florian

bluenote-1577 commented 10 months ago

Hi @fplaza,

Thanks for the amazing benchmark.

This behaviour is interesting, and I believe it indicates that sylph's statistical model is not generalizing well enough to this set of sequencing conditions. Whether that's strain mixtures, exact sequencing technology, etc, I am not sure.

If you could link me the exact zymo dataset you're using, I would be happy to investigate further and see what's happening.

I'll leave this issue up as I investigate further. If I am unable to fix this and sylph ends up slightly overestimating coverage on real datasets, then I will update the manuscript/README with a warning. Perhaps this is a deeper technical problem that I will need to mathematically investigate on another project.

Thanks,

Jim

bluenote-1577 commented 10 months ago

A few other points:

  1. I should mention that in Supplementary Fig. 8 in our manuscript (version 1, as of this post), you can see that sylph underestimates ANI a bit for a real illumina dataset. This is equivalent to overestimating coverage, which you have shown.
  2. Also, in Fig. 1 C, sylph overestimates coverage a bit on even synthetic datasets for very low effective coverage
  3. Note the difference in "effective coverage" versus 'coverage". The former takes into account sequencing error and a read-length factor, and is smaller than "coverage". I'm not sure if your synthetic results used effective or true coverage (specified by -u and --read-length options)
bluenote-1577 commented 10 months ago

I ran an experiment on the Meslier et al. (2022) real dataset for Illumina reads downsampled to 1Gbp as specified on our manuscript.

image

I used minimap2 and coverm for x-axis coverage. Orange dots are not detected by sylph.

Similar behaviour occurs for the low-coverage genomes as you have presented. I'm using reference genomes with 100% ANI to the community, so this is a best-case scenario as far as detection limits. Our supplementary figures show that lower ANI means slightly lower detection limits.

fplazaonate commented 10 months ago

Hi @bluenote-1577 ,

Thanks for the quick reply.

Here is the Illumina dataset (official link on the Zymo website is dead): https://filesender.renater.fr/?s=download&token=6a4f92f1-fef4-4988-9f5a-01ee3c46fd11 PacBio sequencing data is available here: https://t.co/LyMykkaWCD

I have plotted True coverage. Notably, the results are the same with MetaPhlan4 (using this time relative abundance as MP4 does not compute coverage) image

Best, Florian

bluenote-1577 commented 10 months ago

Hi @fplaza,

I discovered an interesting artifact recently. Can you let me know if your reads are quality controlled? Importantly, are your reads deduplicated?

I found that duplicated reads in Illumina sequencing cause big issues for sylph and low abundances. Even 9% duplicated reads, which fastQC says is high-quality, ruins sylph's statistical model for low abundance calculations. I am working now on an algorithm in sylph for deduplication.

After running fastp -i reads1.fq -I reads2.fq -o dedup.1.fq --out2 dedup.2.fq -D, try running sylph sketch -1 dedup.1.fq -2 dedup.2.fq ... etc instead and let me know how that goes if you have time. EDIT: See below

Thanks again for your extremely helpful benchmarks and bringing this important issue up.

bluenote-1577 commented 10 months ago

Hi @fplaza,

I implemented a new sketching algorithm for paired-end reads in sylph that almost resolves this issue. I will be testing it and releasing the new algorithm in sylph v0.5.0 in the next few weeks.

If you want to try it, see the v0.5.0 branch in the sylph repository. You'll have to resketch your reads, though.

For a real mouse gut genome:

image

More tests will be to come...

fplazaonate commented 10 months ago

Hi @bluenote-1577,

Reads have been quality controlled with fastp but without deduplication. Duplication rate is around 8%. The same pattern appears on other paired-end Illumina sequencing data (human fecal metagenomes).

image

Could you create a binary executable for this new version so I can give it a try?

Florian

bluenote-1577 commented 10 months ago

@fplaza here's a link https://easyupload.io/4qqbgb

You can also compile it if you have rust available. Compilation is usually pretty easy

Run the same commands, making sure using -1 and -2 for sketching.

fplazaonate commented 10 months ago

Here is a comparison for the ZymoBIOMICS Fecal Reference. Results are indeed much better image

Same thing on the other human fecal metagenomes (sylph new profiles only) image

bluenote-1577 commented 10 months ago

Awesome!

I will test on my own and improve this in the upcoming week. I'll let you know when sylph v0.5 is out officially.

Much thanks.

fplazaonate commented 10 months ago

Great! I have tested another sequencing technology for which I see the same bias as for Illumina in sylph v0.4. I will open another issue.

fplazaonate commented 10 months ago

Hi @bluenote-1577 , Do you think you can provide a workaround for single end libraries as well?

bluenote-1577 commented 10 months ago

@fplaza

Yes I can do it for single-end libraries too. I will need to do a bit of testing to make sure deduplication is not too aggressive.

Also: the deduplication procedure in the current sylph v0.5 branch is more up to date than the binary I provided you and the final version will be different than the binary, just FYI.

bluenote-1577 commented 10 months ago

v0.5.0 Now released on github. Single and paired-end deduplication supported with options to remove deduplication. Closing this issue for now. Conda and static binary will be out in a bit.