Closed fplazaonate closed 10 months ago
Hi @fplaza,
Thanks for the amazing benchmark.
This behaviour is interesting, and I believe it indicates that sylph's statistical model is not generalizing well enough to this set of sequencing conditions. Whether that's strain mixtures, exact sequencing technology, etc, I am not sure.
If you could link me the exact zymo dataset you're using, I would be happy to investigate further and see what's happening.
I'll leave this issue up as I investigate further. If I am unable to fix this and sylph ends up slightly overestimating coverage on real datasets, then I will update the manuscript/README with a warning. Perhaps this is a deeper technical problem that I will need to mathematically investigate on another project.
Thanks,
Jim
A few other points:
I ran an experiment on the Meslier et al. (2022) real dataset for Illumina reads downsampled to 1Gbp as specified on our manuscript.
I used minimap2 and coverm for x-axis coverage. Orange dots are not detected by sylph.
Similar behaviour occurs for the low-coverage genomes as you have presented. I'm using reference genomes with 100% ANI to the community, so this is a best-case scenario as far as detection limits. Our supplementary figures show that lower ANI means slightly lower detection limits.
Hi @bluenote-1577 ,
Thanks for the quick reply.
Here is the Illumina dataset (official link on the Zymo website is dead): https://filesender.renater.fr/?s=download&token=6a4f92f1-fef4-4988-9f5a-01ee3c46fd11 PacBio sequencing data is available here: https://t.co/LyMykkaWCD
I have plotted True coverage. Notably, the results are the same with MetaPhlan4 (using this time relative abundance as MP4 does not compute coverage)
Best, Florian
Hi @fplaza,
I discovered an interesting artifact recently. Can you let me know if your reads are quality controlled? Importantly, are your reads deduplicated?
I found that duplicated reads in Illumina sequencing cause big issues for sylph and low abundances. Even 9% duplicated reads, which fastQC says is high-quality, ruins sylph's statistical model for low abundance calculations. I am working now on an algorithm in sylph for deduplication.
After running fastp -i reads1.fq -I reads2.fq -o dedup.1.fq --out2 dedup.2.fq -D, try running EDIT: See belowsylph sketch -1 dedup.1.fq -2 dedup.2.fq ... etc
instead and let me know how that goes if you have time.
Thanks again for your extremely helpful benchmarks and bringing this important issue up.
Hi @fplaza,
I implemented a new sketching algorithm for paired-end reads in sylph that almost resolves this issue. I will be testing it and releasing the new algorithm in sylph v0.5.0 in the next few weeks.
If you want to try it, see the v0.5.0 branch in the sylph repository. You'll have to resketch your reads, though.
For a real mouse gut genome:
More tests will be to come...
Hi @bluenote-1577,
Reads have been quality controlled with fastp but without deduplication. Duplication rate is around 8%. The same pattern appears on other paired-end Illumina sequencing data (human fecal metagenomes).
Could you create a binary executable for this new version so I can give it a try?
Florian
@fplaza here's a link https://easyupload.io/4qqbgb
You can also compile it if you have rust available. Compilation is usually pretty easy
Run the same commands, making sure using -1
and -2
for sketching.
Here is a comparison for the ZymoBIOMICS Fecal Reference. Results are indeed much better
Same thing on the other human fecal metagenomes (sylph new profiles only)
Awesome!
I will test on my own and improve this in the upcoming week. I'll let you know when sylph v0.5 is out officially.
Much thanks.
Great! I have tested another sequencing technology for which I see the same bias as for Illumina in sylph v0.4. I will open another issue.
Hi @bluenote-1577 , Do you think you can provide a workaround for single end libraries as well?
@fplaza
Yes I can do it for single-end libraries too. I will need to do a bit of testing to make sure deduplication is not too aggressive.
Also: the deduplication procedure in the current sylph v0.5 branch is more up to date than the binary I provided you and the final version will be different than the binary, just FYI.
v0.5.0 Now released on github. Single and paired-end deduplication supported with options to remove deduplication. Closing this issue for now. Conda and static binary will be out in a bit.
Hi @bluenote-1577,
I have downloaded Illumina sequencing data of the ZymoBIOMICS Fecal Reference. Then, I have performed taxonomic profiling with sylph (database: gtdb_r207, parameters: --estimate-unknown --read-length 150) and with meteor, the tool we are currently developping based on species specific marker genes.
Then, I have compared species coverage of both tool in this log-scale scatterplot:
As you can see, both tools have good agreement for high coverage (lambda) species but sylph starts overestimating coverage when it reestimates lambda with its statistical model. In addition, syplh fails to detect species below ~ 0.2x coverage which is one order of magnitude above the detection limit with simulated data.
There are strain mixtures of the same species in this sample that may explain this strange behaviour.
Your feedback is welcomed, Florian