JensUweUlrich / Taxor

Fast and space-efficient taxonomic classification of long reads
BSD 3-Clause "New" or "Revised" License

Estimating abundance on ZYMO sample D6311 log dist #9

Closed humbleflowers closed 2 months ago

humbleflowers commented 3 months ago

Hello developers,

Thank you for the tool. I am benchmarking taxor (version: 0.1.3, SeqAn version: 3.4.0-rc.1) on a ZYMO sample sequenced on ONT, using the prebuilt database containing Archaea, Bacteria, Fungi, and Viruses.

 taxor search --index-file /taxor/refseq-abfv-k22-s12.hixf --query-file ZYMO_D6311_14.nanoq.10.1000.fastq.gz --output-file ZYMO_D6311_14.nanoq.10.1000.taxor --threads 30 --error-rate 0.15

 taxor profile --search-file ZYMO_D6311_14.nanoq.10.1000.taxor --cami-report-file ZYMO_D6311_14.nanoq.10.1000.taxor.cami  --seq-abundance-file ZYMO_D6311_14.nanoq.10.1000.taxor.abundance  --binning-file ZYMO_D6311_14.nanoq.10.1000.taxor.binning --sample-id ZYMO_D6311_14.nanoq.10.1000.taxor  --threads 30

I am surprised to see that taxor predicted 38.59% Viruses in the sample. The CAMI report file shows:

@SampleID:ZYMO_D6311_14.nanoq.10.1000.taxor
@Version:0.10.0
@Ranks:superkingdom|phylum|class|order|family|genus|species
@@TAXID RANK    TAXPATH TAXPATHSN       PERCENTAGE
10239   superkingdom    10239   Viruses 38.5958
2       superkingdom    2       Bacteria        60.8576
1224    phylum  2|1224  Bacteria|Pseudomonadota 0.725765
1239    phylum  2|1239  Bacteria|Bacillota      60.1319
2731618 phylum  10239|2731618   Viruses|Uroviricota     38.3566
2732410 phylum  10239|2732410   Viruses|Hofneiviricota  0.239164
1236    class   2|1224|1236     Bacteria|Pseudomonadota|Gammaproteobacteria     0.725765
2731619 class   10239|2731618|2731619   Viruses|Uroviricota|Caudoviricetes      38.3566
2732411 class   10239|2732410|2732411   Viruses|Hofneiviricota|Faserviricetes   0.239164
91061   class   2|1239|91061    Bacteria|Bacillota|Bacilli      60.1319
        order   10239|2731618|2731619|  Viruses|Uroviricota|Caudoviricetes|     114.635
1385    order   2|1239|91061|1385       Bacteria|Bacillota|Bacilli|Bacillales   59.8872
186826  order   2|1239|91061|186826     Bacteria|Bacillota|Bacilli|Lactobacillales      0.244694
2732094 order   10239|2732410|2732411|2732094   Viruses|Hofneiviricota|Faserviricetes|Tubulavirales     0.239164
72274   order   2|1224|1236|72274       Bacteria|Pseudomonadota|Gammaproteobacteria|Pseudomonadales     0.725765
10860   family  10239|2732410|2732411|2732094|10860     Viruses|Hofneiviricota|Faserviricetes|Tubulavirales|Inoviridae  0.239164
1300    family  2|1239|91061|186826|1300        Bacteria|Bacillota|Bacilli|Lactobacillales|Streptococcaceae     0.244694
135621  family  2|1224|1236|72274|135621        Bacteria|Pseudomonadota|Gammaproteobacteria|Pseudomonadales|Pseudomonadaceae    0.725765
186817  family  2|1239|91061|1385|186817        Bacteria|Bacillota|Bacilli|Bacillales|Bacillaceae       0.398114
186820  family  2|1239|91061|1385|186820        Bacteria|Bacillota|Bacilli|Bacillales|Listeriaceae      59.4891
1301    genus   2|1239|91061|186826|1300|1301   Bacteria|Bacillota|Bacilli|Lactobacillales|Streptococcaceae|Streptococcus       0.244694
1386    genus   2|1239|91061|1385|186817|1386   Bacteria|Bacillota|Bacilli|Bacillales|Bacillaceae|Bacillus      0.398114
1623287 genus   10239|2731618|2731619|||1623287 Viruses|Uroviricota|Caudoviricetes|||Detrevirus 0.24911
1637    genus   2|1239|91061|1385|186820|1637   Bacteria|Bacillota|Bacilli|Bacillales|Listeriaceae|Listeria     59.4891
2560098 genus   10239|2731618|2731619|||2560098 Viruses|Uroviricota|Caudoviricetes|||Beetrevirus        0.185685
2732875 genus   10239|2732410|2732411|2732094|10860|2732875     Viruses|Hofneiviricota|Faserviricetes|Tubulavirales|Inoviridae|Primolicivirus   0.239164
286     genus   2|1224|1236|72274|135621|286    Bacteria|Pseudomonadota|Gammaproteobacteria|Pseudomonadales|Pseudomonadaceae|Pseudomonas        0.725765
1129145 species 10239|2731618|2731619||||1129145        Viruses|Uroviricota|Caudoviricetes||||Pseudomonas phage phi297  0.110057
1129146 species 10239|2731618|2731619|||1623287|1129146 Viruses|Uroviricota|Caudoviricetes|||Detrevirus|Detrevirus PMG1 0.24911
1225792 species 10239|2731618|2731619||||1225792        Viruses|Uroviricota|Caudoviricetes||||Pseudomonas phage JBD25   0.437994
1449437 species 10239|2731618|2731619||||1449437        Viruses|Uroviricota|Caudoviricetes||||Pseudomonas phage vB_PaeP_Tr60_Ab31       0.104323
1458852 species 10239|2731618|2731619||||1458852        Viruses|Uroviricota|Caudoviricetes||||Listeria phage LP-030-3   26.7424
1591073 species 10239|2731618|2731619||||1591073        Viruses|Uroviricota|Caudoviricetes||||Listeria phage vB_LmoS_293        7.65777
1639    species 2|1239|91061|1385|186820|1637|1639      Bacteria|Bacillota|Bacilli|Bacillales|Listeriaceae|Listeria|Listeria monocytogenes      0.255591
1642    species 2|1239|91061|1385|186820|1637|1642      Bacteria|Bacillota|Bacilli|Bacillales|Listeriaceae|Listeria|Listeria innocua    16.7856
1755689 species 10239|2731618|2731619||||1755689        Viruses|Uroviricota|Caudoviricetes||||Pseudomonas phage YMC11/02/R656   0.132222
1777052 species 10239|2731618|2731619||||1777052        Viruses|Uroviricota|Caudoviricetes||||Pseudomonas phage JBD44   0.182903
2011081 species 10239|2732410|2732411|2732094|10860|2732875|2011081     Viruses|Hofneiviricota|Faserviricetes|Tubulavirales|Inoviridae|Primolicivirus|Primolicivirus Pf1        0.239164
2545800 species 2|1224|1236|72274|135621|286|2545800    Bacteria|Pseudomonadota|Gammaproteobacteria|Pseudomonadales|Pseudomonadaceae|Pseudomonas|Pseudomonas sp. FDAARGOS_761   0.162082
2560663 species 10239|2731618|2731619|||2560098|2560663 Viruses|Uroviricota|Caudoviricetes|||Beetrevirus|Beetrevirus JBD67      0.185685
2678528 species 2|1239|91061|1385|186820|1637|2678528   Bacteria|Bacillota|Bacilli|Bacillales|Listeriaceae|Listeria|Listeria sp. LM90SB2        42.4479
2866282 species 2|1224|1236|72274|135621|286|2866282    Bacteria|Pseudomonadota|Gammaproteobacteria|Pseudomonadales|Pseudomonadaceae|Pseudomonas|Pseudomonas sp. PS1(2021)      0.330644
287     species 2|1224|1236|72274|135621|286|287        Bacteria|Pseudomonadota|Gammaproteobacteria|Pseudomonadales|Pseudomonadaceae|Pseudomonas|Pseudomonas aeruginosa 0.23304

Is there any way I can fix this? According to the ZYMO website, these are the expected proportions:

Listeria monocytogenes - 89.1%, Pseudomonas aeruginosa - 8.9%, Bacillus subtilis - 0.89%, Saccharomyces cerevisiae - 0.89%, Escherichia coli - 0.089%, Salmonella enterica - 0.089%, Lactobacillus fermentum - 0.0089%, Enterococcus faecalis - 0.00089%, Cryptococcus neoformans - 0.00089%, and Staphylococcus aureus - 0.000089%.
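
To see at a glance how much of each rank is attributed to viruses versus bacteria, the report can be summarized per rank and superkingdom. The following is only a minimal sketch, assuming the CAMI profile is tab-separated with the five columns shown in the header line; the function names `parse_cami_profile` and `rank_totals` are made up for illustration and are not part of taxor.

```python
from collections import defaultdict


def parse_cami_profile(text):
    """Parse CAMI profile text into (taxid, rank, taxpath, taxpathsn, pct) tuples.

    Assumption: columns are tab-separated, as in the CAMI profiling format.
    Header lines starting with '@' and blank lines are skipped.
    """
    rows = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.split("\t")
        if len(fields) != 5:
            continue  # skip malformed lines rather than guessing
        taxid, rank, taxpath, taxpathsn, pct = (f.strip() for f in fields)
        rows.append((taxid, rank, taxpath, taxpathsn, float(pct)))
    return rows


def rank_totals(rows):
    """Sum percentages per rank, grouped by the first name in TAXPATHSN."""
    totals = defaultdict(lambda: defaultdict(float))
    for _taxid, rank, _taxpath, taxpathsn, pct in rows:
        superkingdom = taxpathsn.split("|")[0]
        totals[rank][superkingdom] += pct
    return totals


# A few rows copied from the report above (tabs restored by hand).
example = "\n".join([
    "@@TAXID\tRANK\tTAXPATH\tTAXPATHSN\tPERCENTAGE",
    "10239\tsuperkingdom\t10239\tViruses\t38.5958",
    "2\tsuperkingdom\t2\tBacteria\t60.8576",
    "\torder\t10239|2731618|2731619|\tViruses|Uroviricota|Caudoviricetes|\t114.635",
    "1385\torder\t2|1239|91061|1385\tBacteria|Bacillota|Bacilli|Bacillales\t59.8872",
])

totals = rank_totals(parse_cami_profile(example))
for rank, by_kingdom in totals.items():
    for kingdom, pct in by_kingdom.items():
        print(f"{rank}\t{kingdom}\t{pct:.4f}")
```

Run on the full report, a summary like this also makes it easy to spot ranks whose percentages do not add up, such as the order row with an empty taxid and a value above 100.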

JensUweUlrich commented 2 months ago

This is an issue we have observed with all tools in our benchmarking of taxonomic abundance estimation. When you use a database that contains both bacteria and viruses, all tools will assign a number of bacterial reads to phages that infect the respective bacterial species. The indexed database has a much bigger impact on the results than the tool used. So in your case, it would make sense to use a bacteria-only database. If your nanopore reads are of high quality, I would also try reducing the accepted error rate to 0.05, which could also resolve the issue.
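
As a rough illustration of the suggested rerun, the sketch below assembles the same `taxor search` call as in the original post, with the error rate lowered to 0.05. It uses only the flags that appear in that command. The index path `/path/to/bacteria-only.hixf` and the output file name are placeholders, not real files, and the helper `build_taxor_search_cmd` is hypothetical.

```python
import shlex


def build_taxor_search_cmd(index_file, query_file, output_file,
                           threads=30, error_rate=0.05):
    """Build the argument list for `taxor search` using the flags from the post."""
    return [
        "taxor", "search",
        "--index-file", index_file,
        "--query-file", query_file,
        "--output-file", output_file,
        "--threads", str(threads),
        "--error-rate", str(error_rate),
    ]


cmd = build_taxor_search_cmd(
    "/path/to/bacteria-only.hixf",                 # placeholder: bacteria-only index
    "ZYMO_D6311_14.nanoq.10.1000.fastq.gz",        # query file from the original post
    "ZYMO_D6311_14.nanoq.10.1000.bacteria.taxor",  # placeholder output name
)

# Print the command for inspection; pass `cmd` to subprocess.run to execute it.
print(shlex.join(cmd))
```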

humbleflowers commented 2 months ago

Thank you @JensUweUlrich, that makes sense. I am using new R10.4 library data, so I will try reducing the error rate.