allind / EukDetect

Dude, Where's My Taxonomy? #21

Closed antgonza closed 2 years ago

antgonza commented 2 years ago

Hello,

I ran my samples through the EukDetect pipeline using the latest bowtie2/taxonomy files and everything went fine. Thank you for making it so easy to run.

I was trying to reconcile the *_filtered_hits_table.txt files with the taxonomies in marker_genes_per_species.csv and noticed that a few Taxids reported in my tables are missing from that file: 5579, 5658, 29901, 229219. Is this expected, or am I missing something?

To give a little more detail, one of my tables contains:

Name    Taxid   Observed_markers        Read_counts     Percent_observed_markers        Total_marker_coverage   Percent_identity
Blastomyces     229219  4       7       2.7%    9.48%   100.0%
Malassezia globosa CBS 7966     425265  2       4       1.27%   10.34%  97.99%

When I search for 229219 in marker_genes_per_species.csv, nothing is found. When I search for Blastomyces, I find these entries, but that Taxid is not among them:

$ grep Blastomyces ~/EukDetect/eukdb/marker_genes_per_species.csv
Fungi,Ascomycota,Blastomyces_dermatitidis_ER-3,559297,51
Fungi,Ascomycota,Blastomyces_gilchristii_SLH14081,559298,55
Fungi,Ascomycota,Blastomyces_parvus,2060905,198
Fungi,Ascomycota,Blastomyces_percursus,1658174,200
Fungi,Ascomycota,Blastomyces_silverae,2060906,176
Fungi,Ascomycota,Blastomyces_sp_MA-2018,2164086,186

Another example is 29901, which appears in one of my tables:

Mrakia  29901   2       4       1.89%   4.28%   100.0%

but is not found in the file:

$ grep Mrakia ~/EukDetect/eukdb/marker_genes_per_species.csv
Fungi,Basidiomycota,Mrakia_blollopis,696254,56
Fungi,Basidiomycota,Mrakia_frigida,29902,164
Fungi,Basidiomycota,Mrakia_psychrophila,72568,49

Thank you!

allind commented 2 years ago

This is expected behavior. The taxonomy ID reported in the output is of the Blastomyces genus.

The NCBI taxonomy is hierarchical, and the species taxIDs are children of genus taxIDs. Since there is conflicting evidence at the species level about which Blastomyces is present, EukDetect does not report a specific species. Instead it reports the genus, because it considers there to be enough evidence that something in the Blastomyces genus is present.

More info is in the methods section of the paper.
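
For illustration only, here is a minimal Python sketch (not part of EukDetect) of one way to see which species-level entries sit under a genus-level hit. It assumes the third and fourth comma-separated columns of marker_genes_per_species.csv are the species name and its taxID, as the grep output above suggests, and it matches on the genus name prefix rather than on the real NCBI parent/child links. The helper name `species_for_genus` and the inline `MARKER_ROWS` sample are hypothetical; in practice you would read the actual file.

```python
import csv
import io

# Sample rows copied from the grep output in this thread.
# Assumption: column 3 is the species name, column 4 is its taxID.
MARKER_ROWS = """\
Fungi,Ascomycota,Blastomyces_dermatitidis_ER-3,559297,51
Fungi,Ascomycota,Blastomyces_gilchristii_SLH14081,559298,55
Fungi,Ascomycota,Blastomyces_parvus,2060905,198
Fungi,Ascomycota,Blastomyces_percursus,1658174,200
Fungi,Ascomycota,Blastomyces_silverae,2060906,176
Fungi,Ascomycota,Blastomyces_sp_MA-2018,2164086,186
Fungi,Basidiomycota,Mrakia_blollopis,696254,56
Fungi,Basidiomycota,Mrakia_frigida,29902,164
Fungi,Basidiomycota,Mrakia_psychrophila,72568,49
"""


def species_for_genus(genus, rows_text):
    """Return (species_name, taxid) pairs whose name begins with `genus`.

    This is a name-based heuristic, not a lookup in the NCBI hierarchy.
    """
    matches = []
    for row in csv.reader(io.StringIO(rows_text)):
        if len(row) < 4:
            continue  # skip blank or malformed lines
        name, taxid = row[2], row[3]
        # Species names in the file use underscores; the first token is the genus.
        if name.split("_")[0] == genus:
            matches.append((name, int(taxid)))
    return matches


# Genus names taken from the Name column of the filtered hits table.
for genus in ("Blastomyces", "Mrakia"):
    for name, taxid in species_for_genus(genus, MARKER_ROWS):
        print(genus, name, taxid)
```

Matching by name prefix only works when the hit is reported at genus level and the species names in the file start with that genus name. Hits reported at higher ranks would need the actual taxonomy hierarchy.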

antgonza commented 2 years ago

Thank you for the explanation, I think it makes sense.

Now, is there an easy way to reconcile those values without having to go to https://www.ncbi.nlm.nih.gov/Taxonomy/Browser? I guess something similar to marker_genes_per_species.csv but for all taxonomic levels?

allind commented 2 years ago

Take a look at the output in the {samplename}_filtered_hits_taxonomy.txt files - those provide the markers observed, number of reads observed, and other metrics for every taxonomic level above everything reported in the final filtered output.