amkozlov / sativa

Semi-Automatic Taxonomy Improvement and Validation Algorithm
GNU General Public License v3.0

How to understand the levels in the result file? #5

Open billzt opened 1 year ago

billzt commented 1 year ago

This is a row from my result file:

;SeqID  MislabeledLevel OriginalLabel   ProposedLabel   Confidence  OriginalTaxonomyPath    ProposedTaxonomyPath    PerRankConfidence
MN952976    Domain      Pempheriformes  0.999   Metazoa;Chordata;Actinopteri;;Lactariidae;Lactarius;Lactarius lactarius Metazoa;Chordata;Actinopteri;Pempheriformes 0.999;0.999;0.999;0.999

My questions are: (1) Why is the MislabeledLevel "Domain"? (2) How do I correctly prepare the taxonomic annotations file passed via -t? What should I do if a certain level is missing? Currently I just put an empty string there.

amkozlov commented 11 months ago

Sorry for the late response.

SATIVA requires a balanced taxonomy, i.e. all sequences should ideally have the same number of taxonomic levels, or at least there should be no 'holes' or empty labels.

In your example

Metazoa;Chordata;Actinopteri;;Lactariidae;Lactarius;Lactarius lactarius

should be replaced by

Metazoa;Chordata;Actinopteri;Perciformes;Lactariidae;Lactarius;Lactarius lactarius

according to Wikipedia

If a label is genuinely missing at some rank because the taxonomy is unbalanced, you can introduce an artificial (but non-empty!) label, e.g. PerciformesFamily1.
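For a larger taxonomy file, the placeholder labels could be generated with a small script instead of being edited by hand. The following is only a minimal sketch, not part of SATIVA: the function name fill_missing_ranks and the naming scheme (parent label plus "Rank" plus depth) are assumptions made for illustration, and it only touches the semicolon-separated taxonomy path, not the rest of the file.

```python
def fill_missing_ranks(lineage, sep=";"):
    """Replace empty labels in a taxonomy path with artificial, non-empty ones.

    Hypothetical helper: the placeholder is built from the nearest
    preceding label and the 1-based depth of the missing rank.
    """
    labels = [label.strip() for label in lineage.split(sep)]
    filled = []
    parent = "Root"  # assumed fallback if the very first label is empty
    for depth, label in enumerate(labels, start=1):
        if not label:
            label = f"{parent}Rank{depth}"
        filled.append(label)
        parent = label
    return sep.join(filled)


path = "Metazoa;Chordata;Actinopteri;;Lactariidae;Lactarius;Lactarius lactarius"
print(fill_missing_ranks(path))
```

Note that this scheme gives the same placeholder to every sequence that shares a parent and lacks the same rank, which may group unrelated taxa together. Where the real label is known (as with Perciformes above), using it is preferable.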

Finally, please add the -x zoo option to specify that you are using the zoological taxonomic code.