benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0

Coding missing ranks for RDP classifier #561

Closed shaunpwilkinson closed 6 years ago

shaunpwilkinson commented 6 years ago

I'm trying to wrangle NCBI reference sequences into a suitable format for the DADA2 implementation of the RDP Naive Bayes Classifier, and wondering how to code lineages that are missing one or more of the main taxonomic ranks. For example, the full NCBI lineage for accession number AB028237 is:

rank          name
no rank       root
no rank       cellular organisms
superkingdom  Eukaryota
no rank       Opisthokonta
kingdom       Metazoa
no rank       Eumetazoa
no rank       Bilateria
no rank       Protostomia
no rank       Lophotrochozoa
phylum        Mollusca
class         Gastropoda
subclass      Heterobranchia
no rank       lower Heterobranchia
superfamily   Acteonoidea
family        Acteonidae
genus         Pupa
species       Pupa strigosa

which has no order rank. Is there a special character or some other naming convention that should be used in the semicolon-delimited string for these entries? For example "Eukaryota;Mollusca;Gastropoda;;Acteonidae;Pupa;Pupa strigosa" or "Eukaryota;Mollusca;Gastropoda;-;Acteonidae;Pupa;Pupa strigosa". I see some reference databases simply omit the missing ranks, but I'm concerned that this would throw the bins out of alignment (i.e. end up lumping genera in with families, etc) when the classifier is trained.

benjjneb commented 6 years ago

> I see some reference databases simply omit the missing ranks, but I'm concerned that this would throw the bins out of alignment (i.e. end up lumping genera in with families, etc) when the classifier is trained.

You are right: there needs to be an entry for each rank included in the assignTaxonomy training fasta, or the rank assignments will be thrown off.

> Is there a special character or some other naming convention that should be used in the semicolon-delimited string for these entries?

You can use whatever you like. If you want to sound like a fancy taxonomist, you can use Incertae sedis.
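For anyone scripting this, here is a minimal sketch of filling missing ranks with a placeholder so that every lineage string has the same number of fields. It is written in Python purely for illustration and is not part of DADA2. The `RANKS` list, the `format_lineage` helper and the `placeholder` argument are hypothetical names, and the seven-rank scheme is an assumption taken from the example strings in the question rather than a DADA2 requirement.

```python
# Ranks to keep, in order. Assumption: the seven-level scheme used in the
# example strings from the question above.
RANKS = ["superkingdom", "phylum", "class", "order", "family", "genus", "species"]


def format_lineage(lineage, placeholder="Incertae sedis"):
    """Build a semicolon-delimited lineage with one entry per rank in RANKS.

    `lineage` is a list of (rank, name) pairs, e.g. parsed from an NCBI
    lineage table. Ranks not listed in RANKS are dropped, and ranks that
    are absent from the lineage are filled with `placeholder`.
    """
    by_rank = {rank: name for rank, name in lineage if rank in RANKS}
    return ";".join(by_rank.get(rank, placeholder) for rank in RANKS)


# Full NCBI lineage for accession AB028237, as listed in the question.
lineage = [
    ("no rank", "root"),
    ("no rank", "cellular organisms"),
    ("superkingdom", "Eukaryota"),
    ("no rank", "Opisthokonta"),
    ("kingdom", "Metazoa"),
    ("no rank", "Eumetazoa"),
    ("no rank", "Bilateria"),
    ("no rank", "Protostomia"),
    ("no rank", "Lophotrochozoa"),
    ("phylum", "Mollusca"),
    ("class", "Gastropoda"),
    ("subclass", "Heterobranchia"),
    ("no rank", "lower Heterobranchia"),
    ("superfamily", "Acteonoidea"),
    ("family", "Acteonidae"),
    ("genus", "Pupa"),
    ("species", "Pupa strigosa"),
]

print(format_lineage(lineage))
# Eukaryota;Mollusca;Gastropoda;Incertae sedis;Acteonidae;Pupa;Pupa strigosa
```

The missing order is filled in rather than skipped, so family, genus and species stay in the fifth, sixth and seventh fields for every sequence. Passing `placeholder="-"` would give the second variant from the question.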

shaunpwilkinson commented 6 years ago

Thank you very much, that's a big help.