Coding missing ranks for RDP classifier

shaunpwilkinson commented 6 years ago

I'm trying to wrangle NCBI reference sequences into a suitable format for the DADA2 implementation of the RDP Naive Bayes Classifier, and wondering how to code lineages that are missing one or more of the main taxonomic ranks. For example, the full NCBI lineage for accession number AB028237 is:

rank	name
no rank	root
no rank	cellular organisms
superkingdom	Eukaryota
no rank	Opisthokonta
kingdom	Metazoa
no rank	Eumetazoa
no rank	Bilateria
no rank	Protostomia
no rank	Lophotrochozoa
phylum	Mollusca
class	Gastropoda
subclass	Heterobranchia
no rank	lower Heterobranchia
superfamily	Acteonoidea
family	Acteonidae
genus	Pupa
species	Pupa strigosa

which has no order rank. Is there a special character or some other naming convention that should be used in the semicolon-delimited string for these entries? For example "Eukaryota;Mollusca;Gastropoda;;Acteonidae;Pupa;Pupa strigosa" or "Eukaryota;Mollusca;Gastropoda;-;Acteonidae;Pupa;Pupa strigosa". I see some reference databases simply omit the missing ranks, but I'm concerned that this would throw the bins out of alignment (i.e. end up lumping genera in with families, etc) when the classifier is trained.

benjjneb commented 6 years ago

> I see some reference databases simply omit the missing ranks, but I'm concerned that this would throw the bins out of alignment (i.e. end up lumping genera in with families, etc) when the classifier is trained.

You are right: every lineage needs an entry for each rank included in the assignTaxonomy training fasta, or the ranks will be thrown out of alignment.
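A quick way to spot this problem before training is to count the semicolon-delimited fields in each header. The following is only a minimal Python sketch, not part of dada2: the function name `rank_counts`, the placeholder sequences, and the handling of an optional trailing semicolon are all assumptions made for illustration.

```python
from collections import Counter

def rank_counts(fasta_lines):
    """Tally header lines by their number of semicolon-delimited fields.

    Hypothetical helper: a consistent training file should yield a
    single key (one field count shared by every header).
    """
    counts = Counter()
    for line in fasta_lines:
        if line.startswith(">"):
            # Drop the leading '>' and surrounding whitespace.
            header = line[1:].strip()
            # Assumption: a single trailing ';' is a terminator, not an empty rank.
            if header.endswith(";"):
                header = header[:-1]
            counts[len(header.split(";"))] += 1
    return counts

# Placeholder records: the first header omits the order, the second fills it in.
example = [
    ">Eukaryota;Mollusca;Gastropoda;Acteonidae;Pupa;Pupa strigosa",
    "ACGT",
    ">Eukaryota;Mollusca;Gastropoda;Incertae sedis;Acteonidae;Pupa;Pupa strigosa",
    "ACGT",
]
counts = rank_counts(example)
```

If `counts` has more than one key, some lineages have fewer fields than others and their ranks will not line up.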

> Is there a special character or some other naming convention that should be used in the semicolon-delimited string for these entries?

You can use whatever you like. If you want to sound like a fancy taxonomist, you can use Incertae sedis.
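As a rough illustration of filling the gap, here is a minimal Python sketch that turns the rank/name table from the question into a fixed-length lineage string. The seven-rank list matches the example strings in the question; the function name `lineage_string` and the choice of "Incertae sedis" as the default placeholder are assumptions for this example, not anything dada2 requires.

```python
# Ranks to keep, in order, matching the example strings in the question.
RANKS = ["superkingdom", "phylum", "class", "order", "family", "genus", "species"]

def lineage_string(lineage, ranks=RANKS, placeholder="Incertae sedis"):
    """Build a semicolon-delimited lineage with one field per rank.

    `lineage` is a list of (rank, name) pairs; ranks not listed in
    `ranks` (including "no rank") are dropped, and any missing rank
    is filled with `placeholder`.
    """
    by_rank = {rank: name for rank, name in lineage if rank in ranks}
    return ";".join(by_rank.get(rank, placeholder) for rank in ranks)

# The NCBI lineage for AB028237 as given in the question.
ab028237 = [
    ("no rank", "root"),
    ("no rank", "cellular organisms"),
    ("superkingdom", "Eukaryota"),
    ("no rank", "Opisthokonta"),
    ("kingdom", "Metazoa"),
    ("no rank", "Eumetazoa"),
    ("no rank", "Bilateria"),
    ("no rank", "Protostomia"),
    ("no rank", "Lophotrochozoa"),
    ("phylum", "Mollusca"),
    ("class", "Gastropoda"),
    ("subclass", "Heterobranchia"),
    ("no rank", "lower Heterobranchia"),
    ("superfamily", "Acteonoidea"),
    ("family", "Acteonidae"),
    ("genus", "Pupa"),
    ("species", "Pupa strigosa"),
]

result = lineage_string(ab028237)
# result == "Eukaryota;Mollusca;Gastropoda;Incertae sedis;Acteonidae;Pupa;Pupa strigosa"
```

Passing a different `placeholder` (for example `"-"`) gives the other form suggested in the question; the point is only that every lineage ends up with the same number of fields.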

shaunpwilkinson commented 6 years ago

Thank you very much, that's a big help.

benjjneb / dada2

Coding missing ranks for RDP classifier #561