Closed shaunpwilkinson closed 6 years ago
> I see some reference databases simply omit the missing ranks, but I'm concerned that this would throw the bins out of alignment (i.e. end up lumping genera in with families, etc) when the classifier is trained.
You are right: there needs to be an entry for each rank included in the assignTaxonomy training fasta, or it will be thrown off.
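As a rough illustration of that point, here is a minimal Python sketch that flags headers whose number of semicolon-delimited fields differs from the expected number of ranks. The function name `misaligned_headers`, the default of seven ranks (taken from the example in the question), and the dummy `ACGT` sequences are all assumptions made for illustration; none of this is part of DADA2.

```python
def misaligned_headers(fasta_lines, n_ranks=7, sep=";"):
    """Return (header, field_count) pairs for headers that do not have n_ranks fields."""
    bad = []
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # sequence line, not a header
        text = line[1:].strip()
        # Tolerate a single trailing separator, if the file uses one.
        if text.endswith(sep):
            text = text[:-1]
        fields = text.split(sep)
        if len(fields) != n_ranks:
            bad.append((text, len(fields)))
    return bad


# Hypothetical two-record file: the second header omits the order rank entirely.
lines = [
    ">Eukaryota;Mollusca;Gastropoda;Incertae sedis;Acteonidae;Pupa;Pupa strigosa",
    "ACGT",
    ">Eukaryota;Mollusca;Gastropoda;Acteonidae;Pupa;Pupa strigosa",
    "ACGT",
]
problems = misaligned_headers(lines)
for header, count in problems:
    print(f"{count} fields instead of 7: {header}")
```

Only the second header is reported, because dropping the order rank shifts the family, genus and species names one position to the left.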
> Is there a special character or some other naming convention that should be used in the semicolon-delimited string for these entries?
You can use whatever you like. If you want to sound like a fancy taxonomist, you can use Incertae sedis.
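For what it's worth, a minimal Python sketch of that approach might look like the following. The rank list, the function name `lineage_string`, and the dictionary input are assumptions for illustration (the rank order is inferred from the seven-level example in the question); the placeholder can be any string you prefer.

```python
# Assumed rank order, inferred from the seven-level example in the question.
RANKS = ["superkingdom", "phylum", "class", "order", "family", "genus", "species"]


def lineage_string(lineage, placeholder="Incertae sedis", sep=";"):
    """Join one field per rank, substituting a placeholder for any missing rank."""
    fields = []
    for rank in RANKS:
        name = lineage.get(rank)
        if name and name.strip():
            fields.append(name.strip())
        else:
            fields.append(placeholder)
    return sep.join(fields)


# The lineage from the question, which has no order rank.
example = {
    "superkingdom": "Eukaryota",
    "phylum": "Mollusca",
    "class": "Gastropoda",
    "family": "Acteonidae",
    "genus": "Pupa",
    "species": "Pupa strigosa",
}
print(lineage_string(example))
# prints: Eukaryota;Mollusca;Gastropoda;Incertae sedis;Acteonidae;Pupa;Pupa strigosa
```

Because every rank always contributes exactly one field, the family, genus and species names stay in the same positions whether or not the order is known.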
Thank you very much, that's a big help.
I'm trying to wrangle NCBI reference sequences into a suitable format for the DADA2 implementation of the RDP Naive Bayes Classifier, and wondering how to code lineages that are missing one or more of the main taxonomic ranks. For example, the full NCBI lineage for accession number AB028237 is:
which has no order rank. Is there a special character or some other naming convention that should be used in the semicolon-delimited string for these entries? For example "Eukaryota;Mollusca;Gastropoda;;Acteonidae;Pupa;Pupa strigosa" or "Eukaryota;Mollusca;Gastropoda;-;Acteonidae;Pupa;Pupa strigosa". I see some reference databases simply omit the missing ranks, but I'm concerned that this would throw the bins out of alignment (i.e. end up lumping genera in with families, etc) when the classifier is trained.