benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0
459 stars, 142 forks

Assigning taxonomy with silva_nr99_v_wSpecies_train_set #1122

Closed: ShailNair closed this issue 3 years ago

ShailNair commented 4 years ago

Hi, I see you have included the "silva_nr99_v138_wSpecies_train_set.fa.gz" database set. So, if we want to assign taxonomy down to species level in one go using this database, is the function "assignTaxonomy" appropriate, or do we need to change something?

PS: I tried to run "assignTaxonomy" using the above database, but my R session aborted (and I lost the analyzed data). I am not sure whether it is the code or some other unknown technical issue (my R does sometimes show such clumsiness).

benjjneb commented 4 years ago

> I see you have included the "silva_nr99_v138_wSpecies_train_set.fa.gz" database set. So, if we want to assign taxonomy down to species level in one go using this database, is the function "assignTaxonomy" appropriate, or do we need to change something?

If you have short-read 16S data, we do not recommend this method for species assignment. Instead, there is a specialized method, assignSpecies (or addSpecies), that is based on exact matching and is appropriate for species assignment with short-read 16S data.

If you have long-read 16S (e.g. full-length 16S data from something like PacBio HiFi sequencing) then assignTaxonomy with the new reference is appropriate.

> PS: I tried to run "assignTaxonomy" using the above database, but my R session aborted (and I lost the analyzed data). I am not sure whether it is the code or some other unknown technical issue (my R does sometimes show such clumsiness).

The memory requirements of the species-level database are higher than the genus-level database, which could cause such a problem.
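Since a crash at this step can take the whole session with it, one precaution is to write intermediate results to disk before calling the memory-heavy function. The following is only a minimal sketch using base R; the tiny `seqtab` matrix and the file name `seqtab_checkpoint.rds` are made-up stand-ins for a real sequence table and path.

```r
# Hypothetical stand-in for a real sequence table (samples x sequences).
seqtab <- matrix(1:4, nrow = 2,
                 dimnames = list(c("sample1", "sample2"), c("ACGT", "TGCA")))

# Write the object to disk before starting the memory-heavy step.
checkpoint_path <- file.path(tempdir(), "seqtab_checkpoint.rds")
saveRDS(seqtab, checkpoint_path)

# After a crash, a fresh R session can reload it instead of recomputing.
restored <- readRDS(checkpoint_path)
stopifnot(identical(seqtab, restored))
```

In a real analysis the checkpoint would go in the project directory rather than `tempdir()`, so that it survives the R session.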

termithorbor commented 4 years ago

When I use silva_nr99_v_wSpecies_train_set with the addSpecies function, I get the following error: Incorrect reference file format for assignSpecies (this looks like a file formatted for assignTaxonomy).

I guess it is okay to use assignTaxonomy in this case? And what do short reads and long reads mean in this context? Is it a big problem to use assignTaxonomy with silva_nr99_v_wSpecies_train_set, rather than the addSpecies function, when dealing with Illumina MiSeq reads?

nr0cinu commented 4 years ago

Hi @termithorbor!

Illumina MiSeq data is short-read data, so you should use silva_nr99_v138_train_set.fa.gz with assignTaxonomy(), followed by silva_species_assignment_v138.fa.gz with addSpecies().

Do not use silva_nr99_v138_wSpecies_train_set.fa.gz with Illumina MiSeq data.
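The two-step approach above can be sketched roughly as follows. This is an illustration, not an official recipe: it assumes dada2 is installed and that both reference files sit in the working directory. The wrapper name `assign_short_read_taxonomy` and the input name `seqtab.nochim` are placeholders.

```r
# Sketch of the short-read workflow: genus-level classification first,
# then species added by exact matching.
assign_short_read_taxonomy <- function(
    seqtab,
    train_set   = "silva_nr99_v138_train_set.fa.gz",
    species_set = "silva_species_assignment_v138.fa.gz",
    multithread = FALSE) {
  # Step 1: classify each sequence down to genus level.
  taxa <- dada2::assignTaxonomy(seqtab, train_set, multithread = multithread)
  # Step 2: add a species only where a sequence exactly matches a reference.
  dada2::addSpecies(taxa, species_set)
}

# Usage (seqtab.nochim is a placeholder for your chimera-free sequence table):
# taxa <- assign_short_read_taxonomy(seqtab.nochim)
```

Defining the wrapper does not load dada2 or touch the reference files; both are only needed when the function is called.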

Best, Bela

termithorbor commented 4 years ago

But will this give different results than the old v138 db?

nr0cinu commented 4 years ago

Version 2 should give you equal or improved classification results compared to version 1 of silva_nr99_v138_train_set.fa.gz.

See https://zenodo.org/record/3986799:

> Version 2 removes the dependence on preprocessed files from mothur, which results in a greater number of bacterial and archaeal sequences.

termithorbor commented 4 years ago

Does that also hold when I use the old version for species assignment?

And any suggestions as to why I get this error when running assignTaxonomy? Error in C_assign_taxonomy2(seqs, rc(seqs), refs, ref.to.genus, tax.mat.int, : random_device: rdseed failed

I already tried multithread=FALSE, but that does not help either.

But if I try it often enough, it suddenly works.
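The "try it often enough" workaround can be automated with a small retry helper. This is only a sketch in base R: the helper `retry` is a hypothetical name, not part of dada2, and it does not address whatever makes `rdseed` fail in the first place.

```r
# Call f() until it succeeds or max_tries attempts have failed.
retry <- function(f, max_tries = 5, wait = 1) {
  stopifnot(max_tries >= 1)
  for (attempt in seq_len(max_tries)) {
    # Capture an error as a value instead of aborting.
    result <- tryCatch(f(), error = function(e) e)
    if (!inherits(result, "error")) return(result)
    message(sprintf("Attempt %d failed: %s", attempt, conditionMessage(result)))
    if (attempt < max_tries) Sys.sleep(wait)
  }
  # All attempts failed: re-signal the last error.
  stop(result)
}

# Usage (assumes dada2 is installed; seqtab.nochim is a placeholder name):
# taxa <- retry(function()
#   dada2::assignTaxonomy(seqtab.nochim, "silva_nr99_v138_train_set.fa.gz"))
```

Because every failed attempt repeats the full computation, a low `max_tries` keeps the worst-case runtime bounded.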