benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0

silva_nr99_v138_train_set.fa.gz harbours all sequences of Silva NR 99? #1162

Closed: Marcel2907 closed this issue 3 years ago

Marcel2907 commented 3 years ago

Dear Benjamin,

First of all, thanks a lot for providing the DADA2 tutorial on assigning 16S rRNA-based taxonomy. I learned a lot and appreciate your work very much.

I have one question regarding the silva_nr99_v138_train_set.fa.gz file which is available on Zenodo. Does this file contain all microorganisms that are available in the current Silva V138 Nr99, or did you remove some sequences for having a smaller size? I am asking because I have around 20% of sequences that cannot be assigned to genus level, and I was wondering if this is happening because of the unknown microbes inside my sample or the Train_set file? And as a second question, would you say that the NR file of Silva in general is fine for performing analysis of microbial communities, or should I use the Parc file of Silva?

BTW, I used a PacBio full-length 16S sequencing approach.

Thanks again a lot for all your work!

Best regards,

Marcel

benjjneb commented 3 years ago

> Does this file contain all microorganisms that are available in the current Silva V138 Nr99, or did you remove some sequences for having a smaller size?

We removed almost all the Eukaryota (save for a random sample of 100 to serve as an outgroup). Essentially all Bacteria/Archaea entries were kept.
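For anyone who wants to check this on their own copy of the file, here is a minimal Python sketch that tallies reference entries by their top-level taxon. It assumes each FASTA header in the training set is a semicolon-separated taxonomy string such as `>Bacteria;Phylum;...;Genus;`; that layout and the function name `kingdom_counts` are assumptions made for illustration, not part of DADA2.

```python
import gzip
from collections import Counter


def kingdom_counts(fasta_gz_path):
    """Count reference entries per top-level taxon in a gzipped FASTA file."""
    counts = Counter()
    with gzip.open(fasta_gz_path, "rt") as handle:
        for line in handle:
            if not line.startswith(">"):
                continue  # skip sequence lines
            # Assumed header layout: ">Kingdom;Phylum;...;Genus;"
            ranks = [r for r in line[1:].strip().split(";") if r]
            counts[ranks[0] if ranks else "(empty header)"] += 1
    return counts
```

Calling `kingdom_counts("silva_nr99_v138_train_set.fa.gz")` on a downloaded copy should show the Eukaryota count as small relative to Bacteria and Archaea if the subsampling described above was applied.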

> I was wondering if this is happening because of the unknown microbes inside my sample or the Train_set file?

Could be. Another big factor is that a large number of entries only have taxonomy assigned down to the genus level, simply because bacterial systematics is still a ways away from naming all the species that exist.
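One rough way to see how deep the reference taxonomy goes is to count the number of named ranks per header. The Python sketch below does this for any iterable of header strings, again assuming semicolon-separated taxonomy headers; `rank_depth_histogram` is a hypothetical helper and the example headers are illustrative, not taken from the actual file.

```python
from collections import Counter


def rank_depth_histogram(headers):
    """Map number of named ranks -> number of headers with that many ranks."""
    hist = Counter()
    for header in headers:
        # Drop the leading ">" and empty fields left by a trailing ";"
        ranks = [r for r in header.lstrip(">").strip().split(";") if r]
        hist[len(ranks)] += 1
    return hist


# Illustrative headers only (assumed format, not copied from the file)
example_headers = [
    ">Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia-Shigella;",
    ">Bacteria;Firmicutes;Bacilli;",
]
depths = rank_depth_histogram(example_headers)
```

Entries that stop short of the deepest rank cannot contribute an assignment at that rank, which is one reason some reads remain unassigned even when a close reference sequence exists.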