benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0

Function assignTaxonomy is not classifying sequences based on a custom database and generating many NA #1626

Closed edersonjesus closed 1 year ago

edersonjesus commented 1 year ago

Hello everyone! Hello @benjjneb !

I am analyzing Arbuscular Mycorrhizal Fungi (AMF) 18S rRNA sequences. For that, I am using the Maarjam database (https://maarjam.ut.ee). I downloaded it, created my own dada2-compatible database (you can see it in the attached file if you wish), and ran assignTaxonomy. I got lots of NAs in the output object, including for the most abundant ASVs. I blasted some of these NAs. Here is one:

AGCAGCCGCGGTAATTCCAGCTCCAATAGCGTATATTAAAGTTGTTGCAGTTAAAAAGCTCGTAGTTGAATTTCGGGGTCAGCAGGTTGGTCGTGCCAATGGTATGCACTGGCCTTGCTGATTCCTCCCTCTTGTAGAACCGTAATGCCATTAAGTTGGTGTTGCGGGGAAACAGGACTGTTACTTTGAAAAAATTAGAGTGTTTAAAGCAGGCTAACGCCTGAATACATTAGCATGGAATAATGAAATAGGACGATCGATCCTATTTTGTTGGTTTCTA

I got matches to AMF, and even 100% coverage and maximum identity with a sequence in the Maarjam database. That is, the sequences should have been classified, but they were not. Somehow dada2 is not identifying them, and I wonder what is happening and whether there is something I can do to solve this issue.

Here are the commands I am using. Previous commands are similar to the dada2 tutorial:

maarjam <- "maarjam_dada2.fasta"
taxa <- assignTaxonomy(seqtab.nochim, maarjam, tryRC = TRUE, taxLevels = c("Class", "Order", "Family", "Genus", "Species"), multithread = FALSE)

I've got this warning message, which may explain what is going on:

Warning message: In matrix(unlist(strsplit(genus.unq, ";")), ncol = td, byrow = TRUE) : data length [1569] is not a sub-multiple or multiple of the number of rows [314]

I saw a previous call about this type of message, but I still don't understand what is going on.
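The numbers in the warning give a hint: a matrix of 314 rows with one column per entry in taxLevels (5) would need 1570 values, but only 1569 were found. That suggests the reference headers do not all split into the same number of ";"-separated fields. Below is a minimal base-R sketch of one way to spot such headers. The helper `count_ranks` and the example headers are made up for illustration and are not part of dada2.

```r
# Hypothetical helper (not a dada2 function): count the ";"-separated
# ranks in each taxonomy header line.
count_ranks <- function(headers) {
  # Drop the leading ">" if present. strsplit() ignores a trailing ";",
  # so "A;B;" and "A;B" both count as two ranks.
  clean <- sub("^>", "", headers)
  lengths(strsplit(clean, ";", fixed = TRUE))
}

# Made-up headers for illustration; the second one lacks its last rank.
headers <- c(
  ">ClassA;OrderA;FamilyA;GenusA;SpeciesA;",
  ">ClassA;OrderA;FamilyA;GenusA;"
)

n_ranks <- count_ranks(headers)
print(table(n_ranks))          # how many headers have each rank count
print(headers[n_ranks != 5])   # headers that do not have 5 ranks
```

With a real reference file, the headers could be taken from `readLines("maarjam_dada2.fasta")` by keeping only the lines that start with ">".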

Thanks!

Ed

maarjam_dada2.txt

edersonjesus commented 1 year ago

I found out that I didn't add a semicolon at the end of each taxonomy string, such as in "Archaeosporomycetes;Archaeosporales;Archaeosporaceae;Archaeospora;Wirsel OTU21;".

After adding it to the names of all sequences, the warning message doesn't appear anymore and the hitherto unclassified ASVs are properly classified.
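For anyone who wants to apply the same fix without editing the file by hand, here is a minimal base-R sketch, assuming a FASTA file in which every header line starts with ">" and has no trailing whitespace. The helper `add_trailing_semicolon` is a made-up name and the second record is invented for illustration.

```r
# Hypothetical helper (not a dada2 function): append ";" to every
# FASTA header line that does not already end with one.
add_trailing_semicolon <- function(lines) {
  is_header <- startsWith(lines, ">")
  needs_fix <- is_header & !endsWith(lines, ";")
  lines[needs_fix] <- paste0(lines[needs_fix], ";")
  lines
}

# Small in-memory example; the sequences are placeholders.
fasta <- c(
  ">Archaeosporomycetes;Archaeosporales;Archaeosporaceae;Archaeospora;Wirsel OTU21",
  "ACGTACGT",
  ">ClassA;OrderA;FamilyA;GenusA;SpeciesA;",
  "TTGGCCAA"
)

fixed <- add_trailing_semicolon(fasta)
print(fixed[startsWith(fixed, ">")])
```

On a real file this would be wrapped as `writeLines(add_trailing_semicolon(readLines(infile)), outfile)`, where `infile` and `outfile` are placeholder paths.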

Thanks!

abu85 commented 1 year ago

Hi @edersonjesus, nice to hear that you were able to figure that out. Could you please share the code you used to make your 'own dada2-compatible database'? I also intend to make a dada2-compatible Maarjam database to assign taxonomy.

edersonjesus commented 1 year ago

Hi! It has been a while since I did that, but here is the code. See if it works for you. I tried it again quickly, and it worked for me. You will also find a fasta file with the sequences, a VT type file with the taxonomic information, and a file with the composite names that I created manually based on the VT type file.

I found useful information here: https://github.com/benjjneb/dada2/issues/581 and https://benjjneb.github.io/dada2/training.html
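As a rough illustration of the header format discussed in this thread (ranks joined by ";" with a final ";"), here is a minimal base-R sketch that builds such a FASTA from a taxonomy table and a set of sequences. This is not the code attached above: the table, its column names, the sequence IDs, and the sequences are all made up, and the real Maarjam files will need their own parsing.

```r
# Made-up taxonomy table; real column names and contents will differ.
tax <- data.frame(
  id      = c("seq1", "seq2"),
  Class   = c("ClassA", "ClassB"),
  Order   = c("OrderA", "OrderB"),
  Family  = c("FamilyA", "FamilyB"),
  Genus   = c("GenusA", "GenusB"),
  Species = c("SpeciesA", "SpeciesB"),
  stringsAsFactors = FALSE
)

# Made-up sequences, named by the same hypothetical IDs.
seqs <- c(seq1 = "ACGTACGT", seq2 = "TTGGCCAA")

ranks <- c("Class", "Order", "Family", "Genus", "Species")

# Join the ranks with ";" and end every header with a final ";".
headers <- paste0(">", do.call(paste, c(tax[ranks], sep = ";")), ";")

# Interleave header and sequence lines: header1, seq1, header2, seq2, ...
fasta <- as.vector(rbind(headers, unname(seqs[tax$id])))

print(fasta)
```

The result can be written with `writeLines(fasta, outfile)` (with `outfile` a placeholder path) and then passed to assignTaxonomy with a matching taxLevels vector, as in the call at the top of this issue.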

Hope that helps!

Cheers,

Ed

nomes_compostos.csv vt_types_from_05-06-2019.xls vt_types_fasta_from_05-06-2019.txt

edersonjesus commented 1 year ago

Don't forget to add the final ; to the name of each sequence. Cheers!