dada2 assignTaxonomy and large, custom training sets

morien commented 4 years ago

I'm a bioinformatician with a research group studying - broadly - microbial ecology. We're doing lots of metabarcoding experiments, so in 2018 I created two custom taxonomy assignment training sets to be used with dada2's assignTaxonomy function for my lab. The first is an 18s dataset derived from the SILVA db that can be found here: https://zenodo.org/record/1447330#.XsQrIFNKiL4

I also created an equivalent training set from 16s sequences in SILVA, also clustered at 99% similarity*. The idea was we would have a larger pool of sequences for training and assignment for our lab's 16s metabarcoding experiments, and hopefully finer-grained ability to assign taxonomy to ASVs. This worked well enough for the last couple years, even though it often required machines with >200GB of RAM to run. We have access to machines that satisfy the compute requirements, so it isn't a problem in technical terms. I used this training set to successfully assign taxonomy to a handful of 16s metabarcoding datasets in 2018 and 2019.

At the beginning of the year, a new student in our lab started having trouble with this method of taxonomy assignment, where they would only get assignments at kingdom and phylum level, barring a handful of ASVs that had assignments up to order or family level. The rest of their assignments were "NA"s. Since they were a new coder, I spent quite a bit of time with them eliminating the possibility that it was a problem with their input files, with their code, or with the environment on the local or remote machine. We even did fresh installs of R, dada2, and all dependencies after that, on both MacOS and linux (Scientific Linux) systems. Since then, all the other students in the lab processing 16s data have run into the same issue at some point. At that point we reverted to using the version of SILVA provided by the dada2 devs for our 16s experiments. Last month I tested assignTaxonomy with our custom training set on a brand new system (CentOS), with fresh installs of R, dada2, and dependencies, and this system reproduces the same output the students do (mostly NAs as assignments beyond the kingdom or phylum level). At this point, I am relatively certain that the issue is that the bootstrap values produced by RDP aren't passing the default threshold in the assignTaxonomy() function and thus we get NA assignments for most ranks in most ASVs.

Since this is relatively new behaviour I wanted to ask the dada2 developers, have there been any changes to the assignTaxonomy() function or your implementation of RDP in the last year or so that might be linked to this change? This is my current best guess for why we have observed a change in output, given that the input 16s training sequence set has not been modified since 2018. It seems most plausible to me, at the moment, that changes in the assignTaxonomy() function itself might produce different output than before.

And additionally, can you think of any reason why RDP wouldn't be suited to larger training sets like the one I constructed, other than the fact that it would use quite a bit of memory?

the 16s training set quite a bit larger than the 18s training set at the zenodo link. uncompressed, the 16s set is ~850mb while the 18s train set is ~120mb

benjjneb commented 4 years ago

Since this is relatively new behaviour I wanted to ask the dada2 developers, have there been any changes to the assignTaxonomy() function or your implementation of RDP in the last year or so that might be linked to this change? This is my current best guess for why we have observed a change in output, given that the input 16s training sequence set has not been modified since 2018. It seems most plausible to me, at the moment, that changes in the assignTaxonomy() function itself might produce different output than before.

The one change that could have happened was related to the way identical matches were handled, see commit 2317b9bce4bbf51984a7399fdc66f983f45df646

This shoiuld be straightforward to test. What I recommend is to run the assignTaxonomy code, same code, but using both the version of dada2 prior to this change (e..g 1.12) and a current verison. This is pretty easy to work into code by using devtools:install_github("benjjneb/dada2", ref="v1.12") to install specific releases inline in the code. You can do the same with e..g v1.16.

So the question is, does this same only or not even assigned to the phylum level with your expanded reference databases happen with relase 1.12 as it does with later releases with that new code?

morien commented 4 years ago

Okay, so it took me a bit to get around to testing this, but yes, reverting to v1.12 from the current version does re-create the original behaviour (i.e. most ASVs are assigned taxonomy at genus or higher rank) . For identical data inputs, I now have a large majority of ASVs with assignments up to genus level, where with the latest versions that have that code change you mentioned, I have less than 1% of ASVs that have an assignment beyond phylum rank.

benjjneb commented 4 years ago

What I think this suggests is that in this database, there are usually multiple entries with different taxonomic assignments up to fairly high ranks that are identical over the best-matching sequence segment to many (most?) of your query ASVs.

To test this you can do the following (assuming enough memory to read in the whole reference file, and it is an exact hit in the database):

library(Biostrings)
library(ShorRead)
query <- "ACGT" # CHANGE ME to one of the ASV sequences that are not getting assigned w/ new version
ref <- readFastq("path/to/ref_train_set.fa")
exacts <- vcountPattern(query, sread(ref), mismatch=0) > 0 # Can increment mismatch if no exact hits
table(exacts) # Multiple TRUEs?
id(ref)[exacts] # List of the taxonomic headers for all exact hits, multiple taxonomies?

benjjneb commented 4 years ago

Feel free to reopen with more information on this behavior.

morien commented 4 years ago

@benjjneb Thanks for following up with this, here's the output (with v1.12) of one ASV that doesn't get assigned, plus code:

query <- "sequence here"
ref <- readFasta("/path/to/db/CO1.BOLD_genbank_combined.rep_set.unaligned.dada2.fa.gz")
exacts <- vcountPattern(query, sread(ref), max.mismatch=0) > 0
table(exacts) # Multiple TRUEs?

exacts
  FALSE    TRUE 
2731023       4

id(ref)[exacts]

  A BStringSet instance of length 4
    width seq
[1]   124 Eukaryota;Opisthokonta;Metazoa;Art...alanus sp. PsAkkeshi1;CAISN387-13
[2]    96 Eukaryota;Opisthokonta;Metazoa;Art...anidae;Pseudocalanus;;CAISN566-13
[3]   126 Eukaryota;Opisthokonta;Metazoa;Art...anus newmani;GBA11526-13|JX502997
[4]   114 Eukaryota;Opisthokonta;Metazoa;Art...us;Pseudocalanus newmani;JX502997

as.data.frame(id(ref)[exacts])

1 Eukaryota;Opisthokonta;Metazoa;Arthropoda;Hexanauplia;Clausocalanidae;Pseudocalanus;Pseudocalanus sp. PsAkkeshi1;CAISN387-13
2 Eukaryota;Opisthokonta;Metazoa;Arthropoda;Hexanauplia;Clausocalanidae;Pseudocalanus;;CAISN566-13
3 Eukaryota;Opisthokonta;Metazoa;Arthropoda;Hexanauplia;Clausocalanidae;Pseudocalanus;Pseudocalanus newmani;GBA11526-13|JX502997
4 Eukaryota;Opisthokonta;Metazoa;Arthropoda;Hexanauplia;Clausocalanidae;Pseudocalanus;Pseudocalanus newmani;JX502997

and here is the assignment this ASV got in the previous iteration of things, with a newer version of dada2 (v1.16): Eukaryota Opisthokonta Metazoa Arthropoda Insecta NA NA NA NA

benjjneb commented 4 years ago

The exact hits have the same taxonomy down through the 7th taxonomic so I would expect to get taxonomic assignment down to (close) to that level. So I'm not sure why this wouldn't be getting assigned beyond the Phylum rank...

What are the results when you also output the bootstrap values? assignTaxonomy(..., outputBootstraps = TRUE)

Also, can you post a couple of the ASV sequences that are showing this behavior, so that I can try running assignTaxonomy on them myself (I'll use the database from the first post).

morien commented 4 years ago

Alright, I'm running two assignTaxonomy() jobs, one with dada2 v1.12 and one with the current version. Input data are identical. I'm asking for bootstrap values in both cases, so we can compare the output. This is with the 18s database (silva v132) that's referenced in the original post. I'm attaching a file 25_ASVs.txt which has 25 ASVs from this dataset that you can play around with yourself. I will tell you which ones are the ones that exhibit that behaviour in a couple days (or once the two assign taxonomy jobs have finished, usually takes overnight...)

morien commented 4 years ago

@benjjneb alright, here's the results for the jobs I ran the other day. I pared them down to just the 25 ASVs attached in the last post I made. There are some interesting discrepancies. Several ASVs get assignment to the accession level (rank 8+) with 100% confidence in v 1.12, but in v 1.17 they are only assigned out to the 4th or 5th rank. There's also differences in the bootstrap values in many other ASVs that result in fewer assignments in the 6th, 7th, 8th ranks. taxonomy_table.dada2_v1.12.1.25_ASVs.txt taxonomy_table.dada2_v1.17.3.25_ASVs.txt

benjjneb commented 4 years ago

I did a quick investigation of one of the first major discrepancies, ASV8 in your 25, which had the following assignments:

Eukaryota   NA  NA  NA  NA  NA  NA  NA  100 18  18  18  18  18  18  18
Eukaryota   Archaeplastida  Rhodophyceae    Florideophycidae    Corallinophycidae   Corallina   Corallina_sp._SYK-2014  KM369038.1.1216 100 100 100 100 100 60  60  60 # 1.12.1

Code:

library(dada2); packageVersion("dada2")
path <- "~/Desktop/18S"
fn.tax <- file.path(path, "silva_132.18s.99_rep_set.dada2.fa")
sq1 <- "TGTAAGCGTATACCAAAGTTGTTGCAGTTAAAACGCTCGTAGTCGGACTTTGGCAATTCCGGATGTGTGCGCGTCGTGCGCGCACTTTGCAGGGGTTGCTTTTGTGGGCATGCGTGGGATGATGCGACTTTATTGTCGAACGCCCCTACGCAACCACCTTTTACTGTGAGAAAATTAGAGTGTTCAAAGCAGGCAATTGCCGTGAATACATTAGCATGGAATAATAGAATAGGACTCGGTCTTATTTTGTTGGTTTGTTAGGGCTGAGTAATGATTAAGAGGGACAGTTGGGGGCATTTGTATTCAGGCGCTAGAGGTGAAATTCTTAGATTGCCGGAAGACAAACTGCTGCGAAAGCGTCTGCCAAGGATGTTTTCATTGATCAAGAACGAAAGTAAGGGGAT"
ref <- getSequences(fn.tax)
ham <- nwhamming(sq1, ref, vec=TRUE) # Takes about a minute

sort(ham)[1:20]

[1] 3 3 5 5 5 5 5 5 6 7 7 7 7 11 12 13 15 15 15 16

names(ref)[order(ham)[1:12]]

 [1] "Eukaryota;Archaeplastida;Rhodophyceae;Florideophycidae;Corallinophycidae;Corallina;Corallina_sp._SYK-2014;KM369038.1.1216;"                   
 [2] "Eukaryota;Archaeplastida;Rhodophyceae;Florideophycidae;Corallinophycidae;Corallina_sp._ASD200;Corallina_sp._ASD200;EF628230.1.1537;"          
 [3] "Eukaryota;Archaeplastida;Rhodophyceae;Florideophycidae;Corallinophycidae;Corallina;uncultured_Corallinales;GQ917383.1.1787;"                  
 [4] "Eukaryota;Archaeplastida;Rhodophyceae;Florideophycidae;Corallinophycidae;Crusticorallina_adhaerens;Crusticorallina_adhaerens;KU983204.1.1530;"
 [5] "Eukaryota;Archaeplastida;Rhodophyceae;Florideophycidae;Corallinophycidae;Crusticorallina_painei;Crusticorallina_painei;KU983210.1.1530;"      
 [6] "Eukaryota;Archaeplastida;Rhodophyceae;Florideophycidae;Corallinophycidae;Corallina;Crusticorallina_painei;KX036713.1.1673;"                   
 [7] "Eukaryota;Archaeplastida;Rhodophyceae;Florideophycidae;Corallinophycidae;Crusticorallina_painei;Crusticorallina_painei;KU983209.1.1530;"      
 [8] "Eukaryota;Excavata;Metamonada;Fornicata;Giardiinae;Giardia;Giardia_intestinalis;AF199449.1.1420;"                                             
 [9] "Eukaryota;Archaeplastida;Rhodophyceae;Florideophycidae;Corallinophycidae;Bossiella;Chiharaea_bodegensis;KC157576.1.1738;"                     
[10] "Eukaryota;Archaeplastida;Rhodophyceae;Florideophycidae;Corallinophycidae;Ellisolandia;Ellisolandia_elongata;FM180095.1.1681;"                 
[11] "Eukaryota;Archaeplastida;Rhodophyceae;Florideophycidae;Corallinophycidae;Bossiella;Bossiella_orbigniana;U60746.1.1702;"                       
[12] "Eukaryota;Archaeplastida;Rhodophyceae;Florideophycidae;Corallinophycidae;Ellisolandia;Ellisolandia_elongata;EU095607.1.1787;"

What is notable, is that in that low divergence set of sequeneces in your reference database, there is a reference seqeuence from Giardia ([8]), from Phylum Excavata, that is also included. This reference sequences appears to effectively match as well as any of the other close sequences, at least once the query is chopped up into 8-mers as is done by the naive Bayesian classifier. In previous versions, "ties" in which bootstraps had equally good matches to multiple reference sequences were not handled properly, and the first entry was always taken as the representative of the tie. Now, a random member of the group of reference sequences equally best matching the query is taken (as it should be).

In this example, that is resulting in what appears to be a correct (according to the goals of the naive Bayesian classifier) non-assignment below the Kingdom level, because it is in fact ambiguous below that level based on your reference sequences, as driven by this Giardia entry.

Something to consider -- could that Giardia sequence be in error? By its nature, the updated version of assignTaxonomy is both more technically correct, and also more sensitive to mislabeled reference database entries.

morien commented 4 years ago

That's an interesting case, because giardia has a weird, short, V4 region of the 18S gene (that's the amplicon region we are using in this example data). If memory serves it's considerably shorter than most other eukaryotic organisms.

I checked this database entry by accession (the taxonomy is "correct" - it's labeled as a giardia sequence in genbank) and by sequence (the sequence in my database blasts to the same giardia accession at 100% identity). I also blasted the sequence that's in NCBI associated with this accession, and all top hits are also giardia. Genbank has lots of errors, but I don't think this is one of them.

One more weird thing. A priori, I would expect the giardia match, even if it is coincidentally as good as all the coralline algae matches, to cover (overall) a shorter portion of the query sequence, because giardia's V4 region is so short.

edit: Sorry I should have added, my question now is: Are these hits all ties? The hamming distance is 3 (the lowest score) for the top two hits, which are both Corallinophycidae species (as an aside, there appears to be an error in the genus label here, otherwise they would both also be labeled as genus Corallina). Shouldn't this sequence be assigned as "Eukaryota;Archaeplastida;Rhodophyceae;Florideophycidae;Corallinophycidae"? I can see that it's not, and it makes sense to you, but I don't understand.

benjjneb / dada2

dada2 assignTaxonomy and large, custom training sets #1032