benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0
459 stars 142 forks source link

Fully reproducible assignTaxonomy with random number initialization #1115

Open max-mose-diversigen opened 3 years ago

max-mose-diversigen commented 3 years ago

Hello,

When testing taxonomy assignment I noticed that the output would change if the order of the sequences in the sequence table was different (this apparently was due to the order in which the files containing our data was loaded). The order in terms of abundance was unchanged, though; for instance, if a sample contains a higher abundance of a given sequence that would be sorted before a sequence with a lower abundance, but there were differences in order when the abundances were the same. Not being familiar with the algorithm underpinning the assignTaxonomy function, would it make sense that these discrepancies in output could be attributable to this? Any help would be much appreciated!

Thanks!

benjjneb commented 3 years ago

assignTaxonomy makes assignments on a per-sequence basis, it will not be affected by the order of the ASVs in the sequence table at all.

max-mose-diversigen commented 3 years ago

Thanks for the clarification, it looks like my issue lies elsewhere then.

max-mose-diversigen commented 3 years ago

Updating this issue and re-opening with an example script where shuffling the ASVs input produces a different output when passed to assignTaxonomy:

library(dada2); packageVersion("dada2")
seqs <- c(
"GTGCCAGCAGCCGCGGTAATACGTAGGTGGCAAGCGTTATCCGGAATTATTGGGCGTAAAGCGCGCGTAGGCGGTTTTTTAAGTCTGATGTGAAAGCCCACGGCTCAACCGTGGAGGGTCATTGGAAACTGGAAAACTTGAGTGCAGAAGAGGAAAGTGGAATTCCATGTGTAGCGGTGAAATGCGTAGAGATCTGGAGGAATACCGGTGGCGAAGGCGGCCCCCTGGACGAAGACTGACGCTCAGGTGCGAAAGCGTGGGGAGCAAACAGGATTAGAAACCCGTGTAGTCC",
"GTGCCAGCAGCCGCGGTAATACGTAGGTGGCAAGCGTTGTCCGGATTTATTGGGCGTAAAGCGAGCGCAGGCGGTTTCTTAAGTCTGATGTGAAAGCCCCCGGCTCAACCGGGGAGGGTCATTGGAAACTGGGAGACTTGAGTGCAGAAGAGGAGAGTGGAATTCCATGTGTAGCGGTGAAATGCGTAGATATATGGAAGAACACCAGTGGCGAAGGCGGCTACCTGGTCTGCAACTGACGCTGAGACTCGAAAGCATGGGTAGCGAACAGGATTAGAAACCCTAGTAGTCC",
"GTGCCAGCCGCCGCGGTAATACGTAGGTGGCAAGCGTTATCCGGAATTATTGGGCGTAAAGCGCGCGTAGGCGGTTTTTTAAGTCTGATGTGAAAGCCCCCGGCTCAACCGGGGAGGGTCATTGGAAACTGGGAGACTTGAGTGCAGAAGAGGAGAGTGGAATTCCATGTGTAGCGGTGAAATGCGTAGATATATGGAGGAACACCAGTGGCGAAGGCGGCTCTCTGGTCTGTAACTGACGCTGAGGCTCGAAAGCGTGGGGAGCAAACAGGATTAGATACCCGTGTAGTCC",
"GTGCCAGCAGCCGCGGTAATACGGAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCGCGTAGGCGGTTTTTTAAGTCTGATGTGAAAGCCCACGGCTCAACCGTGGAGGGTCATTGGAAACTGGATAACTTGAGTGCAGAAGAGGGTAGTGGAACTCCATGTGTAGCGGTGGAATGCGTAGATATATGGAAGAACACCAGTGGCGAAGGCGGCTACCTGGTCTGCAACTGACGCTGAGACTCGAAAGCATGGGTAGCGAACAGGATTAGAAACCCGTGTAGTCC",
"GTGCCAGCAGCCGCGGTAATACGGAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCGCGTAGGCGGTTTTTTAAGTCTGATGTGAAAGCCCACGGCTCAACCGTGGAGGGTCATTGGAAACTGGATAACTTGAGTGCAGAAGAGGGTAGTGGAACTCCATGTGTAGCGGTGGAATGCGTAGATATATGGAAGAACACCAGTGGCGAAGGCGGCTACCTGGTCTGCAACTGACGCTGAGACTCGAAAGCATGGGTAGCGAACAGGATTAGAAACCCCTGTAGTCC",
"GTGCCAGCAGCCGCGGTAATACGTAGGTGGCAAGCGTTGTCCGGATTTATTGGGCGTAAAGCGAGCGCAGGCGGTTTCTTAAGTCTGATGTGAAAGCCCCCGGCTCAACCGGGGAGGGTCATTGGAAACTGGGAGACTTGAGTGCAGAAGAGGAGAGTGGAATTCCACGTGTAGCGGTGAAATGCGTAGATATGTGGAGGAACACCAGTGGCGAAGGCGACTCTCTGGTCTGTAACTGACGCTGAGGAGCGAAAGCGTGGGGAGCGAACAGGATTAGATACCCTGGTAGTCC",
"GTGCCAGCAGCCGCGGTAATACGGAGGGTGCAAGCGTTGTCCGGATTTATTGGGCGTAAAGCGCGCGCAGGCGGTCTTTTAAGTCTGATGTGAAAGCCCCCGGCTCAACCGGGGAGGGTCATTGGAAACTGGGGAACTTGAGTGCAGAAGAGGAGAGTGGAATTCCACGTGTAGCGGTGAAATGCGTAGAGATGTGGAGGAACACCAGTGGCGAAGGCGACTCTCTGGTCTGTAACTGACGCTGAGGAGCGAAAGCGTGGGGAGCGAACAGGATTAGATACCCCTGTAGTCC",
"GTGCCAGCAGCCGCGGTAATACGGAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCGCGTAGGCGGTTTTTTAAGTCTGATGTGAAAGCCCACGGCTCAACCGTGGAGGGTCATTGGAAACTGGATAACTTGAGTGCAGAAGAGGGTAGTGGAACTCCATGTGTAGCGGTGGAATGCGTAGATATATGGAAGAACACCAGTGGCGAAGGCGGCTACCTGGTCTGCAACTGACGCTGAGGCGCGAAAGCGTGGGGAGCAAACAGGATTAGATACCCTTGTAGTCC",
"GTGCCAGCAGCCGCGGTAATACGGAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCGCGTAGGCGGTTTTTTAAGTCTGATGTGAAAGCCCACGGCTCAACCGTGGAGGGTCATTGGAAACTGGATAACTTGAGTGCAGAAGAGGGTAGTGGAACTCCATGTGTAGCGGTGGAATGCGTAGATATATGGAAGAACACCAGTGGCGAAGGCGGCTACCTGGTCTGCAACTGACGCTGAGACTCGAAAGCATGGGTAGCGAACAGGATTAGATACCCTGGTAGTCC",
"GTGCCAGCAGCCGCGGTAATACGTAGGTGGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCGCGTAGGTGGTTCAGCAAGTTGGATGTGAAATCCCCGGGCTCAACCTGGGAACTGCATTCGAAACTGGCAGGCTTGAGTCTTGTAGAGGGGGGTAGAATTCCAGGTGTAGCGGTGAAATGCGTAGAGATCTGGAGGAATACCGGTGGCGAAGGCGGCCCCCTGGACAAAGACTGACGCTCAGGTGCGAAAGCGTGGGGAGCAAACAGGATTAGAAACCCGAGTAGTCC")

set.seed(100) # Initialize random number generator for reproducibility
taxa <- assignTaxonomy(seqs, "rdp_train_set_16.fa.gz", multithread=FALSE)
for (x in seq(10)) {
  shuffled_seqs <- sample(seqs)
  taxa_repeat <- assignTaxonomy(shuffled_seqs, "rdp_train_set_16.fa.gz", multithread=FALSE)
  print(identical(sort(taxa), sort(taxa_repeat)))
}

Running this script, I see an output of:

[1] ‘1.16.0’
[1] FALSE
[1] FALSE
[1] FALSE
[1] FALSE
[1] FALSE
[1] FALSE
[1] FALSE
[1] TRUE
[1] FALSE
[1] FALSE

Is that outcome to be expected? If not, I'd be curious to know why this might be happening.

zanieb commented 3 years ago

Some additional notes:

benjjneb commented 3 years ago

Is that outcome to be expected? If not, I'd be curious to know why this might be happening.

Importantly, there is an element of randomness in the naive Bayesian classifier method that comes from selecting the subsamples of the sequence over which boot strapping assignment is performed.

Also important, due to an issue with R code not being thread safe, this bit of randomness is not fully controlled by set.seed(...), so even when doing this you can get random differences in the confidence of assignments as some levels. This shoudl typically be at relatively resolved taxonomic levels though.

zanieb commented 3 years ago

Is that outcome to be expected? If not, I'd be curious to know why this might be happening.

Importantly, there is an element of randomness in the naive Bayesian classifier method that comes from selecting the subsamples of the sequence over which boot strapping assignment is performed.

Also important, due to an issue with R code not being thread safe, this bit of randomness is not fully controlled by set.seed(...), so even when doing this you can get random differences in the confidence of assignments as some levels. This shoudl typically be at relatively resolved taxonomic levels though.

It does seem weird that it is completely deterministic when in the same order though. e.g. if we do not shuffle the sequences being passed to assignTaxonomy we can run 10 iterations that are all identical.

We can probably resolve this in our use-case by sorting the rows/columns (but adding additional samples affects the row order sorting so we are still a little stuck as far as exact reproducibility when adding more data). I presume these differences will not be significant enough to warrant concern, but it complicates our integration testing and we need to be able to justify it to customers.

benjjneb commented 3 years ago

Hm... let me investigate a bit more, but I suspect what is happening is that the initialization state of the C-side random number generators is affected by the sequences previously classified, as the random device is initialized once and every access will "increment" to the next pseudo-random numbers. hence this behaviors -- the same random numbers when the sequences stay in the same order, but not when they aren't.

On the bigger issue of complete reproducibility, even that will break once you start performing multi-threading on datasets large enough to bother splitting into multiple threads (>100s of seuqences) as the individual threads won't maintain a consistent order each time.

The lack of completely reproducibility of assignTaxonomy is a current annoyance that I want to fix... but need to figure out how to do so while setting the random number inits from the non-thread-safe R side of things.

zanieb commented 3 years ago

Hm... let me investigate a bit more, but I suspect what is happening is that the initialization state of the C-side random number generators is affected by the sequences previously classified, as the random device is initialized once and every access will "increment" to the next pseudo-random numbers. hence this behaviors -- the same random numbers when the sequences stay in the same order, but not when they aren't.

That makes sense! Thanks for taking it seriously.

On the bigger issue of complete reproducibility, even that will break once you start performing multi-threading on datasets large enough to bother splitting into multiple threads (>100s of seuqences) as the individual threads won't maintain a consistent order each time.

We haven't actually seen this in our tests (which are all multithreaded) but it's not something we've sought to prove.

The lack of completely reproducibility of assignTaxonomy is a current annoyance that I want to fix... but need to figure out how to do so while setting the random number inits from the non-thread-safe R side of things.

It's feasible we can dedicate some developer time to looking into it if you point us to the area in question--although I certainly can't make any promises.

benjjneb commented 3 years ago

I'm gonna try to at least take a look into this again before our next release (which will be ~Oct). Perfect reproducibility (assuming the random number generator is pre-seeded identically) is a worthwhile goal.

d4straub commented 2 years ago

@benjjneb First of all, thanks for all your efforts with DADA2!

I have added DADA2 as central building block to an amplicon analysis pipeline (https://nf-co.re/ampliseq & https://github.com/nf-core/ampliseq) and while developing also came across minor changes of bootstrap values (and therefore taxonomic assignments) between pipeline runs. Those differences are typically in the range of 1 bootstrap (of 100), but by using a bootstrap cutoff of 50 by default, some taxa levels can obviously fall/raise below/above the cutoff even with so little deviation.

The pipeline specifies for assignTaxonomy a seed, uses containers (I tested with singularity & quay.io/biocontainers/bioconductor-dada2:1.22.0--r41h399db7b_0), but also multithreading (I tested with 2 cores; usually 20).

Here a part of the nextflow process for taxonomic classification if helpful at all: https://github.com/nf-core/ampliseq/blob/aed35e5b4bad907ed2bdf6587efb82ad2578e6b1/modules/local/dada2_taxonomy.nf#L25-L30

Is there anything I could improve to obtain perfect reproducibility? Is there any progress with DADA2 assignTaxonomy and perfect reproducibility? Or do you have a recommendation of a classifier that allows perfect reproducibility?

benjjneb commented 2 years ago

@d4straub As you have correctly surmised, the randomness of the bootstrapping in assignTaxonomy is still not controlled by set.seed R-side. Very briefly, when we implemented multi-threaded assignTaxonomy we had to switch to a thread-safe random number generator for the C code, but did not figure out how to make said random number generation remain fully reproducible using a seed number.

Unfortunately, we have no active effort to fix this. I'm sure it is fixable, but thread-safe and reproducible random number generation sits a bit past the horizon of my current comp-sci skills.

To address your specific questions: No. No. And have you looked at DECIPHER:IdTaxa? It has a call-out section in the dada2 tutorial.

d4straub commented 2 years ago

Thanks for all the info! I did hear of DECIPHER:IdTaxa, but never looked deeper into it, I will do so!