Open max-mose-diversigen opened 4 years ago
assignTaxonomy
makes assignments on a per-sequence basis, it will not be affected by the order of the ASVs in the sequence table at all.
Thanks for the clarification, it looks like my issue lies elsewhere then.
Updating this issue and re-opening with an example script where shuffling the ASVs input produces a different output when passed to assignTaxonomy
:
library(dada2); packageVersion("dada2")
seqs <- c(
"GTGCCAGCAGCCGCGGTAATACGTAGGTGGCAAGCGTTATCCGGAATTATTGGGCGTAAAGCGCGCGTAGGCGGTTTTTTAAGTCTGATGTGAAAGCCCACGGCTCAACCGTGGAGGGTCATTGGAAACTGGAAAACTTGAGTGCAGAAGAGGAAAGTGGAATTCCATGTGTAGCGGTGAAATGCGTAGAGATCTGGAGGAATACCGGTGGCGAAGGCGGCCCCCTGGACGAAGACTGACGCTCAGGTGCGAAAGCGTGGGGAGCAAACAGGATTAGAAACCCGTGTAGTCC",
"GTGCCAGCAGCCGCGGTAATACGTAGGTGGCAAGCGTTGTCCGGATTTATTGGGCGTAAAGCGAGCGCAGGCGGTTTCTTAAGTCTGATGTGAAAGCCCCCGGCTCAACCGGGGAGGGTCATTGGAAACTGGGAGACTTGAGTGCAGAAGAGGAGAGTGGAATTCCATGTGTAGCGGTGAAATGCGTAGATATATGGAAGAACACCAGTGGCGAAGGCGGCTACCTGGTCTGCAACTGACGCTGAGACTCGAAAGCATGGGTAGCGAACAGGATTAGAAACCCTAGTAGTCC",
"GTGCCAGCCGCCGCGGTAATACGTAGGTGGCAAGCGTTATCCGGAATTATTGGGCGTAAAGCGCGCGTAGGCGGTTTTTTAAGTCTGATGTGAAAGCCCCCGGCTCAACCGGGGAGGGTCATTGGAAACTGGGAGACTTGAGTGCAGAAGAGGAGAGTGGAATTCCATGTGTAGCGGTGAAATGCGTAGATATATGGAGGAACACCAGTGGCGAAGGCGGCTCTCTGGTCTGTAACTGACGCTGAGGCTCGAAAGCGTGGGGAGCAAACAGGATTAGATACCCGTGTAGTCC",
"GTGCCAGCAGCCGCGGTAATACGGAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCGCGTAGGCGGTTTTTTAAGTCTGATGTGAAAGCCCACGGCTCAACCGTGGAGGGTCATTGGAAACTGGATAACTTGAGTGCAGAAGAGGGTAGTGGAACTCCATGTGTAGCGGTGGAATGCGTAGATATATGGAAGAACACCAGTGGCGAAGGCGGCTACCTGGTCTGCAACTGACGCTGAGACTCGAAAGCATGGGTAGCGAACAGGATTAGAAACCCGTGTAGTCC",
"GTGCCAGCAGCCGCGGTAATACGGAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCGCGTAGGCGGTTTTTTAAGTCTGATGTGAAAGCCCACGGCTCAACCGTGGAGGGTCATTGGAAACTGGATAACTTGAGTGCAGAAGAGGGTAGTGGAACTCCATGTGTAGCGGTGGAATGCGTAGATATATGGAAGAACACCAGTGGCGAAGGCGGCTACCTGGTCTGCAACTGACGCTGAGACTCGAAAGCATGGGTAGCGAACAGGATTAGAAACCCCTGTAGTCC",
"GTGCCAGCAGCCGCGGTAATACGTAGGTGGCAAGCGTTGTCCGGATTTATTGGGCGTAAAGCGAGCGCAGGCGGTTTCTTAAGTCTGATGTGAAAGCCCCCGGCTCAACCGGGGAGGGTCATTGGAAACTGGGAGACTTGAGTGCAGAAGAGGAGAGTGGAATTCCACGTGTAGCGGTGAAATGCGTAGATATGTGGAGGAACACCAGTGGCGAAGGCGACTCTCTGGTCTGTAACTGACGCTGAGGAGCGAAAGCGTGGGGAGCGAACAGGATTAGATACCCTGGTAGTCC",
"GTGCCAGCAGCCGCGGTAATACGGAGGGTGCAAGCGTTGTCCGGATTTATTGGGCGTAAAGCGCGCGCAGGCGGTCTTTTAAGTCTGATGTGAAAGCCCCCGGCTCAACCGGGGAGGGTCATTGGAAACTGGGGAACTTGAGTGCAGAAGAGGAGAGTGGAATTCCACGTGTAGCGGTGAAATGCGTAGAGATGTGGAGGAACACCAGTGGCGAAGGCGACTCTCTGGTCTGTAACTGACGCTGAGGAGCGAAAGCGTGGGGAGCGAACAGGATTAGATACCCCTGTAGTCC",
"GTGCCAGCAGCCGCGGTAATACGGAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCGCGTAGGCGGTTTTTTAAGTCTGATGTGAAAGCCCACGGCTCAACCGTGGAGGGTCATTGGAAACTGGATAACTTGAGTGCAGAAGAGGGTAGTGGAACTCCATGTGTAGCGGTGGAATGCGTAGATATATGGAAGAACACCAGTGGCGAAGGCGGCTACCTGGTCTGCAACTGACGCTGAGGCGCGAAAGCGTGGGGAGCAAACAGGATTAGATACCCTTGTAGTCC",
"GTGCCAGCAGCCGCGGTAATACGGAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCGCGTAGGCGGTTTTTTAAGTCTGATGTGAAAGCCCACGGCTCAACCGTGGAGGGTCATTGGAAACTGGATAACTTGAGTGCAGAAGAGGGTAGTGGAACTCCATGTGTAGCGGTGGAATGCGTAGATATATGGAAGAACACCAGTGGCGAAGGCGGCTACCTGGTCTGCAACTGACGCTGAGACTCGAAAGCATGGGTAGCGAACAGGATTAGATACCCTGGTAGTCC",
"GTGCCAGCAGCCGCGGTAATACGTAGGTGGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCGCGTAGGTGGTTCAGCAAGTTGGATGTGAAATCCCCGGGCTCAACCTGGGAACTGCATTCGAAACTGGCAGGCTTGAGTCTTGTAGAGGGGGGTAGAATTCCAGGTGTAGCGGTGAAATGCGTAGAGATCTGGAGGAATACCGGTGGCGAAGGCGGCCCCCTGGACAAAGACTGACGCTCAGGTGCGAAAGCGTGGGGAGCAAACAGGATTAGAAACCCGAGTAGTCC")
set.seed(100) # Initialize random number generator for reproducibility
taxa <- assignTaxonomy(seqs, "rdp_train_set_16.fa.gz", multithread=FALSE)
for (x in seq(10)) {
shuffled_seqs <- sample(seqs)
taxa_repeat <- assignTaxonomy(shuffled_seqs, "rdp_train_set_16.fa.gz", multithread=FALSE)
print(identical(sort(taxa), sort(taxa_repeat)))
}
Running this script, I see an output of:
[1] ‘1.16.0’
[1] FALSE
[1] FALSE
[1] FALSE
[1] FALSE
[1] FALSE
[1] FALSE
[1] FALSE
[1] TRUE
[1] FALSE
[1] FALSE
Is that outcome to be expected? If not, I'd be curious to know why this might be happening.
Some additional notes:
Is that outcome to be expected? If not, I'd be curious to know why this might be happening.
Importantly, there is an element of randomness in the naive Bayesian classifier method that comes from selecting the subsamples of the sequence over which boot strapping assignment is performed.
Also important, due to an issue with R code not being thread safe, this bit of randomness is not fully controlled by set.seed(...)
, so even when doing this you can get random differences in the confidence of assignments as some levels. This shoudl typically be at relatively resolved taxonomic levels though.
Is that outcome to be expected? If not, I'd be curious to know why this might be happening.
Importantly, there is an element of randomness in the naive Bayesian classifier method that comes from selecting the subsamples of the sequence over which boot strapping assignment is performed.
Also important, due to an issue with R code not being thread safe, this bit of randomness is not fully controlled by
set.seed(...)
, so even when doing this you can get random differences in the confidence of assignments as some levels. This shoudl typically be at relatively resolved taxonomic levels though.
It does seem weird that it is completely deterministic when in the same order though. e.g. if we do not shuffle the sequences being passed to assignTaxonomy
we can run 10 iterations that are all identical.
We can probably resolve this in our use-case by sorting the rows/columns (but adding additional samples affects the row order sorting so we are still a little stuck as far as exact reproducibility when adding more data). I presume these differences will not be significant enough to warrant concern, but it complicates our integration testing and we need to be able to justify it to customers.
Hm... let me investigate a bit more, but I suspect what is happening is that the initialization state of the C-side random number generators is affected by the sequences previously classified, as the random device is initialized once and every access will "increment" to the next pseudo-random numbers. hence this behaviors -- the same random numbers when the sequences stay in the same order, but not when they aren't.
On the bigger issue of complete reproducibility, even that will break once you start performing multi-threading on datasets large enough to bother splitting into multiple threads (>100s of seuqences) as the individual threads won't maintain a consistent order each time.
The lack of completely reproducibility of assignTaxonomy
is a current annoyance that I want to fix... but need to figure out how to do so while setting the random number inits from the non-thread-safe R side of things.
Hm... let me investigate a bit more, but I suspect what is happening is that the initialization state of the C-side random number generators is affected by the sequences previously classified, as the random device is initialized once and every access will "increment" to the next pseudo-random numbers. hence this behaviors -- the same random numbers when the sequences stay in the same order, but not when they aren't.
That makes sense! Thanks for taking it seriously.
On the bigger issue of complete reproducibility, even that will break once you start performing multi-threading on datasets large enough to bother splitting into multiple threads (>100s of seuqences) as the individual threads won't maintain a consistent order each time.
We haven't actually seen this in our tests (which are all multithreaded) but it's not something we've sought to prove.
The lack of completely reproducibility of
assignTaxonomy
is a current annoyance that I want to fix... but need to figure out how to do so while setting the random number inits from the non-thread-safe R side of things.
It's feasible we can dedicate some developer time to looking into it if you point us to the area in question--although I certainly can't make any promises.
I'm gonna try to at least take a look into this again before our next release (which will be ~Oct). Perfect reproducibility (assuming the random number generator is pre-seeded identically) is a worthwhile goal.
@benjjneb First of all, thanks for all your efforts with DADA2!
I have added DADA2 as central building block to an amplicon analysis pipeline (https://nf-co.re/ampliseq & https://github.com/nf-core/ampliseq) and while developing also came across minor changes of bootstrap values (and therefore taxonomic assignments) between pipeline runs. Those differences are typically in the range of 1 bootstrap (of 100), but by using a bootstrap cutoff of 50 by default, some taxa levels can obviously fall/raise below/above the cutoff even with so little deviation.
The pipeline specifies for assignTaxonomy a seed, uses containers (I tested with singularity & quay.io/biocontainers/bioconductor-dada2:1.22.0--r41h399db7b_0
), but also multithreading (I tested with 2 cores; usually 20).
Here a part of the nextflow process for taxonomic classification if helpful at all: https://github.com/nf-core/ampliseq/blob/aed35e5b4bad907ed2bdf6587efb82ad2578e6b1/modules/local/dada2_taxonomy.nf#L25-L30
Is there anything I could improve to obtain perfect reproducibility? Is there any progress with DADA2 assignTaxonomy and perfect reproducibility? Or do you have a recommendation of a classifier that allows perfect reproducibility?
@d4straub As you have correctly surmised, the randomness of the bootstrapping in assignTaxonomy
is still not controlled by set.seed
R-side. Very briefly, when we implemented multi-threaded assignTaxonomy
we had to switch to a thread-safe random number generator for the C code, but did not figure out how to make said random number generation remain fully reproducible using a seed number.
Unfortunately, we have no active effort to fix this. I'm sure it is fixable, but thread-safe and reproducible random number generation sits a bit past the horizon of my current comp-sci skills.
To address your specific questions: No. No. And have you looked at DECIPHER:IdTaxa? It has a call-out section in the dada2 tutorial.
Thanks for all the info! I did hear of DECIPHER:IdTaxa, but never looked deeper into it, I will do so!
Hello,
When testing taxonomy assignment I noticed that the output would change if the order of the sequences in the sequence table was different (this apparently was due to the order in which the files containing our data was loaded). The order in terms of abundance was unchanged, though; for instance, if a sample contains a higher abundance of a given sequence that would be sorted before a sequence with a lower abundance, but there were differences in order when the abundances were the same. Not being familiar with the algorithm underpinning the
assignTaxonomy
function, would it make sense that these discrepancies in output could be attributable to this? Any help would be much appreciated!Thanks!