Closed granek closed 7 years ago
The issue is that assignTaxonomy
currently assumes all the input sequences are in 5'-3' orientation. I checked seq1, and the reverse complement of seq1:
unname(assignTaxonomy(c(seq1, dada2:::rc(seq1)), "~/tax/silva_nr_v123_train_set.fa.gz"))
[1,] "Eukaryota" "Opisthokonta" "Holozoa" "Metazoa_(Animalia)" "Eumetazoa" "Bilateria"
[2,] "Bacteria" "Firmicutes" "Bacilli" "Lactobacillales" "Lactobacillaceae" "Lactobacillus"
The reverse complement is the correct assignment.
We do not currently have an automatic method in dada2 to fix this issue. One workaround is to classify both your sequences and the reverse-complement (dada2:::rc
) of the sequences, and pick the one that's Bacteria.
There should be an option in assignTaxonomy
that considers the input sequences in both orientations, and picks the one that assigns most accurately.
Thanks for your quick response! I figured there was something simple I was missing.
I started working yesterday on picking between the sequence and its reverse complement. My plan is to pick the max of the rowSums of bootstrap values between the two sequence orientations. Not done yet, but I'm happy to share code once it is.
Here is some wrapper code to do this. It is not pretty, but works with my limited testing. I am happy to try to incorporate this into taxonomy.R, but also happy to defer to you to do so.
library(dada2)
assignTaxonomyBothStrands = function(seqs, refFasta, outputBootstraps=FALSE, ...){
wrc.tax = assignTaxonomy(append(seqs,dada2:::rc(seqs)),
refFasta,
outputBootstraps=TRUE,
...)
wrc.bootsums = rowSums(wrc.tax[["boot"]])
num.seqs = length(seqs)
# seq.ids = names(rownames(wrc.tax[["boot"]]))
best_indices = c()
for (i in seq(num.seqs)){
indices = c(i,i+num.seqs)
# print(paste(indices, ":", seq.ids[indices]))
best_index = indices[which.max(wrc.bootsums[indices])]
# print(best_index)
best_indices[i] = best_index
}
if (outputBootstraps==TRUE){
return(list(tax=wrc.tax[["tax"]][best_indices,],
boot=wrc.tax[["boot"]][best_indices,]))
} else {
return(wrc.tax[["tax"]][best_indices,])
}
}
A new feature solving this issue was added in 83d6fa8a7bd494789d94c31678f314e29926c9c8
assignTaxonomy(..., tryRC=TRUE)
will test the forward and RC versions of each sequence, and classify using whichever best matched the reference sequences.
This gives the expected classifications for @granek example data above
dada2 is great, thanks for sharing such a useful tool!
I have a confusing result where assignTaxonomy returns Eukaryote assignments for bacterial 16s sequences. I have seen Issue #70 and confirmed by BLAST that these are actually bacterial 16s sequences.
I am using silva_nr_v123_train_set.fa.gz, R 3.3.2, dada2 1.2.0. Below are a subset of the sequences that have this problem. Thanks for looking into this!
Josh