benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0
469 stars 142 forks source link

assignTaxonomy returns Eukaryota for Bacterial sequences #165

Closed granek closed 7 years ago

granek commented 7 years ago

dada2 is great, thanks for sharing such a useful tool!

I have a confusing result where assignTaxonomy returns Eukaryote assignments for bacterial 16s sequences. I have seen Issue #70 and confirmed by BLAST that these are actually bacterial 16s sequences.

I am using silva_nr_v123_train_set.fa.gz, R 3.3.2, dada2 1.2.0. Below are a subset of the sequences that have this problem. Thanks for looking into this!

Josh

seq1 TACCCCACCGGCTTTGGGTGTTACAAACTCTCATGGTGTGACGGGCGGTGTGTACAAGGCCCGGGAACGTATTCACCGCGGCATGCTGATCCGCGATTACTAGCGATTCCGACTTCATGTAGGCGAGTTGCAGCCTACAATCCGAACTGAGAACGGCTTTAAGAGATTTGCTAAACCTCGCGGTCTTGCGACTCGTTGTACCGTCCATTGTAGCACGTGTGTAGCCCAGGTCATAAGGGGCATGATGATTTGACGTCATCCCCACCTTCCTCCGGTTTGTCACCGGCAGTCTTGCTAGAGTGCCCAACTTAATGCTGGCAACTAACAATAAGGGTTGCGCTCGTTGCGGGACTTAACCCAACATCTCACGACACGAGCTGACGACAACCATGCACCACCTGTCATTTTGTCCCCGAAGGGAAAGTCCTATCTCTAGGATTGTCAAAAGATGTCAAGACCTGGTAAGGTTCTTCGCGTTGCTTCGAATTAAACCACATGCTCCACCGCTTGTGCGGGCCCCCGTCAATTCCTTTGAGTTTCAACCTTGCGGTCGTACTCCCCAGGCGGAATGCTTATTGCGTTAGCTGCAGCACTGAAGGGCGGAAACCCTCCAACACTTAGCATTCATCGTTTACGGCGTGGACTACCAGGGTATCTAATCCTGTTTGCTACCCACGCTTTCGAACCTCAGCGTCAGTTACAGACCAGAGAGCCGCTTTCGCCACTGGTGTTCTTCCATATATCTACGCATTTCACCGCTACACATGGAGTTCCACTCTCCTCTTCTGCACTCAAGTCTCCCAGTTTCCAATGCACTACTCCGGTTAAGCCGAAGGCTTTCACATCAGACTTAAAAGA seq2 GGGCGGTGTGTACAAGGCCCGGGAACGTATTCACCGCGGCGTGCTGATCCGCGATTACTAGCGATTCCGGCTTCATGTAGGCGAGTTGCAGCCTACAATCCGAACTGAGAGAAGCTTTAAGAGATTAGCTTAGCCTCGCGACTTCGCAACTCGTTGTACTTCCCATTGTAGCACGTGTGTAGCCCAGGTCATAAGGGGCATGATGATTTGACGTCATCCCCACCTTCCTCCGGTTTGTCACCGGCAGTCTCGCTAGAGTGCCCAACTAAATGATGGCAACTAACAATAAGGGTTGCGCTCGTTGCGGGACTTAACCCAACATCTCACGACACGAGCTGACGACAACCATGCACCACCTGTCACTTTGCCCCCGAAGGGGAAGCTCTATCTCTAGAGTGGTCAAAGGATGTCAAGACCTGGTAAGGTTCTTCGCGTTGCTTCGAATTAAACCACATGCTCCACCGCTTGTGCGGGCCCCCGTCAATTCCTTTGAGTTTCAACCTTGCGGTCGTACTCCCCAGGCGGAGTGCTTAATGCGTTTGCTGCAGCACTGAAGGGCGGAAACCCTCCAACACTTAGCACTCATCGTTTACGGCGTGGACTACCAGGGTATCTAATCCTGTTCGCTCCCCACGCTTTCGAGCCTCAGCGTCAGTTACAGACCAGAGAGCCGCCTTCGCCACTGGTGTTCCTCCATATATCTACGCATTTCACCGCTACACATGGAATTCCACTCTCCTCTTCTGCACTCAAGTCTCCCAGTTTCCAATGACCCTCCCCGGTTGAGCCGGGGGCTTTCACATCAGACTTAAGAAACCGCCTGCGCTCGCTTTACGCCCAATAAATCCGGACAACGCTTGCCACCTACGTATTACCGCGGCTGCTGGCACGTAGTTAGCCGTGG seq3 TACTCCACCGGCTTCGGGTGTTACAAACTCTCGTGGTGTGACGGGCGGTGTGTACAAGACCCGGGAACGTATTCACCGTAGCATGCTGATCTACGATTACTAGCGATTCCAGCTTCATGTAGTCGAGTTGCAGACTACAATCCGAACTGAGAATAATTTTATGGGATTTGCTTGACCTCGCGGGTTCGCTGCCCTTTGTATTATCCATTGTAGCACGTGTGTAGCCCAAATCATAAGGGGCATGATGATTTGACGTCATCCCCACCTTCCTCCGGTTTGTCACCGGCAGTCAACCTAGAGTGCCCAACTTAATGATGGCAACTAAGCTTAAGGGTTGCGCTCGTTGCGGGACTTAACCCAACATCTCACGACACGAGCTGACGACAACCATGCACCACCTGTCACTTTGTCCCCCGAAGGGGAAAACTCTATCTCTAGAGCGGTCAAAGGATGTCAAGATTTGGTAAGGTTCTTCGCGTTGCTTCGAATTAAACCACATGCTCCACCGCTTGTGCGGGTCCCCGTCAATTCCTTTGAGTTTCAACCTTGCGGTCGTACTCCCCAGGCGGAGTGCTTAATGCGTTAGCTGCAGCACTAAGGGGCGGAAACCCCCTAACACTTAGCACTCATCGTTTACGGCGTGGACTACCAGGGTATCTAATCCTGTTTGATCCCCACGCTTTCGCACATCAGCGTCAGTTACAGACCAGAGAGCCGCCTTCGCCACTGGTGTTCCTCCATATCTCTGCGCATTTCACCGCTACACATGGAATTCCACTCTCCTCTTCTGCACTCAAGTTCCCCAGTTTCCAATGACCCTCCACGGTTGAGCCGTGGGCTTTCACATCAGACTTAAGAAACCGCCTACGCGCGCTTTACGCCCAATAATTCCGGATAACGCTTGCCACCTACGTATTACCGCGGCTGCTGGCACGTAGTTAGCCGTGG seq4 GTTCCACCTCGCGGCTTCACTGCCCGTTGTACCGGCCATTGTAGTACGTGTGTAGCCCAGGTCATAAGGGGCATGATGATTTGACGTCATCCCCACCTTCCTCCGGTTTGTCACCGGCAGTCACCTTAGAGTGCCCACCCGAAGTGCTGGCAACTAAGATCAAGGGTTGCGCTCGTTGCGGGACTTAACCCAACATCTCACGACACGAGCTGACGACAACCATGCACCACCTGTCTCCTCTGTCCCGAAGGAAAGGTACATCTCTGTACCGGTCAGAGGGATGTCAAGACCTGGTAAGGTTCTTCGCGTTGCTTCGAATTAAACCACATACTCCACTGCTTGTGCGGGTCCCCGTCAATTCCTTTGAGTTTCAGTCTTGCGACCGTACTCCCCAGGCGGAGTGCTTAATGTGTTAACTTCGGCACCAAGGGTATCGAAACCCCTAACACCTAGCACTCATCGTTTACGGCGTGGACTACCAGGGTATCTAATCCTGTTTGCTCCCCACGCTTTCGCGCCTCAGCGTCAGTTACAGCCCAGAGAGTCGCCTTCGCCACTGGTGTTCCTCCACATATCTACGCATTTCACCGCTACACGTGGAATTCCACTCTCCTCTTCTGCACTCAAGTCACCCAGTTTCCAGTGCGATCCGGGGTTGAGCCCCGGGATTAAACACCAGACTTAAATGACCGCCTGCGCGCGCTTTACGCCCAATAATTCCCGGACAACGCTTGCCCCCTACGTATTACCGCGGCTGCTGGCACGTAGTTAGCCGGGGCTTTCTTCTCAGGTACCGTCACCTTG seq5 AAAAGGTTACCTCACCGACTTCGGGTGTTACAAACTCTCGTGGTGTGACGGGCGGTGTGTACAAGGCCCGGGAACGTATTCACCGCGGCGTGCTGATCCGCGATTACTAGCGATTCCGGCTTCATGTAGGCGAGTTGCAGCCTACAATCCGAACTGAGAGAAGCTTTAAGAGATTAGCTTAGCCTCGCGACTTCGCAACTCGTTGTACTTCCCATTGTAGCACGTGTGTAGCCCAGGTCATAAGGGGCATGATGATTTGACGTCATCCCCACCTTCCTCCGGTTTGTCACCGGCAGTCTCGCTAGAGTGCCCAACTAAATGATGGCAACTAACAATAAGGGTTGCGCTCGTTGCGGGACTTAACCCAACATCTCACGACACGAGCTGACGACAACCATGCACCACCTGTCACTTTGCCCCCGAAGGGGAAGCTCTATCTCTAGAGTGGTCAAAGGATGTCAAGACCTGGTAAGGTTCTTCGCGTTGCTTCGAATTAAACCACATGCTCCACCGCTTGTGCGGGCCCCCGTCAATTCCTTTGAGTTTCAACCTTGCGGTCGTACTCCCCAGGCGGAGTGCTTAATGCGTTTGCTGCAGCACTGAAGGGCGGAAACCCTCCAACACTTAGCACTCATCGTTTACGGCGTGGACTACCAGGGTATCTAATCCTGTTCGCTCCCCACGCTTTCGAGCCTCAGCGTCAGTTACAGACCAGAGAGCCGCCTTCGCCACTGGTGTTCCTCCATATATCTACGCATTTCACCGCTACACATGGAATTCCACTCTCCTCTTCTGCACTCAAGTCTCCCAGTTTCCAATGACCCTCCCCGGTTGAGCCGGGGGCTTTCACATCAGACTTAAGAAACCGCCTGCGCTCGCTTTACGCCCAATAAATCCGGACAACGCTTGCCACCTACGTATTACCGCGGCTGCTGGCACGTAGTTAGCCGTGG

benjjneb commented 7 years ago

The issue is that assignTaxonomy currently assumes all the input sequences are in 5'-3' orientation. I checked seq1, and the reverse complement of seq1:

unname(assignTaxonomy(c(seq1, dada2:::rc(seq1)), "~/tax/silva_nr_v123_train_set.fa.gz"))

[1,] "Eukaryota" "Opisthokonta" "Holozoa" "Metazoa_(Animalia)" "Eumetazoa" "Bilateria"
[2,] "Bacteria" "Firmicutes" "Bacilli" "Lactobacillales" "Lactobacillaceae" "Lactobacillus"

The reverse complement is the correct assignment.

We do not currently have an automatic method in dada2 to fix this issue. One workaround is to classify both your sequences and the reverse-complement (dada2:::rc) of the sequences, and pick the one that's Bacteria.

benjjneb commented 7 years ago

There should be an option in assignTaxonomy that considers the input sequences in both orientations, and picks the one that assigns most accurately.

granek commented 7 years ago

Thanks for your quick response! I figured there was something simple I was missing.

I started working yesterday on picking between the sequence and its reverse complement. My plan is to pick the max of the rowSums of bootstrap values between the two sequence orientations. Not done yet, but I'm happy to share code once it is.

granek commented 7 years ago

Here is some wrapper code to do this. It is not pretty, but works with my limited testing. I am happy to try to incorporate this into taxonomy.R, but also happy to defer to you to do so.

library(dada2)

assignTaxonomyBothStrands = function(seqs, refFasta, outputBootstraps=FALSE, ...){
  wrc.tax = assignTaxonomy(append(seqs,dada2:::rc(seqs)),
                                 refFasta,
                                 outputBootstraps=TRUE,
                                 ...)
  wrc.bootsums = rowSums(wrc.tax[["boot"]])
  num.seqs = length(seqs)
  # seq.ids = names(rownames(wrc.tax[["boot"]]))
  best_indices = c()
  for (i in seq(num.seqs)){
    indices = c(i,i+num.seqs)
    # print(paste(indices, ":", seq.ids[indices]))
    best_index = indices[which.max(wrc.bootsums[indices])]
    # print(best_index)
    best_indices[i] = best_index
  }
  if (outputBootstraps==TRUE){
    return(list(tax=wrc.tax[["tax"]][best_indices,],
                boot=wrc.tax[["boot"]][best_indices,]))
  } else {
    return(wrc.tax[["tax"]][best_indices,])
  }
}
benjjneb commented 7 years ago

A new feature solving this issue was added in 83d6fa8a7bd494789d94c31678f314e29926c9c8

assignTaxonomy(..., tryRC=TRUE) will test the forward and RC versions of each sequence, and classify using whichever best matched the reference sequences.

This gives the expected classifications for @granek example data above