benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0
470 stars 142 forks source link

justConcatenate vs. overlap mergePairs formatting issue #373

Closed rnettles04 closed 6 years ago

rnettles04 commented 6 years ago

I have sequenced the fungal ITS2 region and am attempting to analyse it using dada2. Unfortunately, many of my reads do not overlap, so I would like to use justConcatenate = TRUE when running the mergePairs command. Because I ran into downstream errors with the concatenated data, I decided to see if I encountered any errors when I mergePairs() via overlap. The overlapping data worked, while the concatenated data continues to give me an error at the removeBimeraDenovo() step, by giving me the following error: "Error in isBimeraDenovoTable(unqs[[i]], ..., verbose = verbose) : Input must be a valid sequence table."

The sequence of steps I am using is:

  1. mergePairs() with concatenation
  2. makeSequenceTable() with merged output
  3. removeBimeraDenovo() with sequence table

When I run mergePairs() with concatenated versus overlapping methods I get outputs that seem to be formatted similarly. The only difference I see is sporadic spacing. However, when I run makeSequenceTable() with overlapping merging, it gives me differently formatted output than the concatenated data does. The overlapping output successfully runs through removeBimeraDenovo(), while the concatenated output gives me an error at that step.

Is there some way I can fix this formatting issue so I can use the concatenated data for downstream analysis?

I have included some of my code, and the relevant file formats for those interested. Thanks!

Format of the OVERLAPPING reads after merging and creating a sequencing table:

mergers <- mergePairs(dadaFs, derepFs, dadaRs, derepRs, verbose = TRUE) head(mergers)

$R840 sequence 1TGGTTCTGGCATCGATGAAGAACGCAGCGAAATGCGATAAGTAGTGTGAATTGCAGAATTCAGTGAATCATCGAATCTTTGAACGCACATTGCGCCCCTTGGTATTCCATGGGGCATGCCTGTTCGAGCGTCATTTGTACCTTCAAGCTCTGCTTGGTGTTGGGTGTTTGTCTCGCCTCTGCGTGTAGACTCGCCTTAAAACAATTGGCAGCCGGCGTATTGATTTCGGAGCGCAGTACATCTCGCGCTTTGCACTCATAACGACGACGTCCAAAAGTNNNNNNNNNNAGCTCTGCTTGGTGTTGGGTGTTTGTCTCGCCTCTGCGTGTAGACTCGCCTTAAAACAATTGGCAGCCGGCGTATTGATTTCGGAGCGCAGTACATCTCGCGCTTTGCACTCATAACGACGACGTCCAAAAGTACATTTTTACACTCTTGACCTCGGATCAGGTAGGGATACCCGCTGA 2 T GGTTCTCGCATCGATGAAGAACGTAGCAAAGTGCGATAACTAGTGTGAATTGCATATTCAGTGAATCATCGAGTCTTTGAACGCAGCTTGCACTCTATGGTTTTTCTATAGAGTACGCCTGCTTCAGTATCATCACAAACCCACACATAACATTTGTTTATGTGGTGATGGGTCGCATCGCTGTTTTATTACAGTGAGCACCTAAAATGTGTGTGATTTTCTGTCTGGCTTGCTAGGCAGGAATATTACGCTGGTCTCAGGATCTTTTTTTTTGGTTNNNNNNNNNNTTTTTTTGGTTCGCCCAGGAAGTAAAGTACAAGAGTATAATCCAGTAACTTTCAAACTATGATCTGAAGTCAGGTGGGATTACCCGCTGA 3
ACTTTATTTATTAAGATAGCTAATCACTTAAGCATATCAATAAGCGG AGGCTTTAGACCAAGTCTCTGCTACCGTACCGTTTGTATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAAANNNNNNNNNNAGCTCTGCTTGGTGTTGGGTGTTTGTCTCGCCTCTGCGTGTAGACTCGCCTTAAAACAATTGGCAGCCGGCGTATTGATTTCGGAGCGCAGTACATCTCGCGCTTTGCACTCATAACGACGACGTCCAAAAGTACATTTTTACACTCTTGACCTCGGATCAGGTAGGGATACCCGCTGA 4TGGTTCTCGCATCGATGAAGAACGTAGCAAAGTGCGATAACTAGTGTGAATTGCATATTCAGTGAATCATCGAGTCTTTGAACGCAGCTTGCACTCTATGGTTTTTCTATAGAGTACGCCTGCTTCAGTATCATCACAAACCCACACATAACATTTGTTTATGTGGTGATGGGTCGCATCGCTGTTTTATTACAGTGAGCACCTAAAATGTGTGTGATTTTCTGTCTGGCTTGCTAGGCAGGAATATTACGCTGGTCTCAGGATCTTTTTTTTTGGTTNNNNNNNNNNAGCTCTGCTTGGTGTTGGGTGTTTGTCTCGCCTCTGCGTGTAGACTCGCCTTAAAACAATTGGCAGCCGGCGTATTGATTTCGGAGCGCAGTACATCTCGCGCTTTGCACTCATAACGACGACGTCCAAAAGTACATTTTTACACTCTTGACCTCGGATCAGGTAGGGATACCCGCTGA abundance forward reverse nmatch nmismatch nindel prefer accept 1 128 1 1 0 0 0 NA TRUE 2 29 2 2 0 0 0 NA TRUE 3 17 3 1 0 0 0 NA TRUE 4 1 2 1 0 0 0 NA TRUE

seqtab <- makeSequenceTable(mergers) head(seqtab)

TGGTTCTCGCATCGATGAAGAGCGCAGCGAAATGCGATACCTAGTGTGAATTGCAGCCATCGTGAATCATCGAGTTCTTGAACGCACATTGCGCCCTTCGGTATTCCGGAGGGCATGCCTGTTTGAGCGTCGTTTCCTTCTTGCGCCAGCGCAGAGTTGGAGGGGTGTAGTGCCCTTCTGAAAAGAAACGTGCGGGTGAAGCGAACTATATTGGGACGCTTGGCCGCCGAACTTTTAATAAGCTCGACCTCAAATCAGGTAGGAATACCCGCTGAACT Plate1Neg

0 TGGCTCTCGCATCGATGAAGAACGCAGCGAAATGTGATAAGTAATGTGAATTGCAGAATTCAGTGAATCATCGAATCTTTGAACGCATCTTGCACTCTCTGGTATTCCGGAGAGTATGTCTGTTTGAGTGTCATGAATTCTTCAACCCTCCCTTTTCTTAGTGAATTGAGAGGAGTTTGGATTCTGAGTGTTGCTCCTAACCCGAGCTCATTCGTAATATATTAGCATCCATATTCGACTTCGGATTGACTTGGCGTAATAGACTATTCGCTGAGGAATCTAACTTCGGTTAGAGCCGTTTTTGAACTAGGAAGCTGCTAATCTAGCTTAGTCTACCTTTAGATTAGATCTCAAATCAGATAGGATTACCCGCTGA Plate1Neg

0

TGGTTCTGGCATCGATGAAGAACGCAGCGAAATGCGATAAGTAATGCGAATTGCAGAATTTCCGTGAGTCATCGAATCTTTGAACGCACATTGCGCCCACTGGTATTCCGGTGGGCATGCCTGTTCGAGCGTCATTATCCTCCCTCAAACCCCGGGTTTGGTGTTGGACCGAAGTTGTGTGAACAACTGGTCTAAAAGACAATGACGGCGTCCGTGGGACCCTCGGTGCAACGAGCTTTTAGGAGCACGCGCCGAGTTGCAAGGACCTTCCGGGCCGGTCTCCTTTTACTATTTTACAAGGTTGACCTCGGATCAGGTAGGAATACCCGCTGA Plate1Neg

0

seqtab.nochim <- removeBimeraDenovo(seqtab, method = "consensus", multithread = TRUE, verbose = TRUE) dim(seqtab_overlap.nochim) [1] 368 549

Format of the CONCATENATED reads after merging and creating a sequencing table:

mergers2 <- mergePairs(dadaFs, derepFs, dadaRs, derepRs, justConcatenate = TRUE, verbose = TRUE) head(mergers2)

$R836 sequence 1 TGGTTCTGGCATCGATGAAGAACGCAGCGAAATGCGAT AAGTAGTGTGAATTGCAGAATTCAGTGAATCATCGAATCTTTGAACGCACATTGCGCCCCTTGGTATTCCATGGGGCATGCCTGTTCGAGCGTCATTTGTACCTTCAAGCTCTGCTTGGTGTTGGGTGTTTGTCTCGCCTCTGCGTGTAGACTCGCCTTAAAACAATTGGCAGCCGGCGTATTGATTTCGGAGCGCAGTACATCTCGCGCTTTGCACTCATAACGACGACGTCCAAAAGTACATTTTTACACTCTTGACCTCGGATCAGGTAGGGATACCCGCTGA 4 TGGTTCTGGCATCGATGAAGAAC GCAGCGAAATGCGATAAGTAATGTGAATTGCAGAATTCAGTGAATCATCGAATCTTTGAACGCACATTGCGCCCCGTGGTATTCCTCGGGGCATGCCTGTTCGAGCGTCATTTCAACCCATCAAGCTTCGGCTTGGTCTTGGGGCCTGCGATTCCGCAGCCTCTAAACCCAGTGGCGGTGCCCTCGAGCTCTGAGCGTAGTAATTCTTCTCGCTATAGGGTCTCGGTGGTATCTTGCCAATAACCCCCAATTTTTATAAGGTTGACCTCGGATCAGGTAGGGATACCCGCTGA 5 TGGTTCTCGCATCGATGAAGAACGTAGCAAAGTGCGATAACTAGTGTGAATTGCATA TTCAGTGAATCATCGAGTCTTTGAACGCAACTTGCGCTCATTGGTATTCCAATGAGCACGCCTGTTTCAGTATCAAAACAAACCCTCTATTCAACTTTTGTTGTATAGGATTATTGGGGGCCTCTCGATCTGTATAGATCTTGAAACCCTTGAAATTTACTAAGGCCTGAACTTGTTTAAATGCCTGAACTTTTTTTTAATATAAAGGAAAGCTCTTGTAATTGACTTTGATGGGGCCTCCCAAATAAATCTCTTTTAAATTTGATCTGAAATCAGGCGGGATTACCCGCTGA 7 TGGCTCTGGCATCGATGAAGAACGCAGCGAAATGCGATAAGTAATGTGAATTGCAGAATTCAGTGAATCAT CGAATCTTTGAACGCACATTGCGCCCTCTGGTATTCCGGAGGGCATGCCTGTTTGAGTGTCATTAATTACTCACCAAACCTTTAAGTTTCAAACTTGTGGTGCGGTTGATGTTGGGGTCTGCAAGCTTTTATAACTTGCTTCCCTTAAAAGCATTGGTGCGGCCTGCTGCACGGTTTACACAACGTTCTAGGTTCATTCCAACTCGTTGCTTGACCGGGCAAACTTTTGTGCTGCACCTTAAACCCGTTTCTAGCTTTATGCTTTGAACCTCTTCAACTATTGACCTCAGATCAGGTAGGAATACGCGCTGA abundance forward reverse nmatch nmismatch nindel prefer accept 1 320 1 2 129 0 0 2 TRUE 4 91 2 1 128 0 0 2 TRUE 5 84 3 5 100 0 0 1 TRUE 7 30 8 7 97 0 0 2 TRUE

seqtab2 <- makeSequenceTable(mergers2) head(seqtab2)

TGGCTCTCGCATCGATGAAGAACGCAGCGAAATGCGATAAGTAATGTGAATTGCAGAATTCAGTGAATCATCGAATCTTTGAACGCACCTTGCGCCTTTTGGTATTCCGAAAGGCATGCCTGTTTGAGTGTCATGAAATATCAACCCCTCCTGGTTTCTGATCAGTGTTGGGCTTGGACTTGGGTGTCTGCCAGTTCGCTGGCTCGCCTTAAAAGAGTTAGTGGTAATAACATCCACGGCTAAGACGTAATAAGTTTCGTCTGGTCAAGGTGTGACGANNNNNNNNNNATCGGTGGTTTTTTTGCCCTCGTTCCTCGTCGTGCGTGCCCCGTCGCCCGAACGCGCTCTTACGACCCTCACGCGTCGCCTCGGCGGCGCTCCCAACGCGACCCCAGGTCAGGCGGGACTACCCGCTGA

TGGTTCTCGCATCGATGAAGAACGTAGCAAAGTGCGATAACTAGTGTGAATTGCATATTCAGTGAATCATCGAGTCTTTGAACGCAACTTGCGCTCATTGGTATTCCAATGAGCACGCCTGTTTCAGTATCAAAACAAACCCTCTATCCAACTTTTGTTGTATAGGATTATTGGGGGCCTCTCGATCTGTATAGATCTTGAAATCCCTGAAATTTACTAAGGCCTGAACTTGTTTAAATGCCTGAACTTTTTTTTAATATAAAGGAAAGCTCTTGTAANNNNNNNNNNCTGCGTGCTTGAGGTATAATGCAAGTACGGTCGTTTTAGGTTTTACCAACTGCGGCTAATCTTTTTTATACTGAGCGTATTGGAACGTTATCGATAAGAAGAGAGCGTCTAGGCGAACAATGTTCTTAAAGTTTGACCTCAAATCAGGTAGGAGTACCCGCTGA

TGGTTCTGGCATCGATGAAGAACGCAGCGAAATGCGATAAGTAGTGTGAATTGCAGAATTCAGTGAATCATCGAATCTTTGAACGCACATTGCGCCCATTGGTATTCCAATGGGCATGCCTGTTCGAGCGTCATTTGTACCCTCAAGCTTTGCTTGGTGTTGGGCGTTTGTCTTTCGAGACTCGCCTTAAAAAGATTGGCAGCCGGCATACTGGTTTCGGAGCGCAGCACAAATTGCGGTCTGTTCCTGAATGTTGACGTCCATGAAGCCCCCATTATNNNNNNNNNNCTGCGTGCTTGAGGTATAATGCAAGTACGGTCGTTTTAGGTTTTACCAACTGCGGCTAATCTTTTTTATACTGAGCGTATTGGAACGTTATCGATAAGAAGAGAGCGTCTAGGCGAACAATGTTCTTAAAGTTTGACCTCAAATCAGGTAGGAGTACCCGCTGA

seqtab2.nochim <- removeBimeraDenovo(seqtab2, method = "consensus", multithread = TRUE, verbose = TRUE)

Error in isBimeraDenovoTable(unqs[[i]], ..., verbose = verbose) : Input must be a valid sequence table.

benjjneb commented 6 years ago

The problem is that seqtab2 is not a sequence table (a matrix in R) but instead is just a character vector of sequences.

What version of the package are you using? packageVersion("dada2")

And when you are using justConcatenate=TRUE, are you using multiple samples, or just one sample?

rnettles04 commented 6 years ago

The dada2 version I have is 1.4.0

When I use justConcatenate = TRUE I am running multiple samples. Do you have any suggestions to fix this?

Thanks!

benjjneb commented 6 years ago

It's not clear to me how this could be happening, so I would need to reproduce the error on my end. Could you share a .rds object of mergers2? I.e. the file created by:

saveRDS(mergers2, "path/to/save/mergers2.rds")

It's also possible that upgrading to the current release (1.6.0) might solve the issue.

rnettles04 commented 6 years ago

I have updated my R and the dada2 package. It fixed the problem. Thanks!

rnettles04 commented 6 years ago

I have a few other questions that I am hoping you can help me with. In the assignTaxonomy function in dada2, how is it assigning taxonomy? I understand it is using a kmer approach, but is there a similarity threshold involved (like the standard 97% similarity)? I believe the wording indicates that the kmer approach generates a consensus taxonomy. How does that work, and what is the similarity requirement for taxonomy assignment?

In addition, does dada2 have the ability to rarefy the resulting taxonomic sequence table?

benjjneb commented 6 years ago

assignTaxonmy is just a reimplementation of the naive Bayesian classifier method described in this paper by Wang et al 2007: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1950982/

In addition, does dada2 have the ability to rarefy the resulting taxonomic sequence table?

dada2 doesn't perform rarefaction, but you can do that via other facilities in R, such as the phyloseq package (e.g. phyloseq::rarefy_even_depth, as shown in this tutorial)