benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0
474 stars 144 forks source link

Merge ASV tables after using trimRight #2036

Closed ma-n-on closed 2 weeks ago

ma-n-on commented 1 month ago

Hi

I have data from multiple Illumina MiSeq runs, that I want to analyze together. Some of the runs were sequenced with a v2 kit (2x250 cyc), some with a v3 kit (2x300 cyc). All of the runs cover the 16s V4V5 (approx. 411bp) region and for all runs the same fwd and rev primers were used. My idea is to follow the DADA2 pipeline up till and with ASV table construction, but without chimera removal for each run separately and than merge the asv tables. Below I put some of the code I used so far.

Code used for parameter setting: out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs, trimLeft = c(19,20), trimRight = c(10,50), maxN=0, maxEE=c(2,4), truncQ=2, rm.phix=TRUE, compress=TRUE, multithread=TRUE) with trimLeft equal for all runs (used to cut the primers) and trimRight run specific

Code used for asv table merging: st.all <- mergeSequenceTables(st.run1, st.run2)

Code used after merging: col.st <- collapseNoMismatch(st.all, minOverlap = 200, verbose = T)

I read through a lot of the threads but am unfortunately still confused with what option to use best. Should I trim /truncate the reads to equal length in the separate asv tables, before merging? Would you recommend to use truncLen instead of trimRight? Thank you for any input / idea on how to handle my data.

benjjneb commented 1 month ago

Would you recommend to use truncLen instead of trimRight?

Yes.

col.st <- collapseNoMismatch(st.all, minOverlap = 200, verbose = T)

This minOverlap setting seems excessive, and runs the risk of amplicons failing to merge just because they were truncated too short to reach that much overlap.

Should I trim /truncate the reads to equal length in the separate asv tables, before merging?

You should use the same trimLeft consistently, but don't have to use the same truncLen (or trimRight) for each run, because after appropriate merging, the truncLen doesn't affect the start/stop sites of the merged amplicon. A simple example, forward (F) reverse (R) reads with different truncation lengths yielding the same merged (M) read.

F: ACGTACGT
R:     ACGTTGCA
M: ACGTACGTTGCA

F: ACGTACGTTG
R:        TTGCA
M: ACGTACGTTGCA
ma-n-on commented 1 month ago

Thank you very much for your answer, this helped already a lot.

Initially, I chose to use trimRight in order to retain shorter sequencing fragments, rather than discarding them with truncLen.

Do you think there will be a problem, if I subsequently apply collapseNoMismatch with a low minOverlap value? I am thinking about a scenario where a sequencing fragments, say 100nts, is identical to a much longer fragment, leading to the collapse of the two. However, there is a possibillity that the shorter fragment could represent a different taxon. (for reference, most of the sequences in my merged asv table are around 370bp)

Perhaps I am miss-understanding how the minOverlap argument works ... As I understand it, the default value of 20 means that min 20 nts must be identical between sequences in order for them to be collapsed. This would imply that the algorithm checks in each sequence, whether there is a sequence of 20nts or longer that also appears in another sequence in the table. If found, it merges the sequences into the more abundant one.

benjjneb commented 1 month ago

Perhaps I am miss-understanding how the minOverlap argument works ... As I understand it, the default value of 20 means that min 20 nts must be identical between sequences in order for them to be collapsed. This would imply that the algorithm checks in each sequence, whether there is a sequence of 20nts or longer that also appears in another sequence in the table. If found, it merges the sequences into the more abundant one.

Yes you are misunderstanding minOverlap.

minOverlap applies to merging the forward and reverse reads of paired reads. It sets the minimum number of nucleotides in the "overlap" region after aligning the forward/reverse pairs together to keep the merged pair. It has nothing to do with how prefixes (shorter versions of a longer sequence) treated in the final sequence table.

ma-n-on commented 1 month ago

Ok, in this case I understand why 20 as value for the minOverlap argument makes sense. Also, I understand that the collapseNoMismatch function does the merging of the reverse and forward reads of paired reads. Correct?

For the separate runs I followed the DADA2 pipeline tutorial, where after learning the error rates the paired reads are merged and afterwards the sequence table is constructed. Therefore the sequence tables I used as input for the mergeSequenceTables function contain the merged paired reads. Is that alright? Or should the paired reads not be merged when planning to use the mergeSequenceTables and collapseNoMismatch functions subsequently?

benjjneb commented 1 month ago

Also, I understand that the collapseNoMismatch function does the merging of the reverse and forward reads of paired reads. Correct?

No, mergePairs is the function that merges forward and reverse reads.

collapseNoMismatch is not usually needed. It is a function run on the final sequence table that will collapse together ASVs that differ only in their length.

Therefore the sequence tables I used as input for the mergeSequenceTables function contain the merged paired reads. Is that alright?

Yes, that is the intended usage.