benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0

index (forward and reverse) in merged output #890

Closed mattoslmp closed 4 years ago

mattoslmp commented 4 years ago

Dear Benjamin, I would like to ask a question about dada2,

I have a question about the merge step. Example code:

    mergers_t1 <- mergePairs(dadaFs_t1, derep_forward, dadaRs_t1, derep_reverse, verbose=TRUE, trimOverhang=TRUE)

Example of part of my result, head (mergers):

       abundance forward reverse nmatch nmismatch nindel prefer accept
    6        176       3       6    100         0      0      2   TRUE
    7        138       6      10    108         0      0      2   TRUE
    8        137      13      12    104         0      0      2   TRUE
    9        125       8      11    100         0      0      2   TRUE
    10       125       1     137    102         0      0      1   TRUE
    11       118       9       9    100         0      0      2   TRUE

The forward and reverse indices differ within each row. Is this correct, or does it indicate a problem?

Thanks, Leandro

benjjneb commented 4 years ago

This is very hard to interpret. What is it that you are concerned about? I think that you are seeing a lot of failed merging? More than you expected?

If my guess is close, then some basic information is needed, such as: What amplicon are you sequencing? What primer set? What trimming/truncation parameters did you use?

mdemmel commented 4 years ago

Hello,

I had the same question as described above, so I hoped I could elaborate on my own data to clarify. I have paired-end V4 16S reads sequenced on a 2x250 bp MiSeq platform. I am using the big data analysis pipeline, and up to 50% of my reads are lost during the merge step, even after experimenting with many different filterAndTrim parameters, including varying truncation lengths.

After looking at the top of the mergers object, I can see that many forward and reverse indices are mismatched:

[Screenshot: top rows of the mergers object]

My understanding of the merge step is that the forward and reverse values should mostly be equivalent. Is this mismatch associated with the low yield of merged sequences, or am I misunderstanding what these forward and reverse indices represent?

Thanks!

benjjneb commented 4 years ago

> My understanding of the merge step is that the forward and reverse values should mostly be equivalent.

Not necessarily. The index values represent the order in which each ASV was identified in the forward and reverse read data. While these often do line up fairly well for the most abundant ASVs, that correspondence can decay quickly, and there are lots of reasons that things get uncorrelated later on. So the kind of pattern you are seeing above does not indicate a problem, it looks pretty normal actually.
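A toy illustration of why the two index columns need not agree. This is plain R on made-up per-read labels, not dada2 output. It assumes, as a simplification, that ASVs are numbered by decreasing abundance within each read direction; the real index is the order in which each ASV was identified. The variant names and sequence labels are hypothetical.

```r
# Hypothetical data: 9 read pairs from three true variants (a, b, c).
# Variants a and b are identical over the forward read but differ over
# the reverse read, so the forward data contain one ASV where the
# reverse data contain two.
variant <- rep(c("a", "b", "c"), times = c(3, 2, 4))
fwd_seq <- c(a = "F_ab", b = "F_ab", c = "F_c")[variant]
rev_seq <- c(a = "R_a",  b = "R_b",  c = "R_c")[variant]

# Assumption: number the ASVs within one direction by decreasing abundance.
index_by_abundance <- function(labels) {
  counts <- sort(table(labels), decreasing = TRUE)
  setNames(seq_along(counts), names(counts))
}
fwd_index <- index_by_abundance(fwd_seq)
rev_index <- index_by_abundance(rev_seq)

# One row per distinct (forward, reverse) combination.
pairs <- unique(data.frame(variant = variant,
                           forward = unname(fwd_index[fwd_seq]),
                           reverse = unname(rev_index[rev_seq])))
print(pairs)
```

In this toy case no variant has matching forward and reverse indices, even though every read pair is perfectly consistent.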

mdemmel commented 4 years ago

Ok, thanks very much. It's useful to know that the issue with merging is arising elsewhere.

rainbow218 commented 4 years ago

Dear @benjjneb ,

I have a few questions related to this topic. Is it normal to observe the number of unique ASVs after merging to be higher than the number of unique ASVs of dada forward and dada reverse object? Is information from fastq header of paired reads (ie. cluster coordinates within lane tiles) preserved and considered during the merging process?

thank you for your kind attention, Chan

benjjneb commented 4 years ago

@rainbow218

> Is it normal to observe the number of unique ASVs after merging to be higher than the number of unique ASVs of dada forward and dada reverse object?

I'm not sure it's "normal", but it is not necessarily abnormal either. Especially with longer amplicons, there can be single forward ASVs that match with multiple reverse ASVs, because some variation exists only in the part of the gene covered by the reverse read.
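A small base R sketch of how the merged count can exceed both the forward and the reverse ASV counts. The per-read index vectors are invented for illustration, and the sketch assumes that every pair merges and that each distinct (forward, reverse) combination produces a distinct merged sequence.

```r
# Hypothetical per-read ASV indices for 6 read pairs.
fwd_asv <- c(1, 1, 1, 1, 2, 2)
rev_asv <- c(1, 1, 2, 2, 1, 1)

# Each distinct (forward, reverse) combination is one merged ASV.
merged_pairs <- unique(data.frame(forward = fwd_asv, reverse = rev_asv))

n_fwd    <- length(unique(fwd_asv))  # 2 forward ASVs
n_rev    <- length(unique(rev_asv))  # 2 reverse ASVs
n_merged <- nrow(merged_pairs)       # 3 merged ASVs: (1,1), (1,2), (2,1)
```

Here forward ASV 1 pairs with two different reverse ASVs, and reverse ASV 1 pairs with two different forward ASVs, so two ASVs per direction yield three merged ASVs.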

> Is information from fastq header of paired reads (ie. cluster coordinates within lane tiles) preserved and considered during the merging process?

No.

rainbow218 commented 4 years ago

@benjjneb

thank you for your response, I have a follow up question: How is the abundance of the merged ASVs determined? Is it based on the minimum ASV count out of the forward ASV and reverse ASV that is merged?

benjjneb commented 4 years ago

> How is the abundance of the merged ASVs determined?

On a read-by-read basis. Each read-pair is denoised, then merged if they overlap (perfectly by default).
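A rough base R sketch of what "merged if they overlap perfectly" means for a single denoised pair. `merge_pair` and `revcomp` are hypothetical helpers, not dada2 functions, and the ungapped offset scan is a simplification: as far as I know, dada2 aligns the two reads instead. The defaults `min_overlap = 12` and `max_mismatch = 0` are chosen to mirror what I understand to be the `minOverlap` and `maxMismatch` defaults of `mergePairs`.

```r
# Reverse-complement a DNA string.
revcomp <- function(s) {
  bases <- rev(strsplit(s, "")[[1]])
  chartr("ACGT", "TGCA", paste(bases, collapse = ""))
}

# Hypothetical helper: merge one forward/reverse pair by ungapped overlap.
# Returns the merged sequence, or NA if no acceptable overlap exists.
merge_pair <- function(fwd_read, rev_read, min_overlap = 12, max_mismatch = 0) {
  rc <- revcomp(rev_read)
  max_ov <- min(nchar(fwd_read), nchar(rc))
  if (max_ov < min_overlap) return(NA_character_)
  for (ov in max_ov:min_overlap) {  # try the longest overlap first
    fwd_tail <- substr(fwd_read, nchar(fwd_read) - ov + 1, nchar(fwd_read))
    rc_head  <- substr(rc, 1, ov)
    n_mismatch <- sum(strsplit(fwd_tail, "")[[1]] != strsplit(rc_head, "")[[1]])
    if (n_mismatch <= max_mismatch) {
      return(paste0(fwd_read, substr(rc, ov + 1, nchar(rc))))
    }
  }
  NA_character_
}

amplicon <- "ACGTACGTTTGACCAGTAGGCATTCGAGCTA"   # made-up 31 bp amplicon
fwd_read <- substr(amplicon, 1, 22)
rev_read <- revcomp(substr(amplicon, 10, 31))   # overlaps fwd_read by 13 bp
merged   <- merge_pair(fwd_read, rev_read)      # reconstructs the amplicon

# Introduce one mismatch inside the overlap: the pair is rejected by default.
rc_mut <- substr(amplicon, 10, 31)
substr(rc_mut, 5, 5) <- "T"
rejected <- merge_pair(fwd_read, revcomp(rc_mut))  # NA
```

With the zero-mismatch default, a single disagreement in the overlap is enough to drop the pair, which is one way reads are lost at the merge step.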

rainbow218 commented 4 years ago

@benjjneb

For the following example: [image] Please correct me if my understanding is wrong. The abundance value of the first merged ASV is 24527; would either forward ASV index 1 or reverse ASV index 1 also have abundance of 24527? The other merged ASVs with forward ASV index 1 have different abundance values, so are those abundance values coming from the reverse ASVs?

Thank you for your kind attention!

benjjneb commented 4 years ago

> would either forward ASV index 1 or reverse ASV index 1 also have abundance of 24527?

Probably neither. That abundance is the total number of paired reads where the forward member of the pair was denoised to FWD-ASV1 and the reverse member was denoised to REV-ASV1.

It isn't clear from that data.frame, but those numbers come from read-by-read counting of the number of paired reads that were FWD-ASV1 -- REV-ASV1. The total number of forward reads denoised to FWD-ASV1, and the total number of reverse reads denoised to REV-ASV1, are probably different, since some of those will be paired to reads that denoised differently.
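The read-by-read counting described above can be sketched in base R. The two index vectors are invented: each position stands for one read pair and holds the ASV its forward or reverse read was denoised to. This is an illustration of the counting logic only, not dada2's internal code.

```r
# Hypothetical per-read ASV indices for 8 read pairs.
fwd_asv <- c(1, 1, 1, 1, 1, 2, 2, 3)
rev_asv <- c(1, 1, 1, 2, 2, 1, 2, 3)

# Cross-tabulate: each cell counts the read pairs with that combination.
pair_table <- table(forward = fwd_asv, reverse = rev_asv)

n_pair_1_1 <- pair_table["1", "1"]  # 3 pairs were FWD-ASV1 -- REV-ASV1
n_fwd_1    <- sum(fwd_asv == 1)     # 5 forward reads denoised to FWD-ASV1
n_rev_1    <- sum(rev_asv == 1)     # 4 reverse reads denoised to REV-ASV1
```

The merged abundance (3) matches neither the forward total (5) nor the reverse total (4), and it is not their minimum either, because some FWD-ASV1 reads are paired with REV-ASV2 and one REV-ASV1 read is paired with FWD-ASV2.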