Different sequence runs with different sequence lengths

benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution

http://benjjneb.github.io/dada2/

GNU Lesser General Public License v3.0

468 stars 142 forks source link

Different sequence runs with different sequence lengths #1570

Closed larajan67 closed 4 months ago

larajan67 commented 2 years ago

Hello, I am relatively new to DNA sequencing analyses so I appreciate the detailed tutorials and feedback on previous issues.

I have 16s rRNA data from 3 different sequence runs using the same primers. I have run all runs through the dada2 workflow using the same parameters for the 'FilterandTrim' step. 2 of the runs have reads mostly with a sequence length of ~214, while the 3rd run has reads mostly with a sequence length of ~239. I want to merge these runs together into a single dataset for downstream analysis, yet this difference of sequence lengths is an issue (see graph). readsbylength

What can I do to resolve this?

benjjneb commented 2 years ago

Do these different sequencing runs use the same library preparation approach?

Some amplicon library preps will include the primers on the sequenced part of the reads, while others will not. Even if the primers used are the same, these different library prep approaches will lead to difference in the sequenced amplicon lengths of the total length of the two primers, which is usually ~30 nts.

That's my guess for what is going on here. It's fixable, but first it would be good to confirm or disprove that possibility. One can do this by simply inspecting the raw sequences from each run... is the primers sequence present at the start of the reads or not?

larajan67 commented 2 years ago

I have confirmed with my collaborators, who led the sequencing, that same library preparation approach was used.

I did find that the specific cutadapt code did not remove the primers in this 3rd run as it had for the first 2 runs. I slightly edited the cutadapt code and the primers were mostly removed (less than 1% remaining).

However, there remains still a difference in sequence length between the 3 runs: 1 & 2-214; 3-231 (see graph). Do you have any other ideas? Thank you for your help. lengthbyreads

benjjneb commented 2 years ago

I'd recommend removing the cutadapt step that seems to be introducing this spurious length variation. Just use trimLeft in filterAndTrim to remove the primers instead.

larajan67 commented 2 years ago

I used the trimLeft in filterAndTrim instead of cutadapt. However, the sequence length difference remains with a difference of 14 between runs 1/2 and run 3. Any other ideas? Thank you, Lara

benjjneb commented 2 years ago

I would try aligning the most abundant ASVs in run 1 vs. run 3, and trying to identify where the additional sequence length observed in run 3 might be. I'm still betting it is some technical issue, but maybe not?

larajan67 commented 2 years ago

I apologize for the delayed response. I had another project with higher priority.

As I am relatively new to dada2, I would like to double check my approach for alignment of specific sequences. Would this be the best approach after filtering for the most abundant ASVs? (see below) sequences<-getSequences(seqtab.nochim_most) alignment <- AlignSeqs(DNAStringSet(sequences), anchor=NA) phang.align <- phyDat(as(alignment, "matrix"), type="DNA")

Then I would examine the phang.align output?

benjjneb commented 2 years ago

That works. The signal we are looking for here is pretty crude (where are the extra bases in run 3 vs run 1?), so any approach that lets you view the alignemnt of a few top ASVs from each run should do the trick.

larajan67 commented 2 years ago

Here are snapshots from the output of 'BrowseSeqs' for Run 1 (top) and Run 3 (bottom) for the most abundant ASVs. What do you recommend? Screen Shot 2022-08-06 at 3 43 14 PM