benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0

Losing a lot of reads #1772

Closed MadsBjornsen closed 5 months ago

MadsBjornsen commented 1 year ago

Hi,

I am having some trouble with 16S V3-V4 region paired-end reads. They were pooled into 4 libraries, which I received back demultiplexed into pools containing my samples (about 30 in each pool). I am using cutadapt to demultiplex the pools into individual samples and to remove the primers. However, I seem to be losing a lot of reads during this process and end up with poor-quality reads (see attached file). How would you recommend setting truncLen for these? Or would you recommend using only the forward reads and not merging? I end up merging only a fraction of the original data, probably because the reads do not overlap.

I am using this script for the demultiplexing part: https://github.com/tobiasgf/lulu/blob/master/Files_LULU_manuscript/CLI%20scripts/DADA2_demultiplex_tail_trim.sh

QC_profile_SS.pdf
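
For picking truncLen from quality profiles like the one attached, a minimal sketch (the file paths here are hypothetical, pointing at the demultiplexed, primer-trimmed reads from cutadapt):

```r
library(dada2)

# Hypothetical paths to the per-sample, primer-trimmed fastq files.
fnFs <- sort(list.files("demux", pattern = "_R1.fastq.gz", full.names = TRUE))
fnRs <- sort(list.files("demux", pattern = "_R2.fastq.gz", full.names = TRUE))

# Inspect where the quality crashes; truncLen is then set just before that point.
plotQualityProfile(fnFs[1:4])
plotQualityProfile(fnRs[1:4])
```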

Thanks! Mads

benjjneb commented 1 year ago

I can't speak to the demultiplexing part. The quality profiles you provided look generally bad, with hard drop-offs midway through that often indicate issues with the library preparation.

I would start by working with the forward reads only. I would also note that "losing a lot of [bad] reads" is no loss at all.
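
A hedged forward-read-only sketch of what that could look like (all file paths are hypothetical, and the truncation point of 200 is a placeholder to be set from the quality profiles):

```r
library(dada2)

# Hypothetical paths to the demultiplexed, primer-trimmed forward reads.
fnFs   <- sort(list.files("demux", pattern = "_R1.fastq.gz", full.names = TRUE))
filtFs <- file.path("filtered", basename(fnFs))

# Truncate before the hard drop-off seen in the quality profiles; 200 is a placeholder.
out <- filterAndTrim(fnFs, filtFs, truncLen = 200, maxEE = 2,
                     truncQ = 2, rm.phix = TRUE, multithread = TRUE)

errF   <- learnErrors(filtFs, multithread = TRUE)
dadaFs <- dada(filtFs, err = errF, multithread = TRUE)
seqtab <- makeSequenceTable(dadaFs)

# Track how many reads survive filtering and denoising per sample; reads removed by
# the maxEE filter here are the low-quality reads that were "lost" anyway.
cbind(out, denoised = sapply(dadaFs, function(x) sum(getUniques(x))))
```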

MadsBjornsen commented 1 year ago

Hi, thanks for the quick answer!

Regarding using only the forward reads: would I then be able to merge that seqtab or ASV table with the table from another sequencing run? The same samples have been sequenced twice, as we were having trouble with the first run; that has now been resolved and the new run has good-quality reads (attached files). Here I used truncLen (250, 210) for the SS reads and (200, 250) for the AS reads, with maxEE at (2, 2) for both.

The data were, by the way, generated on a MiSeq V3 2x300 bp run, with the 341F and 806R primers.
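
For reference, the two filtering calls described above would look roughly like this (the SS/AS file vectors are hypothetical names for the two read orientations produced by the demultiplexing script):

```r
library(dada2)

# SS orientation: truncate forward reads at 250 and reverse reads at 210.
out_SS <- filterAndTrim(fnFs_SS, filtFs_SS, fnRs_SS, filtRs_SS,
                        truncLen = c(250, 210), maxEE = c(2, 2),
                        rm.phix = TRUE, multithread = TRUE)

# AS orientation: truncate forward reads at 200 and reverse reads at 250.
out_AS <- filterAndTrim(fnFs_AS, filtFs_AS, fnRs_AS, filtRs_AS,
                        truncLen = c(200, 250), maxEE = c(2, 2),
                        rm.phix = TRUE, multithread = TRUE)
```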

And you are right, it is not "bad" to lose poor-quality reads; it is just that we lost almost all the data when merging, leaving nothing left to analyse further.

Thanks.

QC_profile_SS.pdf QC_profile_AS.pdf

benjjneb commented 1 year ago

Regarding using only the forward reads: would I then be able to merge that seqtab or ASV table with the table from another sequencing run?

For simple merging, the ASVs need to cover the same gene region. So, if you use just the forward reads here, truncated at position XXX, then you would need to truncate the ASVs from the previous samples also at XXX in order to simply merge them with the new ASV table.
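
A hedged sketch of that truncate-then-merge step, assuming hypothetical table names seqtab_new (the forward-only ASV table from this run) and seqtab_old (the previous run's table), and a placeholder truncation position:

```r
library(dada2)

trunc_pos <- 200  # placeholder; use the same truncLen as the new forward-only ASVs

# Trim the previous run's ASV sequences to the same position, then collapse any
# ASVs that became identical after trimming.
seqtab_old_trim <- seqtab_old
colnames(seqtab_old_trim) <- substr(colnames(seqtab_old_trim), 1, trunc_pos)
seqtab_old_trim <- collapseNoMismatch(seqtab_old_trim)

# The same samples were sequenced in both runs, so sum their counts when merging.
seqtab_merged <- mergeSequenceTables(seqtab_new, seqtab_old_trim, repeats = "sum")
```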

MadsBjornsen commented 1 year ago

Hi,

Sorry for the late response, and thank you for the feedback! I managed to move forward with the data, but came across what I think is an anomaly: 1 ASV out of 3500 ASVs takes up more than 20% of the total number of reads, and I was wondering whether it could have something to do with merging the 2 runs?

For seq1 I used truncLen (250, 200) for the SS reads and (200, 250) for the AS reads. For seq2 I used truncLen (240, 200) for the SS reads and (200, 240) for the AS reads.

In both cases I merged the forward and reverse reads fine.

Another question: I am trying to understand when you would pass err to the dada function and when you would use selfConsist instead.

Thanks

benjjneb commented 1 year ago

1 ASV out of 3500 ASVs takes up more than 20% of the total number of reads, and I was wondering whether it could have something to do with merging the 2 runs?

I don't know why that would have anything to do with combining the runs. If anything, combining runs incorrectly will lead to ASVs being split into run-specific ASVs, which sounds like the opposite of what you are seeing. And slightly different truncation lengths won't affect this either if you are merging successfully -- the truncation length doesn't affect the merged amplicon length (provided it's long enough to allow for the overlap needed for merging).
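
A quick hedged check along those lines, assuming hypothetical per-run tables seqtab1 and seqtab2 and a combined table seqtab_merged:

```r
library(dada2)

# Merged amplicon lengths should cluster tightly regardless of truncLen.
table(nchar(getSequences(seqtab_merged)))

# Is the dominant ASV present in both runs, or only in one of them?
top_asv <- names(sort(colSums(seqtab_merged), decreasing = TRUE))[1]
sum(seqtab_merged[, top_asv]) / sum(seqtab_merged)  # its share of all reads
top_asv %in% getSequences(seqtab1)
top_asv %in% getSequences(seqtab2)
```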

Another question: I am trying to understand when you would pass err to the dada function and when you would use selfConsist instead.

Supplying a pre-learned error model via err splits the most time-consuming part of the workflow into two steps, and learnErrors also uses only part of the data to speed things up when there is more than enough. With selfConsist=TRUE, dada instead alternates between sample inference and error-rate estimation on the data you give it until the estimates converge.
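
Sketched with a hypothetical vector of filtered file paths filtFs, the two usages look like this:

```r
library(dada2)

# Two-step: learn the error model first (learnErrors subsamples when there is more
# than enough data), then pass it to dada().
errF   <- learnErrors(filtFs, multithread = TRUE)
dadaFs <- dada(filtFs, err = errF, multithread = TRUE)

# One-step: let dada() alternate between sample inference and error-rate estimation
# on the supplied data until the estimates converge.
dadaFs2 <- dada(filtFs, err = NULL, selfConsist = TRUE, multithread = TRUE)
```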