benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0

Issue merging reads across sequence runs and diagnosing sequence batch effects PacBio ITS Sequel II samples #1900

Open dawsonfairbanks opened 8 months ago

dawsonfairbanks commented 8 months ago

Hi, thank you for the great tools! I am having some issues with merging reads across sequence runs. I am working with PacBio Sequel II ITS data. I've run each sequencing run through the DADA2 workflow individually, using cutadapt to trim primers and the same QC parameters throughout. The per-run sequence tables were merged with mergeSequenceTables(). I then removed chimeras on the merged table and assigned taxonomy. I am getting strange output where samples cluster by sequencing run. The samples vary quite a bit in read count and read quality. I have 700+ samples from 2 sequencing centers across 6 sequencing runs.
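For reference, the per-run workflow described above can be sketched roughly as follows (object names and the reference database path are illustrative, not my actual code; this assumes primer-trimmed, filtered FASTQs grouped by run):

```r
library(dada2)

# Hypothetical per-run lists of filtered FASTQ paths
runs <- list(run1 = filtered_files_run1, run2 = filtered_files_run2)

# Learn errors and denoise each run separately (PacBioErrfun for CCS data)
seqtabs <- lapply(runs, function(filts) {
  err <- learnErrors(filts, errorEstimationFunction = PacBioErrfun,
                     BAND_SIZE = 32, multithread = TRUE)
  dd  <- dada(filts, err = err, BAND_SIZE = 32, multithread = TRUE)
  makeSequenceTable(dd)
})

# Merge the per-run tables, then remove chimeras once on the combined table
seqtab        <- mergeSequenceTables(tables = seqtabs)
seqtab.nochim <- removeBimeraDenovo(seqtab, method = "consensus", multithread = TRUE)
taxa          <- assignTaxonomy(seqtab.nochim, "UNITE_ref.fasta.gz", multithread = TRUE)
```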

Trimming parameters:

filter_trackReads <- filterAndTrim(path.cut, filtFs, truncQ = 2, minQ = 3,
                                   minLen = 600, maxLen = 3000, maxEE = 2,
                                   multithread = TRUE, compress = TRUE, rm.phix = TRUE)

Cumulative expected errors across sequence run:

[screenshot: cumulative expected errors by sequence run]

Read counts by run:

[boxplot: raw read counts by run, all data]

Prefilter histogram of sequence lengths:

[screenshot: pre-filter histogram of sequence lengths]

Error plot for example from one sequence run. Pretty similar across runs.

[error plot image]

I've tried a number of different filtering parameters, varying maxEE from 2 to 6 as well as minQ and minLen, and gotten the same results. I am wondering if anyone else has seen significant batch effects in their samples with PacBio sequences. When I group all of the samples together and run them through the workflow as a single batch, the batch effect diminishes significantly. Does anyone else have experience like this? I know it is not recommended to run all samples together because the error rates estimated by learnErrors() can differ between runs, but are there cases when it might be necessary? What parameters can I adjust or consider to mitigate this effect?

cjfields commented 7 months ago

@dawsonfairbanks do you know if the sequences are Q20 (99% base accuracy) or Q30 (99.9% base accuracy)? Looking at the error plot it looks more like 99%, but worth checking. For comparison, here is one from ours that is 99.9% (note the difference in the error frequency scale):

[error plot image from a 99.9% accuracy run]

PacBio's default is 99%, but DADA2 works best with 99.9%. Your sequence vendors should hopefully provide you with both FASTQ and the PacBio BAM files; the run metrics, including the read quality, will normally only be in the BAM.

The BAM file can be parsed using samtools or similar to check the read quality, which is stored in the rq tag; samtools view -c -e '[rq] < 0.999' my.bam counts the reads that fall below the expected DADA2 cutoff. For example:

$ samtools view -@ $SLURM_NTASKS -c -e '[rq] < 0.999' testbam.hifi_reads.bam
0
$ samtools view -@ $SLURM_NTASKS -c -e '[rq] >= 0.999' testbam.hifi_reads.bam
2265195
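If a sizable fraction of reads does fall below 0.999, one option is to filter the BAM on the rq tag and regenerate the FASTQ before running DADA2. A sketch, assuming samtools >= 1.12 (for the -e expression filter) and illustrative filenames:

```shell
# Keep only reads with predicted accuracy >= 99.9% (rq tag), then export FASTQ
samtools view -b -e '[rq] >= 0.999' my.hifi_reads.bam -o my.hifi_reads.q30.bam
samtools fastq my.hifi_reads.q30.bam | gzip > my.hifi_reads.q30.fastq.gz
```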