Issue merging reads across sequence runs and diagnosing sequence batch effects PacBio ITS Sequel II samples

Hi, thank you for the great tools! I am having some issues with merging reads across sequence runs. I am working with PacBio ITS sequel II data. I've run the workflow through the DADA2 algorithm by individual sequence runs using cutadapt to trim primers and using the same QC parameters. The samples were merged using mergeSequenceTables(). I then removed chimeras on the joined sequence table and assigned taxonomy. I am getting strange output where samples cluster based on sequence run. The samples vary quite a bit in # of reads and read quality. I have 700+ samples from 2 sequencing centers across 6 sequence runs.

Trimming parameters:

filter_trackReads <- filterAndTrim(path.cut, filtFs, truncQ = 2, minQ= 3, minLen = 600, maxLen= 3000, maxEE = 2,# set to 2 multithread = TRUE, compress = TRUE, rm.phix = TRUE)

Cumulative expected errors across sequence run:

Read counts by run: Read_Count_Boxplot_Raw_AllData

Prefilter histogram of sequence lengths:

Error plot for example from one sequence run. Pretty similar across runs.

ErrorPlotsB1 1

I've tried a number of different filtering parameters and have gotten the same results varying maxEE+2-6, minQ and minLen. I am wondering if anyone else has seen significant batch effects in their samples with PacBio sequences. I've grouped all of the samples together and ran through the workflow and the batch effect significantly diminishes. Does anyone else have experience like this? I know that it is not recommended to run all samples together due to the differences in the assigned quality scores generated through learnErrorRates() but are there cases when it might be necessary? What parameters can I adjust or consider to mitigate this effect?

@dawsonfairbanks do you know if the sequences are Q20 (99% base accuracy) or Q30 (99.9% base accuracy)? Looking at the error plot it looks more like 99%, but worth checking. For comparison, here is one from ours that is 99.9% (note the difference in the error frequency scale):

PacBio's default is 99%, but DADA2 works best with 99.9%. Your sequence vendors should hopefully provide you with both FASTQ and the PacBio BAM files; the run metrics, including the read quality, will normally only be in the BAM.

The BAM file can be parsed using samtools or similar to check for the read quality, stored in the rq tag; samtools view -c -e '[rq] < 0.999' my.bam would count the number of reads that are less than the expected DADA2 cutoff. For example:

$ samtools view -@ $SLURM_NTASKS -c -e '[rq] < 0.999' testbam.hifi_reads.bam
0
$ samtools view -@ $SLURM_NTASKS -c -e '[rq] >= 0.999' testbam.hifi_reads.bam
2265195

benjjneb / dada2

Issue merging reads across sequence runs and diagnosing sequence batch effects PacBio ITS Sequel II samples #1900