Almost all reads lost when applying dada() with PacBio HiFi reads

BjnNowak commented 2 years ago

Hi Benjamin

Recently, a PhD student asked me to help her write the script to process her data (I am not a bioinformatician but I know how to use R).

The objective of this project is to identify the soil microbiome of wheat plots. To do this, samples were taken from a hundred of wheat plots and, after extraction, 16S genes were amplified following PacBio HiFi reads procedure (more details here )

The script that I wrote to process this data is mostly based on your tutorial for Big Data analysis. My script is available here. But, after this script has run on the server, a significant number of samples come back with 0 reads at the end of the pipeline.

As I do not have the computing power to treat all samples with my computer, I run the whole pipeline for only two samples that came back with 1 and 0 reads respectively (fastq files for those samples are available here). Primers have been removed from the samples. On my computer, I use R 4.2 and your latest dada2 release (1.20).

Turns out that almost all reads are lost when applying the dada() function.

Also, the error function seems odd to me (eg sometimes increasing with the quality score)

I may do something really stupid with this pipeline that may explain this, but I read in some of your comments that strange things may happen with PacBio HiFi reads and dada2, so I wanted to ask you if you knew what could happen here?

Anyway, thanks for your works and great tutorials!

Best,

Benjamin

benjjneb commented 2 years ago

Hi Benjamin, Pre-processing of PacBio HiFi reads has some critical differences compared to the pre-processing of Illumina reads, as is assumed in the Big Data tutorial. In particular, it is necessary to remove primers and to orient the reads in a consistent direction. So, for PacBio data the reference workflow you should be using is the one associated with our publication. You can see those workflows at the GH repo associated with the publication: https://github.com/benjjneb/LRASmanuscript

In particular, the "Analysis of the Zymo mock community." and "Analysis of Replicate Fecal Samples" pages should get you up and running. They include the critical removePrimers step that also orients the reads in a consistent direction.

BjnNowak commented 2 years ago

Hi,

Thanks for your return! I read your paper and tutorial previously about PacBio HiFi reads, but I did not follow strictly this example because, in my case :

Primers have already been removed from the sample (using PacBio tools) ;
I have 2 different files for each sample (forward and reverse)

So it seemed to me that my case was closer to the example of the big data analysis tutorial. In particular, I don't need to use the removePrimers function (as the primers have already been removed).

Would it be possible to use the dada() function starting from PacBio HiFi reads, without using the removePrimers() function? You may have two examples of my input files here.

Best,

benjjneb commented 2 years ago

It seems that many of your reads, probably most, still have primers on them. For example, when I run the following code, guessing that you are using a popular forward primer for full-length 16S sequencing:

library(ShortRead)
fnF <- "~/Desktop/demux.bc1008--bc1045.fwd.fastq.gz"
primer <- "AGAGTTTGATCMTGGC" # One common full-length 16S primr
nhits <- vcountPattern(primer, sread(readFastq(fnF)), fixed = FALSE, max.mismatch=1)
table(nhits>0)

FALSE TRUE 1400 437

So many of your reads are matching that primer sequence. The reverse reads show the same pattern. There is also a strangeness about why there would be the exact same number of forward and reverse reads in your sequencing data... those exact numbers should be random with equal means, not exactly equal.

I would recommend obtaining the HiFi reads after demultiplexing, but before whatever other pre-processing was performed to (not) remove primers and split the reads into forward and reverse orientations, and just use the removePrimers function on the sequences as in the example workflows instead.

BjnNowak commented 2 years ago

Exact number of reads in fwd and rev was part of the pre-processing steps: we retained only reads present in both sides.

Kind of ironic but actually we used the same primers for this project as you did for your zymo tutorial, so now that you pointed me this issue it was quite straightforward to adapt this tutorial... I performed a test on the fwd files only (before working with HiFi reads just after demultiplexing) and it works pretty well! As we are working with soil samples, with high diverstity, the only change I made for this test is setting OMEGA_C at 0 (as you suggested in this post). I was still loosing too many reads along the pipeline otherwise.

So it seems OK for me now, many thanks for your reactivity!

benjjneb / dada2

Almost all reads lost when applying dada() with PacBio HiFi reads #1569