Large amount of losses in read number during processing: PacBio Hifi reads

HitMonk commented 2 weeks ago

Hello All! Im working with PacBio reads for the very first time. Compared to the tutorial, Im see im loosing really large number of reads. Below is the table:

Sample	ccs	primers	filtered	denoised
EC-CD16-16S	39295	29158	8115	2212
EC-CD24-16S	56231	41424	14767	9685
EC-CD8-16S	76440	57378	22369	14828

Additionally, why do i see a drop in read number post primer trimming?

Below are all the commands that I have used:

Primer trimming:

Give path and trim primers

F27 <- "AGRGTTYGATYMTGGCTCAG"
R1492 <- "RGYTACCTTGTTACGACTT"
nops <- file.path(path, "noprimers", basename(fns))
prim <- removePrimers(fns, nops, primer.fwd=F27, primer.rev=dada2:::rc(R1492), orient=TRUE)
prim

Quality filtering and trimming:

filts <- file.path(path, "noprimers", "filtered", basename(fns))
filts <- file.path(path, "filtered_fastptrim", paste0(sample.names, "_filt.fastq.gz"))
out <- filterAndTrim(nops, filts, maxEE=2, rm.phix=TRUE, minQ=3,minLen = 1000,
                     maxN=0, compress=TRUE, multithread=TRUE)
out

STEP 2. Dereplicate the reads

drp_filts <- derepFastq(filts, verbose=TRUE)
names(drp_filts) <- sample.names

STEP 3. Learn Errors

err2 <- learnErrors(drp_filts, errorEstimationFunction=PacBioErrfun, BAND_SIZE=32, multithread=TRUE)

STEP 4. DADA2 main inference

dd2 <- dada(drp_filts, err=err2, BAND_SIZE=32, multithread=TRUE)

Is there anything that im doing wrong? Im open to all suggestions and advice from your side.

Best.

benjjneb commented 2 weeks ago

Some additional information would help in diagnosing your read loss:

What sequencing instrument/chemistry generated this data?
What environment are you sequencing?
What is the output of plotQualityProfile on a typical sample?
What is the output to the console of derepFastq(..., verbose=TRUE) on a typical sample?

HitMonk commented 1 week ago

Hello @benjjneb Thank you for your reply. Below are all the answers the to questions:

What sequencing instrument/chemistry generated this

The samples were used to prepare a PacBio SMRTbell library using AMPure PB bead-based purification and sequenced on a PacBio Sequel IIe system.

What environment are you sequencing?

The samples are from a fermented plant roots, commonly used locally in daily diet.

What is the output of plotQualityProfile on a typical sample?

PlotQualittyProfile before trimming:

PlotQualityProfile after trimming:

What is the output to the console of derepFastq(..., verbose=TRUE) on a typical sample?

Ah! This was the other question that I had but forgot to mention in the first post. It seems like there are many unique sequences too - >95% of the data. Is it possible that the high error rate shows up as unique sequences?

drp_filts <- derepFastq(filts, verbose=TRUE)
Dereplicating sequence entries in Fastq file: 
Encountered 7880 unique sequences from 8115 total sequences read.
Dereplicating sequence entries in Fastq file: 
Encountered 13492 unique sequences from 14767 total sequences read.
Dereplicating sequence entries in Fastq file: 
Encountered 17853 unique sequences from 22369 total sequences read.

Thank you once again for your time and support. Please let me know if you would like any additional information from my side.

Best,

benjjneb / dada2