benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0
464 stars 142 forks source link

Large amount of losses in read number during processing: PacBio Hifi reads #2009

Open HitMonk opened 2 weeks ago

HitMonk commented 2 weeks ago

Hello All! Im working with PacBio reads for the very first time. Compared to the tutorial, Im see im loosing really large number of reads. Below is the table:

Sample ccs primers filtered denoised
EC-CD16-16S 39295 29158 8115 2212
EC-CD24-16S 56231 41424 14767 9685
EC-CD8-16S 76440 57378 22369 14828

Additionally, why do i see a drop in read number post primer trimming?

Below are all the commands that I have used:

Primer trimming:

Give path and trim primers

F27 <- "AGRGTTYGATYMTGGCTCAG"
R1492 <- "RGYTACCTTGTTACGACTT"
nops <- file.path(path, "noprimers", basename(fns))
prim <- removePrimers(fns, nops, primer.fwd=F27, primer.rev=dada2:::rc(R1492), orient=TRUE)
prim

Quality filtering and trimming:

filts <- file.path(path, "noprimers", "filtered", basename(fns))
filts <- file.path(path, "filtered_fastptrim", paste0(sample.names, "_filt.fastq.gz"))
out <- filterAndTrim(nops, filts, maxEE=2, rm.phix=TRUE, minQ=3,minLen = 1000,
                     maxN=0, compress=TRUE, multithread=TRUE)
out

STEP 2. Dereplicate the reads

drp_filts <- derepFastq(filts, verbose=TRUE)
names(drp_filts) <- sample.names

STEP 3. Learn Errors

err2 <- learnErrors(drp_filts, errorEstimationFunction=PacBioErrfun, BAND_SIZE=32, multithread=TRUE)

STEP 4. DADA2 main inference

dd2 <- dada(drp_filts, err=err2, BAND_SIZE=32, multithread=TRUE)

Is there anything that im doing wrong? Im open to all suggestions and advice from your side.

Best.

benjjneb commented 2 weeks ago

Some additional information would help in diagnosing your read loss:

  1. What sequencing instrument/chemistry generated this data?
  2. What environment are you sequencing?
  3. What is the output of plotQualityProfile on a typical sample?
  4. What is the output to the console of derepFastq(..., verbose=TRUE) on a typical sample?
HitMonk commented 1 week ago

Hello @benjjneb Thank you for your reply. Below are all the answers the to questions:

  1. What sequencing instrument/chemistry generated this

The samples were used to prepare a PacBio SMRTbell library using  AMPure PB bead-based purification and sequenced on a PacBio Sequel IIe system.

  1. What environment are you sequencing?

The samples are from a fermented plant roots, commonly used locally in daily diet.

  1. What is the output of plotQualityProfile on a typical sample?

PlotQualittyProfile before trimming: image

PlotQualityProfile after trimming: image

  1. What is the output to the console of derepFastq(..., verbose=TRUE) on a typical sample?

Ah! This was the other question that I had but forgot to mention in the first post. It seems like there are many unique sequences too - >95% of the data. Is it possible that the high error rate shows up as unique sequences?

drp_filts <- derepFastq(filts, verbose=TRUE)
Dereplicating sequence entries in Fastq file: 
Encountered 7880 unique sequences from 8115 total sequences read.
Dereplicating sequence entries in Fastq file: 
Encountered 13492 unique sequences from 14767 total sequences read.
Dereplicating sequence entries in Fastq file: 
Encountered 17853 unique sequences from 22369 total sequences read.

Thank you once again for your time and support. Please let me know if you would like any additional information from my side.

Best,