benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0

PacBio Revio Kinnex data #1892

Open cjfields opened 4 months ago

cjfields commented 4 months ago

We're just getting our first standard 16S data from the PacBio Revio using the Kinnex kit, and will soon be getting 16S-ITS-23S data. I know the FASTQ quality scores are now binned (similar to NovaSeq, I believe). Is anyone aware of issues with these? I'm mainly concerned about problems similar to the NovaSeq binning issues documented elsewhere.

I'm also wondering whether the kit itself (which concatenates the reads prior to sequencing, and then splits the resulting reads after sequencing based on PacBio adapters) may also affect the error profile.
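One way to see how aggressively a run's qualities are binned is to tabulate the distinct Phred scores actually present in a FASTQ. A minimal base-R sketch (the `qual_bins` helper is hypothetical, not part of dada2):

```r
# Hypothetical helper: list the distinct Phred scores in a FASTQ file,
# to check whether quality scores are binned (few distinct values) or full-range.
qual_bins <- function(fastq_path, n_reads = 10000) {
  con <- gzfile(fastq_path, "r")          # gzfile also reads uncompressed files
  on.exit(close(con))
  lines <- readLines(con, n = n_reads * 4)
  quals <- lines[seq(4, length(lines), by = 4)]  # every 4th line is the quality string
  ascii <- utf8ToInt(paste(quals, collapse = ""))
  sort(unique(ascii - 33L))                      # Phred+33 encoding
}
```

A binned Revio run would show only a handful of values here, while full-range output would span up to Q93.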

cjfields commented 4 months ago

It does appear there are issues with learnErrors, at least when using our workflow. We run this script:

filts <- list.files('.', pattern="filtered.fastq.gz", full.names = TRUE)
sample.namesF <- sapply(strsplit(basename(filts), "_"), `[`, 1) # Assumes filename = samplename_XXX.fastq.gz
set.seed(100)

# Learn error rates
errs <- learnErrors(filts, 
    errorEstimationFunction=PacBioErrfun, 
    BAND_SIZE=32, 
    multithread=4)

pdf(paste0("R1",".err.pdf"))
plotErrors(errs, nominalQ=TRUE)
dev.off()

The error appears to be due to the binning. From our Nextflow output:

Command exit status:
  140

Command output:
  213565386 total bases in 147764 reads from 1 samples will be used for learning the error rates.                                   

Command error:
  The max qual score of 93 was not detected. Using standard error fitting.                                                          
  The max qual score of 93 was not detected. Using standard error fitting.                                                          
  The max qual score of 93 was not detected. Using standard error fitting.  

We're working through the workflow using R+dada2 and a small data set to manually diagnose it and see if we can create a workaround.

cjfields commented 4 months ago

It looks like the issue above was a workflow problem (the task timed out). However, we are seeing increased loss of sequence data compared to past runs, and, interestingly, issues with the binning similar to those seen in NovaSeq runs. I suspect this comes from differences in the estimated error rates, as the loss occurs after filterAndTrim. The error profile also looks problematic, with a dip similar to the one seen in NovaSeq runs.

Attached are the results from three random samples out of a larger 174-sample run, plotted with plotQualityProfile (you can ignore the 'aggregate' labels; these are individual samples). There is plenty of read data per sample. I'll see if I can dig up some comparable standard Sequel IIe runs.
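For context, the read loss shows up in the matrix that filterAndTrim() returns (reads.in/reads.out per sample); summarizing percent retained makes unusual loss easy to spot. A sketch with made-up counts (not our real numbers):

```r
# Example matrix in the shape filterAndTrim() returns; the counts below
# are invented for illustration, not real run output.
track <- matrix(c(147764, 102400,
                  133210,  95000), ncol = 2, byrow = TRUE,
                dimnames = list(c("sample12", "sample27"),
                                c("reads.in", "reads.out")))
# Percent of reads retained per sample after filtering
pct_kept <- round(100 * track[, "reads.out"] / track[, "reads.in"], 1)
pct_kept
```

Comparing this summary across runs (Sequel IIe vs. Revio/Kinnex) is a quick way to quantify whether filtering loss has actually increased.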

aggregate-qualities.12.pdf aggregate-qualities.27.pdf aggregate-qualities.77.pdf

This is the error profile:

[image: error-rate profile from plotErrors]

Note the dip.

EDIT: I am a bit concerned about the 'flatness' of that error-frequency line compared to Illumina and older PacBio data, e.g. @benjjneb's past runs: https://benjjneb.github.io/LRASManuscript/LRASms_fecal.html.

cjfields commented 4 months ago

We just tried a run using the alternative error-estimation function provided by @jonalim and tested by @hhollandmoritz here: https://github.com/benjjneb/dada2/issues/1307#issuecomment-957680971 (option 4). This seems to help, though the data are pretty sparse; we're working on newer runs with a larger nbases setting to see if this improves anything.

[image: error-rate profile with the alternative error function]

cjfields commented 4 months ago

This is an updated plot with nbases=1e10

[image: error-rate profile with nbases=1e10]

I should mention that the PacBio workflow steps for Kinnex still appear to use the original PacBioErrfun. We are downloading their Zymo mock samples to see if the pattern above holds:

https://downloads.pacbcloud.com/public/dataset/Kinnex-16S/DATA-ZymoATCCMix-32plex-Revio/2-Demux/fastq_files/

I should also mention that these data (and our in-house data) are from the Revio; I'll update the title.

cjfields commented 4 months ago

Error profiles for the mock community samples from here:

https://downloads.pacbcloud.com/public/dataset/Kinnex-16S/DATA-ZymoATCCMix-32plex-Revio/2-Demux/fastq_files/

The original PacBio model, PacBioErrfun

code:

errs <- learnErrors(dereps, 
    nbases = 1e9, 
    errorEstimationFunction=PacBioErrfun, 
    BAND_SIZE=32, 
    multithread=36,
    verbose=TRUE)

[image: error-rate profile, PacBioErrfun]

Updated version

Using the code from https://github.com/benjjneb/dada2/issues/1307#issuecomment-957680971 (option 4):

suppressPackageStartupMessages(library(dada2))
suppressPackageStartupMessages(library(dplyr))

# see https://github.com/benjjneb/dada2/issues/1307
# and https://github.com/benjjneb/dada2/issues/1892

# THIS IS EXPERIMENTAL
loessErrfun_mod4 <- function(trans) {
  qq <- as.numeric(colnames(trans))
  est <- matrix(0, nrow=0, ncol=length(qq))
  for(nti in c("A","C","G","T")) {
    for(ntj in c("A","C","G","T")) {
      if(nti != ntj) {
        errs <- trans[paste0(nti,"2",ntj),]
        tot <- colSums(trans[paste0(nti,"2",c("A","C","G","T")),])
        rlogp <- log10((errs+1)/tot)  # 1 pseudocount for each err, but if tot=0 will give NA
        rlogp[is.infinite(rlogp)] <- NA
        df <- data.frame(q=qq, errs=errs, tot=tot, rlogp=rlogp)

        # original
        # ###! mod.lo <- loess(rlogp ~ q, df, weights=errs) ###!
        # mod.lo <- loess(rlogp ~ q, df, weights=tot) ###!
        # #        mod.lo <- loess(rlogp ~ q, df)

        # jonalim's solution
        # https://github.com/benjjneb/dada2/issues/938
        mod.lo <- loess(rlogp ~ q, df, weights = log10(tot),degree = 1, span = 0.95)

        pred <- predict(mod.lo, qq)
        maxrli <- max(which(!is.na(pred)))
        minrli <- min(which(!is.na(pred)))
        pred[seq_along(pred)>maxrli] <- pred[[maxrli]]
        pred[seq_along(pred)<minrli] <- pred[[minrli]]
        est <- rbind(est, 10^pred)
      } # if(nti != ntj)
    } # for(ntj in c("A","C","G","T"))
  } # for(nti in c("A","C","G","T"))

  # HACKY
  MAX_ERROR_RATE <- 0.25
  MIN_ERROR_RATE <- 1e-7
  est[est>MAX_ERROR_RATE] <- MAX_ERROR_RATE
  est[est<MIN_ERROR_RATE] <- MIN_ERROR_RATE

  # enforce monotonicity
  # https://github.com/benjjneb/dada2/issues/791
  estorig <- est
  est <- est %>%
    data.frame() %>%
    mutate(across(everything(),
                  ~ case_when(.x < X40 ~ X40,
                              TRUE ~ .x))) %>%
    as.matrix()
  rownames(est) <- rownames(estorig)
  colnames(est) <- colnames(estorig)

  # Expand the err matrix with the self-transition probs
  err <- rbind(1-colSums(est[1:3,]), est[1:3,],
               est[4,], 1-colSums(est[4:6,]), est[5:6,],
               est[7:8,], 1-colSums(est[7:9,]), est[9,],
               est[10:12,], 1-colSums(est[10:12,]))
  rownames(err) <- paste0(rep(c("A","C","G","T"), each=4), "2", c("A","C","G","T"))
  colnames(err) <- colnames(trans)
  # Return
  return(err)
}

filts <- list.files('.', pattern="filtered.fastq.gz", full.names = TRUE)
sample.namesF <- sapply(strsplit(basename(filts), "_"), `[`, 1) # Assumes filename = samplename_XXX.fastq.gz
set.seed(100)

dereps <- derepFastq(filts, n=100000, verbose=TRUE)

# Learn error rates
errs <- learnErrors(dereps, 
    nbases = 1e9,
    errorEstimationFunction=loessErrfun_mod4, 
    BAND_SIZE=32, 
    multithread=36,
    verbose=2)

[image: error-rate profile, loessErrfun_mod4]

Conclusion

At the moment we're going with the second approach (the modified loess error function), but we're about to have a meeting with some PacBio folks who can hopefully give more insight. @benjjneb any thoughts on this?

benjjneb commented 4 months ago

Visually the fit with the updated version looks much better. Without a deep understanding, my surface read is that the updated code might be more effectively working across Q scores with very low representation in the data, and thus the improved visual fitting could be explained by a better match between the algorithm and this sort of binned-Q data.

We are actually looking at PacBio Kinnex/Revio for a current project so this is very timely. I haven't ever seen real data from the platform before, but googling around I just found this: https://downloads.pacbcloud.com/public/dataset/Kinnex-16S/

cjfields commented 4 months ago

> Visually the fit with the updated version looks much better. Without a deep understanding, my surface read is that the updated code might be more effectively working across Q scores with very low representation in the data, and thus the improved visual fitting could be explained by a better match between the algorithm and this sort of binned-Q data.

Thx, but the original implementation was from the Illumina NovaSeq binning tests from @jonalim (which I think came from someone in his group...). That this works across binned data from different technologies is interesting in its own right.

> We are actually looking at PacBio Kinnex/Revio for a current project so this is very timely. I haven't ever seen real data from the platform before, but googling around I just found this: https://downloads.pacbcloud.com/public/dataset/Kinnex-16S/

Yep, that's part of the data I had tried here:

https://github.com/benjjneb/dada2/issues/1892#issuecomment-1961647289

32-sample Zymo mock communities. Similar pattern when we ran it. I noticed they also have Kinnex/Sequel data.

cjfields commented 3 months ago

Small update: based on feedback from our PacBio BI rep, it sounds like SMRTLink can be configured to generate full quality scores for the Revio, though these reportedly top out at Q40 (unlike the Sequel's Q93). It requires a custom run sheet for now; the default is to bin all data.

benjjneb commented 3 months ago

Thanks for updates. We have a call with PacBio about Kinnex tomorrow, and I'm going to reach out to them about their quality score approach going forward too.

cjfields commented 3 months ago

@benjjneb to add to this, we redid a Kinnex run but set it to generate the full quality range. Despite PacBio's indication that these would top out at Q40, they go all the way to Q93 (using the Kinnex correction above):

[image: error-rate profile, full quality range]

We're doing a second run-through, switching back to the standard PacBio model, which may work better in this case.

cjfields commented 3 months ago

As a final addition: if you use the full quality scores with the original PacBio error function in DADA2, it seems to work well:

[image: error-rate profile, full quality scores with PacBioErrfun]

Note that the qualities go all the way to Q93 again. However, the dada() step takes a tremendous amount of time: approximately 3-4 days using 36 cores and 300 GB of memory for a full SMRT Cell and ~200 samples.

jonalim commented 2 months ago

I appreciate this thread a whole lot, @cjfields! I had also noticed the reduced quality range on our first Revio Kinnex run. Unrelated to this, our flowcell was underloaded (sic?) and the median read quality was lower than it should have been, so I was confused as to what was going on.

I'm actually unconvinced that the error model many people have been attributing to me (with loess weights = log10(tot), span = 0.95, degree = 1) is better than the original, because it often yields a fit that underestimates the error rate of Q37 (Illumina) / Q40 (PacBio) bases, but I'll save that for another post.

cjfields commented 2 months ago

@jonalim agreed, and I think this is where having a decent mock data set would be useful to assess this. There are ones available from PacBio which we have tested and which I linked to above.

Also, re: the median quality issue, is it possible the data were not filtered to 99.9% accuracy? We've seen something like this when we were mistakenly given default PacBio output (99%), which did not work.
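For reference, those accuracy thresholds map directly onto minimum Phred read qualities via acc = 1 - 10^(-Q/10), so "99.9%" corresponds to Q30 and the default "99%" to Q20. A quick sketch:

```r
# Expected per-base accuracy implied by a Phred quality score:
# acc = 1 - 10^(-Q/10)
phred_accuracy <- function(q) 1 - 10^(-q / 10)
phred_accuracy(c(20, 30, 40))  # 0.99, 0.999, 0.9999
```

So reads filtered at 99% rather than 99.9% carry roughly ten times the expected error rate, which would plausibly explain the behavior we saw with default output.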