cjfields opened 4 months ago
It does appear there are issues with learnErrors, at least when using our workflow. We run this script:
filts <- list.files('.', pattern="filtered.fastq.gz", full.names = TRUE)
sample.namesF <- sapply(strsplit(basename(filts), "_"), `[`, 1) # Assumes filename = samplename_XXX.fastq.gz
set.seed(100)
# Learn forward error rates
errs <- learnErrors(filts,
                    errorEstimationFunction=PacBioErrfun,
                    BAND_SIZE=32,
                    multithread=4)
pdf(paste0("R1",".err.pdf"))
plotErrors(errs, nominalQ=TRUE)
dev.off()
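One small aside on the script above: the pattern= argument to list.files() is a regular expression, so the unescaped dots in "filtered.fastq.gz" match any character. A quick illustration with made-up filenames (not from any real run):

```r
# list.files(pattern=) takes a regex, so an unescaped dot matches any
# character; anchoring and escaping makes the intent explicit.
files <- c("sampleA_filtered.fastq.gz", "sampleB_filtered_fastqXgz", "notes.txt")

loose  <- grepl("filtered.fastq.gz", files)       # also matches the second name
strict <- grepl("filtered\\.fastq\\.gz$", files)  # only real .fastq.gz files
```

Harmless in practice if the directory only holds filter output, but worth knowing.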
The error appears to be due to the binning. From our Nextflow output:
Command exit status:
140
Command output:
213565386 total bases in 147764 reads from 1 samples will be used for learning the error rates.
Command error:
The max qual score of 93 was not detected. Using standard error fitting.
The max qual score of 93 was not detected. Using standard error fitting.
The max qual score of 93 was not detected. Using standard error fitting.
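For context on that message: PacBioErrfun expects the Sequel maximum of Q93, which in Phred+33 FASTQ encoding is the character '~'; binned Revio output never reaches it, so dada2 falls back to standard error fitting. A quick base-R reminder of the encoding:

```r
# FASTQ quality characters encode Phred scores as ASCII code minus 33.
phred <- function(ch) utf8ToInt(ch) - 33

phred("~")  # 93, the Sequel maximum that PacBioErrfun looks for
phred("I")  # 40, the cap reported for binned Revio output
```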
We're working through the workflow using R+dada2 and a small data set to manually diagnose it and see if we can create a workaround.
It looks like the issue above was a workflow problem (the task timed out). However, we are seeing increased loss of sequence data compared to past runs, and, interestingly, similar issues with the binning step to those seen in NovaSeq runs. I suspect this stems from differences in estimated error, as the problem occurs after filterAndTrim. The error profile also looks problematic, with a dip similar to the one seen in NovaSeq runs.
Attached are the results from three random samples from a larger 174-sample run, using plotQualityProfile (you can ignore the 'aggregate' labels; these are from individual samples). There's plenty of read data per sample. I'll see if I can dig up some comparable standard Sequel IIe runs.
aggregate-qualities.12.pdf aggregate-qualities.27.pdf aggregate-qualities.77.pdf
This is the error profile:
Note the dip.
EDIT: I am a bit concerned about the 'flatness' of that error frequency line compared to Illumina and older PacBio data, like from @benjjneb 's past runs, e.g. https://benjjneb.github.io/LRASManuscript/LRASms_fecal.html.
We just tried a run using the alternative function for error estimation provided by @jonalim and tested by @hhollandmoritz here: https://github.com/benjjneb/dada2/issues/1307#issuecomment-957680971 (option 4). This seems to help, though the data are pretty sparse; we're working on newer runs with more nbases to see if this improves anything.
This is an updated plot with nbases=1e10
I should mention that the PacBio workflow steps for Kinnex still appear to use the original PacBioErrfun. We are downloading their Zymo mock samples to see if the pattern above holds:
I should also mention these data (and our in-house data) are from the Revio, I'll update the title.
Error profiles for the mock community samples from here:
PacBioErrfun code:
errs <- learnErrors(dereps,
                    nbases = 1e9,
                    errorEstimationFunction=PacBioErrfun,
                    BAND_SIZE=32,
                    multithread=36,
                    verbose=TRUE)
Using the code from https://github.com/benjjneb/dada2/issues/1307#issuecomment-957680971 (option 4):
suppressPackageStartupMessages(library(dada2))
suppressPackageStartupMessages(library(dplyr))
# see https://github.com/benjjneb/dada2/issues/1307
# and https://github.com/benjjneb/dada2/issues/1892
# THIS IS EXPERIMENTAL
loessErrfun_mod4 <- function(trans) {
  qq <- as.numeric(colnames(trans))
  est <- matrix(0, nrow=0, ncol=length(qq))
  for(nti in c("A","C","G","T")) {
    for(ntj in c("A","C","G","T")) {
      if(nti != ntj) {
        errs <- trans[paste0(nti,"2",ntj),]
        tot <- colSums(trans[paste0(nti,"2",c("A","C","G","T")),])
        rlogp <- log10((errs+1)/tot) # 1 pseudocount for each err, but if tot=0 will give NA
        rlogp[is.infinite(rlogp)] <- NA
        df <- data.frame(q=qq, errs=errs, tot=tot, rlogp=rlogp)
        # original:
        # mod.lo <- loess(rlogp ~ q, df, weights=tot)
        # mod.lo <- loess(rlogp ~ q, df)
        # jonalim's solution
        # https://github.com/benjjneb/dada2/issues/938
        mod.lo <- loess(rlogp ~ q, df, weights = log10(tot), degree = 1, span = 0.95)
        pred <- predict(mod.lo, qq)
        maxrli <- max(which(!is.na(pred)))
        minrli <- min(which(!is.na(pred)))
        pred[seq_along(pred) > maxrli] <- pred[[maxrli]]
        pred[seq_along(pred) < minrli] <- pred[[minrli]]
        est <- rbind(est, 10^pred)
      } # if(nti != ntj)
    } # for(ntj in c("A","C","G","T"))
  } # for(nti in c("A","C","G","T"))
  # HACKY
  MAX_ERROR_RATE <- 0.25
  MIN_ERROR_RATE <- 1e-7
  est[est > MAX_ERROR_RATE] <- MAX_ERROR_RATE
  est[est < MIN_ERROR_RATE] <- MIN_ERROR_RATE
  # enforce monotonicity
  # https://github.com/benjjneb/dada2/issues/791
  estorig <- est
  est <- est %>%
    data.frame() %>%
    # funs() is defunct in current dplyr; use a lambda instead
    mutate_all(~case_when(. < X40 ~ X40,
                          . >= X40 ~ .)) %>%
    as.matrix()
  rownames(est) <- rownames(estorig)
  colnames(est) <- colnames(estorig)
  # Expand the err matrix with the self-transition probs
  err <- rbind(1-colSums(est[1:3,]), est[1:3,],
               est[4,], 1-colSums(est[4:6,]), est[5:6,],
               est[7:8,], 1-colSums(est[7:9,]), est[9,],
               est[10:12,], 1-colSums(est[10:12,]))
  rownames(err) <- paste0(rep(c("A","C","G","T"), each=4), "2", c("A","C","G","T"))
  colnames(err) <- colnames(trans)
  return(err)
}
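The dplyr monotonicity step in that function has a compact base-R equivalent: floor each row at its own Q40-column value. A minimal sketch on a toy matrix (made-up rates, column names matching the data.frame() output of the function above):

```r
# Toy error-rate matrix: rows = transition types, columns = binned Q scores.
est <- rbind(c(1e-2, 1e-3, 1e-4),
             c(5e-3, 1e-5, 1e-4))
colnames(est) <- c("X20", "X30", "X40")

# Row-wise floor at the Q40 column: any entry below that row's Q40 rate is
# raised to it, equivalent to the mutate_all(case_when(...)) step above.
est_mono <- pmax(est, est[, "X40"])

est_mono[2, "X30"]  # 1e-4: was 1e-5, floored to row 2's Q40 rate
```

This prevents the fitted error rate from dropping below the Q40 estimate at intermediate quality scores.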
filts <- list.files('.', pattern="filtered.fastq.gz", full.names = TRUE)
sample.namesF <- sapply(strsplit(basename(filts), "_"), `[`, 1) # Assumes filename = samplename_XXX.fastq.gz
set.seed(100)
dereps <- derepFastq(filts, n=100000, verbose=TRUE)
# Learn forward error rates
errs <- learnErrors(dereps,
                    nbases = 1e9,
                    errorEstimationFunction=loessErrfun_mod4,
                    BAND_SIZE=32,
                    multithread=36,
                    verbose=2)
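To illustrate why the log10(tot) weighting matters for binned data, here is a self-contained toy fit (the q values, counts, and error rates below are fabricated for illustration, not from any real run):

```r
# Hypothetical binned-Q data: a few quality bins with very unequal read
# counts, and made-up log10 observed error rates.
q     <- c(10, 17, 22, 27, 33, 40)
tot   <- c(200, 1500, 8e4, 5e5, 3e6, 2e7)   # reads per bin; Q40 dominates
rlogp <- c(-1.0, -1.4, -1.9, -2.3, -2.9, -3.5)
df <- data.frame(q = q, tot = tot, rlogp = rlogp)

# Raw-count weights let the dominant Q40 bin swamp the fit; log10 weights
# (the mod4 variant above) keep the sparse low-Q bins influential.
fit_raw <- loess(rlogp ~ q, df, weights = tot, degree = 1, span = 0.95)
fit_log <- loess(rlogp ~ q, df, weights = log10(tot), degree = 1, span = 0.95)

pred_log <- predict(fit_log, data.frame(q = q))
```

With only a handful of distinct Q values, a degree-1 local fit with a wide span is also simply more stable than the default loess settings.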
At the moment we're going with the second approach (loessErrfun_mod4), but we're about to have a meeting with some PacBio folks who can hopefully give more insight. @benjjneb any thoughts on this?
Visually the fit with the updated version looks much better. Without a deep understanding, my surface read is that the updated code might be more effectively working across Q scores with very low representation in the data, and thus the improved visual fitting could be explained by a better match between the algorithm and this sort of binned-Q data.
We are actually looking at PacBio Kinnex/Revio for a current project so this is very timely. I haven't ever seen real data from the platform before, but googling around I just found this: https://downloads.pacbcloud.com/public/dataset/Kinnex-16S/
> Visually the fit with the updated version looks much better. Without a deep understanding, my surface read is that the updated code might be more effectively working across Q scores with very low representation in the data, and thus the improved visual fitting could be explained by a better match between the algorithm and this sort of binned-Q data.
Thx, but the original implementation was from the Illumina NovaSeq binning tests from @jonalim (which I think came from someone in his group...). That this works across binned data from different technologies is interesting on its own.
> We are actually looking at PacBio Kinnex/Revio for a current project so this is very timely. I haven't ever seen real data from the platform before, but googling around I just found this: https://downloads.pacbcloud.com/public/dataset/Kinnex-16S/
Yep, that's part of the data I had tried here:
https://github.com/benjjneb/dada2/issues/1892#issuecomment-1961647289
32-sample Zymo mock communities. Similar pattern when we ran it. I noticed they also have Kinnex/Sequel data.
Small update: based on feedback from our PacBio BI rep, it sounds like SMRT Link can be configured to generate full quality scores for the Revio, though these apparently top out at Q40 regardless (unlike the Sequel's Q93). It requires a custom run sheet for now; the default is to bin all data.
Thanks for updates. We have a call with PacBio about Kinnex tomorrow, and I'm going to reach out to them about their quality score approach going forward too.
@benjjneb to add to this, we redid a Kinnex run but set it to generate the full quality range. Despite PacBio's indication that these would top out at Q40, they go all the way to Q93 (using the Kinnex correction above):
We're doing a second run through switching back to the standard PacBio model, which may work better in this case.
As a final addition, if you use the full quality scores with the original PacBio error function in DADA2, it seems to work well:
Note that the qualities go all the way to 93 again. However, I noticed that the dada() step takes a tremendous amount of time, approximately 3-4 days using 36 cores and 300 GB of memory for a full SMRT Cell and ~200 samples.
I appreciate this thread a whole lot @cjfields! I had also noticed the reduced quality range on our first Revio Kinnex run. Unrelated to this, our flow cell was underloaded and the median read quality was less than it should have been, so I was confused as to what was going on.
I'm actually unconvinced that the error model a lot of people have been attributing to me (with loess weights = log10(tot), span = 0.95, degree = 1) is better than the original, because it often yields a fit that underestimates the error rate of Q37 (Illumina) / Q40 (PacBio) bases, but I'll save that for another post.
@jonalim agreed, and I think this is where having a decent mock data set would be useful to assess this. There are ones available from PacBio which we have tested and which I linked to above.
Also, re: the median quality issue, is it possible the data were not filtered to 99.9% accuracy? We've seen something like this when we were mistakenly given default PacBio output (99%), which did not work.
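For reference, those accuracy thresholds map directly onto Phred scores via the standard error-probability formula, which is easy to check:

```r
# Phred Q to expected per-base error probability: p = 10^(-Q/10).
phred_to_perr <- function(q) 10^(-q / 10)

phred_to_perr(30)  # 0.001 -> 99.9% predicted accuracy (the threshold above)
phred_to_perr(20)  # 0.01  -> 99%, the default PacBio output mentioned above
```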
We're just getting in our first standard 16S data from the PacBio Revio, which uses the Kinnex kit, and will soon be getting 16S-ITS-23S data. I know the FASTQ data are binned now (similar to NovaSeq, I believe). Is anyone aware of issues with these? I'm more concerned about issues similar to the NovaSeq binning issues documented elsewhere:
#791
#1307
I'm also wondering whether the kit itself (which concatenates the reads prior to sequencing, and then splits the resulting reads after sequencing based on PacBio adapters) may also affect the error profile.
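One quick way to see whether a FASTQ has binned quality scores is to tabulate the distinct Phred values it contains; binned output shows only a handful. A minimal sketch (the two records below are fabricated for illustration; point fq at a real file to check your own data):

```r
# Write a tiny fabricated FASTQ, then tabulate its distinct Phred values.
fq <- tempfile(fileext = ".fastq")
writeLines(c("@read1", "ACGT", "+", "II7I",    # 'I' = Q40, '7' = Q22
             "@read2", "ACGT", "+", "-I-I"),   # '-' = Q12
           fq)

lines <- readLines(fq)
quals <- lines[seq(4, length(lines), by = 4)]  # every 4th line is a quality string
qvals <- sort(unique(utf8ToInt(paste(quals, collapse = "")) - 33))
qvals  # only a few distinct values suggests binned output
```

On real data you'd sample a subset of reads rather than load the whole file, but the idea is the same.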