benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/

Binned quality scores and their effect on (non-decreasing) trans rates #1307

Open JacobRPrice opened 3 years ago

JacobRPrice commented 3 years ago

Overview

We are encountering some issues in obtaining appropriate estimates of error rates; I believe these problems stem from our new(er) sequencing facility sending us fastq files containing binned quality scores. Searching the issues, it seems this is not an uncommon situation, but I'm not sure that a single answer or solution has been offered yet.

My observations

Example of forward read quality profiles: [figure: ReadQuality_post-cutadapt_F]

And the corresponding reverse read quality profiles: [figure: ReadQuality_post-cutadapt_R]

Our initial attempts followed a typical pipeline. Instead of pre-learning error rates, we wanted to use the full dataset to avoid bias from which sample(s) were used. This particular dataset is fairly large and (what I'd consider) very deeply sequenced (60,000-300,000 reads per sample). In the fitted error rates, we saw the characteristic dips that are often caused by binned quality scores.

# First attempt approach
dadaFs <- dada(
  derep = filtFs,
  err = NULL,
  selfConsist = TRUE,
  pool = FALSE,
  multithread = parallel::detectCores() - 1, 
  verbose = TRUE
)

A2G is especially nasty! [figure: err1]

In issue #938, @hjruscheweyh laid out a nearly identical problem to the one I'm having, and his colleague @GuillemSalazar graciously offered up the substitute function they used to address/correct this issue [link to comment containing function]. Their approach attempted to enforce monotonicity by changing the arguments of the loess function to use a span of 2 and the log-transformed totals as weights.

I then tested their approach on our data, this time following the tutorial's suggestion to pre-learn error rates. For both learnErrors and dada I passed their loessErrfun_mod to the errorEstimationFunction parameter.

errF <- learnErrors(
  filtFs, 
  multithread = TRUE, 
  errorEstimationFunction = loessErrfun_mod,
  verbose = TRUE
)

dadaFs <- dada(
  derep = filtFs,
  err = errF,
  pool = FALSE,
  errorEstimationFunction = loessErrfun_mod,
  multithread = TRUE, 
  verbose = TRUE
)

The results from learnErrors indicated that this might have done the trick! [figure: learnErrors_errF_firsttrynoselfconsistondada]

But when I pulled the final error model from dadaFs, I was disappointed to see that all of the fitted parameters seemed to have "flattened" (indicating less of a response in error frequency to the quality score). Furthermore, the error frequencies for A2G actually increased with increasing quality scores, which should never happen in reality. [figure: dada_errF_firsttrynoselfconsistondada]

It's possible that these results came from pre-learning the error rates (and any bias that may result) or, perhaps more likely, from not allowing dada to selfConsist. To check this, I reran the same data a second time, passing TRUE to selfConsist.

errF <- learnErrors(
  filtFs, 
  multithread = TRUE, 
  errorEstimationFunction = loessErrfun_mod,
  verbose = TRUE
)
dadaFs <- dada(
  derep = filtFs,
  err = errF,
  selfConsist = TRUE,
  pool = FALSE,
  errorEstimationFunction = loessErrfun_mod,
  multithread = TRUE, 
  verbose = TRUE
)

While the self-consistency loop for dada() terminated before we saw convergence, the fluctuations were small enough that it is probably OK to proceed with the results (or at least check them out).

 [1] 1.432041e+01 9.763790e-01 2.606707e-02 6.451500e-03 9.615610e-04
 [6] 1.049815e-03 5.119809e-04 6.031202e-04 5.129712e-04 6.957217e-04

The error model obtained from learnErrors was identical to the previous trial (same arguments/samples were being passed, so that is expected). [figure: learnErrors_errF]

Much to my dismay, the error model, while not identical, was very similar, with the same increasing error frequency for A2G. [figure: dada_errF]

Illumina Customer Service on Binned Quality Scores

I spent some time on the phone with Illumina, and both of the techs I spoke with indicated that binned quality scores are here to stay and that the quality score values that comprise each bin may vary by Illumina platform and software version. Neither one was able to find me any information about whether binning can be turned off by the sequencing facility itself.

That said, I'm not sure how prevalent the use of binned quality scores is across all sequencing facilities; this particular dataset is the first time I've encountered them. I agree with @wangjiawen2013's comment in #971 that this has the potential to become more and more of an issue.
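For anyone unsure whether their own run is affected, a quick check is to tabulate the distinct quality scores in a fastq file; a minimal sketch, assuming the Bioconductor package ShortRead and a placeholder file path:

library(ShortRead)

# read one fastq and list the distinct per-base quality scores present;
# only a handful of values (rather than a near-continuous range) means binning
fq <- readFastq("sample_R1.fastq.gz")
qmat <- as(quality(fq), "matrix")
sort(unique(as.vector(qmat)))
# e.g. a NovaSeq RTA3 run typically shows only something like: 2 12 23 37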

Questions (Finally, Jake, stop jabbering!)

How severe of a problem are models with increasing error frequencies?

It isn't intuitive that this should be possible (or at least likely). I haven't come across anyone in the issues plotting the final error models (coming out of dada), so I don't know how unique our particular dataset is.
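For reference, pulling and plotting the final model from the dada output looks something like this (a sketch; dadaFs as in the calls above):

library(dada2)

# plotErrors() accepts dada-class objects as well as learnErrors() output
plotErrors(dadaFs[[1]], nominalQ = TRUE)
# the fitted transition-rate matrix itself:
err.final <- getErrors(dadaFs[[1]])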

"Official" recommendations for handling binned quality data

I'd like to suggest, or more properly, encourage, that recommendations for how to handle binned quality data be established. @benjjneb has commented that they're waiting on the appropriate data needed to do this development.

Updating learnErrors() (and hence, dada())

Are there plans in place to add functionality to learnErrors enabling users to enforce monotonicity or otherwise alter its behavior if binned quality scores are present?
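As a sketch of what such an option might look like, here is one generic way to post-process any fitted rate matrix so that rates never increase with quality (enforce_mono is a hypothetical helper, not an existing dada2 function; it uses a running maximum from the high-Q end instead of the Q40-pinning trick discussed below):

enforce_mono <- function(est) {
  # est: 12 x nQ matrix of transition rates, columns ordered by increasing Q;
  # walking from the highest Q down, a cumulative maximum guarantees the
  # returned rates are non-increasing as Q increases
  t(apply(est, 1, function(x) rev(cummax(rev(x)))))
}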

Communication with other developers/users

Lastly, the dada2 plugin for QIIME2 does not offer much in the way of intermediate output or the ability to validate the results. If sequencing data with binned quality scores is truly an issue, unwary users may be getting bad results. This last item is more for investigators/labs who treat QIIME, dada2, or other pipelines as black boxes and are primarily concerned with the output. I understand the Callahan Lab isn't responsible for the development decisions of the QIIME team, but it may be worthwhile to communicate these challenges so that they can be addressed.

Other Issues regarding this question:

Included for others who may be interested as well.

#791 #938 #964 #1083 #1135 #1228

benjjneb commented 3 years ago

Hi Jake, I just wanted to say that your contribution here has not gone unnoticed. In fact it is highly appreciated, especially how you have not only contributed new data on this issue, but collated several existing issues as well.

I consider this the issue for binned quality scores, and will direct future related issues this way. More to come.

JacobRPrice commented 3 years ago

A follow up to my original post:

I ended up trying a couple of additional approaches/variants of a modified loessErrfun to see what the outcomes would look like. Perhaps I have access to more computing power than I do good sense, but I'm a visual learner, and it helped me understand what effects such changes may have.

Two generalizable solutions have been offered so far:

1) altering the arguments passed to loess
2) enforcing monotonicity in the error rates

1) Altering loess

This solution, as discussed above, was offered up by @hjruscheweyh and @GuillemSalazar in Issue #938. As they commented:

We tried to enforce monotonicity by changing the parameters (span=2 and the log-transformed totals as weights) of the loess function used by loessErrfun():

They used:

# Guillem's solution
mod.lo <- loess(rlogp ~ q, df, weights = log10(tot),span = 2)

while the original/standard function uses total counts for weights and the default value (0.75) for the span parameter.

# original
mod.lo <- loess(rlogp ~ q, df, weights=tot)

I was curious to see what the impact of varying just the weights argument was, as opposed to changing the span value at the same time (two things moving at once make it hard to attribute the effects to either one).
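i.e., keeping the default span and swapping only the weights (this is the single changed line; the full function appears in trial 3 below):

# only the weights changed, span left at its default
mod.lo <- loess(rlogp ~ q, df, weights = log10(tot))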

2) Enforcing monotonicity in the error rates

This has been suggested in a couple of locations/issues by @benjjneb and @mikemc, but @hhollandmoritz put together a great issue exploring the effects of using NovaSeq data in Issue #791. In a comment in that thread, they described how they enforced some degree of monotonic behavior by assigning any fitted value lower than the Q40 value to the Q40 value. In the same thread, @cjfields confirmed that they've observed the same behavior and found the approach useful.

# hhollandmortiz
NSnew_errR_out <- getErrors(NSerrR_mon) %>%
  data.frame() %>%
  mutate_all(funs(case_when(. < X40 ~ X40,
                            . >= X40 ~ .))) %>% as.matrix()
rownames(NSnew_errR_out) <- rownames(getErrors(NSerrR_mon))
colnames(NSnew_errR_out) <- colnames(getErrors(NSerrR_mon))

It sounds like it worked for them, so maybe it would work for us too!

Trials

Summarizing the trials I wanted to test (for clarity):

1) alter loess arguments (weights and span) & enforce monotonicity
2) enforce monotonicity
3) alter loess arguments (weights only) & enforce monotonicity

alter loess arguments (weights and span) & enforce monotonicity

Why not try both of the suggested solutions at the same time? Anything worth doing is worth overdoing, amiright?

library(magrittr)
library(dplyr)

loessErrfun_mod <- function(trans) {
  qq <- as.numeric(colnames(trans))
  est <- matrix(0, nrow=0, ncol=length(qq))
  for(nti in c("A","C","G","T")) {
    for(ntj in c("A","C","G","T")) {
      if(nti != ntj) {
        errs <- trans[paste0(nti,"2",ntj),]
        tot <- colSums(trans[paste0(nti,"2",c("A","C","G","T")),])
        rlogp <- log10((errs+1)/tot)  # 1 pseudocount for each err, but if tot=0 will give NA
        rlogp[is.infinite(rlogp)] <- NA
        df <- data.frame(q=qq, errs=errs, tot=tot, rlogp=rlogp)

        # original
        # ###! mod.lo <- loess(rlogp ~ q, df, weights=errs) ###!
        # mod.lo <- loess(rlogp ~ q, df, weights=tot) ###!
        # #        mod.lo <- loess(rlogp ~ q, df)

        # Guillem Salazar's solution
        # https://github.com/benjjneb/dada2/issues/938
        mod.lo <- loess(rlogp ~ q, df, weights = log10(tot), span = 2)

        pred <- predict(mod.lo, qq)
        maxrli <- max(which(!is.na(pred)))
        minrli <- min(which(!is.na(pred)))
        pred[seq_along(pred)>maxrli] <- pred[[maxrli]]
        pred[seq_along(pred)<minrli] <- pred[[minrli]]
        est <- rbind(est, 10^pred)
      } # if(nti != ntj)
    } # for(ntj in c("A","C","G","T"))
  } # for(nti in c("A","C","G","T"))

  # HACKY
  MAX_ERROR_RATE <- 0.25
  MIN_ERROR_RATE <- 1e-7
  est[est>MAX_ERROR_RATE] <- MAX_ERROR_RATE
  est[est<MIN_ERROR_RATE] <- MIN_ERROR_RATE

  # enforce monotonicity
  # https://github.com/benjjneb/dada2/issues/791
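  # note: funs() is deprecated in dplyr >= 0.8; in dplyr >= 1.0 the equivalent
  # of the mutate_all(funs(...)) call below is mutate(across(everything(), ...))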
  estorig <- est
  est <- est %>%
    data.frame() %>%
    mutate_all(funs(case_when(. < X40 ~ X40,
                              . >= X40 ~ .))) %>% as.matrix()
  rownames(est) <- rownames(estorig)
  colnames(est) <- colnames(estorig)

  # Expand the err matrix with the self-transition probs
  err <- rbind(1-colSums(est[1:3,]), est[1:3,],
               est[4,], 1-colSums(est[4:6,]), est[5:6,],
               est[7:8,], 1-colSums(est[7:9,]), est[9,],
               est[10:12,], 1-colSums(est[10:12,]))
  rownames(err) <- paste0(rep(c("A","C","G","T"), each=4), "2", c("A","C","G","T"))
  colnames(err) <- colnames(trans)
  # Return
  return(err)
}

# check what this looks like
errF <- learnErrors(
  filtFs,
  multithread = TRUE,
  errorEstimationFunction = loessErrfun_mod,
  verbose = TRUE
)

[figure: learnErrors_errF_span2_enforceMono]

Taking this approach seemed to work really well. The curves are smooth, without the hooks/dips that we see in the default approach. It also does not have the increasing A2G pattern we saw in @GuillemSalazar's approach (which only changed the loess call).

enforce monotonicity

But how much of that improvement is due to just enforcing monotonicity? Let's revert to the original loess call and enforce monotonicity by adapting @hhollandmoritz's process.

library(magrittr)
library(dplyr)

loessErrfun_mod <- function(trans) {
  qq <- as.numeric(colnames(trans))
  est <- matrix(0, nrow=0, ncol=length(qq))
  for(nti in c("A","C","G","T")) {
    for(ntj in c("A","C","G","T")) {
      if(nti != ntj) {
        errs <- trans[paste0(nti,"2",ntj),]
        tot <- colSums(trans[paste0(nti,"2",c("A","C","G","T")),])
        rlogp <- log10((errs+1)/tot)  # 1 pseudocount for each err, but if tot=0 will give NA
        rlogp[is.infinite(rlogp)] <- NA
        df <- data.frame(q=qq, errs=errs, tot=tot, rlogp=rlogp)

        # original
        # ###! mod.lo <- loess(rlogp ~ q, df, weights=errs) ###!
        mod.lo <- loess(rlogp ~ q, df, weights=tot) ###!
        # #        mod.lo <- loess(rlogp ~ q, df)

        # Guillem Salazar's solution
        # https://github.com/benjjneb/dada2/issues/938
        # mod.lo <- loess(rlogp ~ q, df, weights = log10(tot), span = 2)

        pred <- predict(mod.lo, qq)
        maxrli <- max(which(!is.na(pred)))
        minrli <- min(which(!is.na(pred)))
        pred[seq_along(pred)>maxrli] <- pred[[maxrli]]
        pred[seq_along(pred)<minrli] <- pred[[minrli]]
        est <- rbind(est, 10^pred)
      } # if(nti != ntj)
    } # for(ntj in c("A","C","G","T"))
  } # for(nti in c("A","C","G","T"))

  # HACKY
  MAX_ERROR_RATE <- 0.25
  MIN_ERROR_RATE <- 1e-7
  est[est>MAX_ERROR_RATE] <- MAX_ERROR_RATE
  est[est<MIN_ERROR_RATE] <- MIN_ERROR_RATE

  # enforce monotonicity
  # https://github.com/benjjneb/dada2/issues/791
  estorig <- est
  est <- est %>%
    data.frame() %>%
    mutate_all(funs(case_when(. < X40 ~ X40,
                              . >= X40 ~ .))) %>% as.matrix()
  rownames(est) <- rownames(estorig)
  colnames(est) <- colnames(estorig)

  # Expand the err matrix with the self-transition probs
  err <- rbind(1-colSums(est[1:3,]), est[1:3,],
               est[4,], 1-colSums(est[4:6,]), est[5:6,],
               est[7:8,], 1-colSums(est[7:9,]), est[9,],
               est[10:12,], 1-colSums(est[10:12,]))
  rownames(err) <- paste0(rep(c("A","C","G","T"), each=4), "2", c("A","C","G","T"))
  colnames(err) <- colnames(trans)
  # Return
  return(err)
}

# check what this looks like
errF <- learnErrors(
  filtFs,
  multithread = TRUE,
  errorEstimationFunction = loessErrfun_mod,
  verbose = TRUE
)

[figure: learnErrors_errF_enforceMono]

Without altering the loess arguments, we still see some sharp peaks and dips, but they're much smaller in magnitude, so some degree of smoothing is offered by this approach. In a couple of cases there doesn't appear to be much of a response to quality score (A2G, T2C), or at least for my data, the model is not doing a good job of estimating the parameters.

alter loess arguments (weights only) & enforce monotonicity

Can we improve upon the previous trial by modifying the loess weights argument (in addition to enforcing monotonicity)?

library(magrittr)
library(dplyr)

loessErrfun_mod <- function(trans) {
  qq <- as.numeric(colnames(trans))
  est <- matrix(0, nrow=0, ncol=length(qq))
  for(nti in c("A","C","G","T")) {
    for(ntj in c("A","C","G","T")) {
      if(nti != ntj) {
        errs <- trans[paste0(nti,"2",ntj),]
        tot <- colSums(trans[paste0(nti,"2",c("A","C","G","T")),])
        rlogp <- log10((errs+1)/tot)  # 1 pseudocount for each err, but if tot=0 will give NA
        rlogp[is.infinite(rlogp)] <- NA
        df <- data.frame(q=qq, errs=errs, tot=tot, rlogp=rlogp)

        # original
        # ###! mod.lo <- loess(rlogp ~ q, df, weights=errs) ###!
        # mod.lo <- loess(rlogp ~ q, df, weights=tot) ###!
        # #        mod.lo <- loess(rlogp ~ q, df)

        # Guillem Salazar's solution
        # https://github.com/benjjneb/dada2/issues/938
        # mod.lo <- loess(rlogp ~ q, df, weights = log10(tot), span = 2)

        # only change the weights
        mod.lo <- loess(rlogp ~ q, df, weights = log10(tot))

        pred <- predict(mod.lo, qq)
        maxrli <- max(which(!is.na(pred)))
        minrli <- min(which(!is.na(pred)))
        pred[seq_along(pred)>maxrli] <- pred[[maxrli]]
        pred[seq_along(pred)<minrli] <- pred[[minrli]]
        est <- rbind(est, 10^pred)
      } # if(nti != ntj)
    } # for(ntj in c("A","C","G","T"))
  } # for(nti in c("A","C","G","T"))

  # HACKY
  MAX_ERROR_RATE <- 0.25
  MIN_ERROR_RATE <- 1e-7
  est[est>MAX_ERROR_RATE] <- MAX_ERROR_RATE
  est[est<MIN_ERROR_RATE] <- MIN_ERROR_RATE

  # enforce monotonicity
  # https://github.com/benjjneb/dada2/issues/791
  estorig <- est
  est <- est %>%
    data.frame() %>%
    mutate_all(funs(case_when(. < X40 ~ X40,
                              . >= X40 ~ .))) %>% as.matrix()
  rownames(est) <- rownames(estorig)
  colnames(est) <- colnames(estorig)

  # Expand the err matrix with the self-transition probs
  err <- rbind(1-colSums(est[1:3,]), est[1:3,],
               est[4,], 1-colSums(est[4:6,]), est[5:6,],
               est[7:8,], 1-colSums(est[7:9,]), est[9,],
               est[10:12,], 1-colSums(est[10:12,]))
  rownames(err) <- paste0(rep(c("A","C","G","T"), each=4), "2", c("A","C","G","T"))
  colnames(err) <- colnames(trans)
  # Return
  return(err)
}

# check what this looks like
errF <- learnErrors(
  filtFs,
  multithread = TRUE,
  errorEstimationFunction = loessErrfun_mod,
  verbose = TRUE
)

[figure: learnErrors_errF_weights_enforceMono]

Without adjusting the span argument we still have dips and peaks, although they are much less severe; this is most likely the effect of altering the weights argument. Furthermore, enforcing decreasing error rates did its job as expected.

Decisions, Decisions (for now)

I think that for this particular dataset (I can't emphasize that enough), the combined approach (Trial 1 above) works best. I am currently running the dada2 pipeline using that version, and I'm looking forward to seeing what the results look like; I will make sure to post an update once it has completed.

Request for feedback

If anyone has any feedback on these preliminary trials, or my rationale, I'd greatly appreciate it. Analysis paralysis can be quite real!

Jake

cjfields commented 3 years ago

@JacobRPrice really impressive work! Would be great to get @benjjneb's thoughts, though this argues that one could at least try different approaches with some follow-up QA to see what works best.

benjjneb commented 3 years ago

Hi Jake, We are digging into this in more detail in collaboration with a student here at NC State, @Y-Q-Si (email address ysi4@ncsu.edu), who is interested in the error model for binned quality scores. Would you mind sharing the particular dataset that generated the plots above, either on this thread or to her email address, so we can start doing some testing and evaluation of our own? Thank you in advance.

JacobRPrice commented 3 years ago

Hi @benjjneb and @Y-Q-Si,

I sent an email briefly describing the data and shared a Google Drive link to the raw data as a tarball. Please let me know if you have any questions or if I can provide clarification.


Outline of QC protocol:

cutadapt was used to remove primers

FWD <- "GTGYCAGCMGCCGCGGTAA"
REV <- "GGACTACNVGGGTWTCTAAT"

After removing the primers, we carried out QC using the following filterAndTrim function call:

out.filtN_cut_filt <- filterAndTrim(
  fwd=cutFs,
  filt=filtFs,
  rev=cutRs,
  filt.rev=filtRs,
  truncLen=c(200,200),
  maxN=0,
  maxEE=2,
  truncQ=2,
  compress=TRUE,
  verbose=TRUE,
  multithread = parallel::detectCores() - 1
)

I can send you my exact scripts if you need more detailed information.

Jake

jonalim commented 3 years ago

@JacobRPrice @benjjneb I'm a bioinformatics analyst from the University of Maryland. I've experimented with error modeling on a NovaSeq dataset using the following loess function, which weights by totals and uses a degree of 1. I figured that a degree of 1 would force the underlying fits to be linear, thus "discouraging" the final model from changing direction. (The resulting models still changed direction, but oh well.) I didn't enforce monotonicity, but I think it would help to do so, to remove the dips in the models during learnErrors().

mod.lo <- loess(rlogp ~ q, df, weights=tot, degree=1, span = 0.95)
[figures: errF (weights=tot, degree=1, span=0.95) error models from learnErrors(nbases=1e8, ...), pre- and post-dada]

My understanding is that the "totals" here are highest for the three quality scores that actually appear in my NovaSeq reads: 11, 23, and 37. Allowing those three datapoints to dominate the loess fit seems to maintain an overall downward slope in the post-DADA fits, which seems desirable to me.
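To illustrate the difference the weighting makes, a toy comparison with made-up per-Q totals (the numbers here are hypothetical, just to show how log-weights flatten the dominance of the heavily populated bins):

tot <- c(1e3, 5e6, 2e7)                 # hypothetical totals for Q = 11, 23, 37
round(tot / max(tot), 5)                # raw weights:   0.00005 0.25 1.00
round(log10(tot) / max(log10(tot)), 2)  # log10 weights: 0.41 0.92 1.00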

I've also tried weighting by log(tot). This yielded a tighter fit, which makes sense because the weights have less variance. In some cases though (G2A, T2C, C2T post-DADA), the fit had error frequency increasing with the consensus quality score.

mod.lo <- loess(rlogp ~ q, df, weights=log(tot), degree=1, span = 0.95)
[figures: errF (weights=log(tot), degree=1, span=0.95) error models from learnErrors(nbases=1e8, ...), pre- and post-dada]

Unfortunately, I'm not at liberty to share my own dataset, but like yourself, I am curious to find out if these parameters are generalizable to other datasets.

Here are the results when I used @JacobRPrice's "Trial 1" model function on my dataset. [figures: errF (weights=log10(tot), span=2, monotonicity enforced) error models from learnErrors(nbases=1e8, ...), pre- and post-dada]

benjjneb commented 3 years ago

Thanks @jonalim !

Pinging @Y-Q-Si

Y-Q-Si commented 3 years ago

Hi @jonalim, Thanks for your comments. I am running some tests with Jake's data. I am wondering if you would mind sharing the "trans" matrix with me? It would be very helpful if I could run the same tests on both matrices. Thank you in advance.

Andreas-Bio commented 3 years ago

I have the same issue. I do not have binned quality scores.

https://github.com/benjjneb/dada2/issues/1343

The solution of @JacobRPrice's "Trial 1" does not work for me.


errF <- learnErrors(inputfiles, multithread = T, randomize = T, nbases = 1e20, errorEstimationFunction = loessErrfun_mod)
404252934 total bases in 7873078 reads from 180 samples will be used for learning the error rates.
Error rates could not be estimated (this is usually because of very few reads).
Error in getErrors(err, enforce = TRUE) : Error matrix is NULL.

jonalim commented 3 years ago

@Y-Q-Si Certainly. Here's a link to an RDS and the script used to create it. The RDS contains a list of three lists, one for each error model function. Each internal list contains:

https://drive.google.com/drive/folders/11GThjrUKdcL_n64EbXdxf1Ww_gXqPE87?usp=sharing

JacobRPrice commented 3 years ago

@andzandz11 The error you're seeing is most likely due to a single (or perhaps a few) unique sequence(s) within your dataset. In your issue (#1343), your quality profile plots show that you have a tiny fraction of very long reads (in comparison with the others in the dataset). You may want to look further into your filtering/trimming parameters and make sure that the read length distribution is more reasonable.
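For example, something like this (a sketch, assuming the ShortRead package) will show whether a small tail of outlier lengths remains after trimming:

library(ShortRead)

# tabulate post-filter read lengths for the first sample; a clean
# truncLen run should show essentially a single length
table(width(readFastq(filtFs[[1]])))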

jeffkimbrel commented 3 years ago

@andzandz11 - you are also trying to use 1e20 bp, which seems an unreasonably large number.

Overall, are there any solid recommendations yet? We've received word from our sequencing center that they are about to run a few hundred of our samples on a NovaSeq (instead of the usual MiSeq), so we are about to be in the same situation of fitting error models against binned quality scores.

Also, I first ran up against this issue with Illumina RTA3 not with amplicon data but with variant calling, and the need there to distinguish between low-frequency variants and sequencing error. I wonder if the folks developing those tools, such as LoFreq (@CSB5) or similar, have some clever ways to use the bins.

JacobRPrice commented 2 years ago

I'll second Jeff's request. I'd love to hear any recommendations official or otherwise. We have several NovaSeq datasets that are currently tabled.

hhollandmoritz commented 2 years ago

Hi All!

Just rejoining this conversation after a long hiatus. First @JacobRPrice thank you for all the work you've been doing on this. It's extremely helpful. I will go ahead and try your three methods on a recent NovaSeq soils dataset and report back.

Second, I am now working with a new sequencing center that due to pandemic-related issues will be sequencing all amplicon data only on NovaSeq, so this challenge has recently become much more pressing on my end.

The good news is that in my group we may actually have the resources to create a dataset with the same samples sequenced on HiSeq and resequenced on NovaSeq. @benjjneb are you still looking for a comparison dataset for this? If so, I can see if we can push that project forward. If not, is the comparison dataset you're currently using public?

jcmcnch commented 2 years ago

Hi all, we're encountering some issues in two recent NovaSeq runs so will be jumping on this wagon too! Just FYI @hhollandmoritz @JacobRPrice @benjjneb @Y-Q-Si: we have some 16S and 18S seawater microbial community mocks of known composition that are not denoising well from two recent NovaSeq runs, and they've been sequenced many times before (HiSeq, MiSeq, you name it). In these recent NovaSeq runs we're getting lots of 1-mismatches to the mocks that represent a good fraction of the total reads (up to something like 8-10% of the mock, which is a bit disturbing and not at all usual), and it seems likely that it's related to the issues discussed here.

@JacobRPrice I also agree this is a pretty big issue for qiime2 users - we normally use qiime2 and have had good luck with default DADA2 parameters before, but with our recent issues there's really no way to troubleshoot natively in qiime2. I am working with someone in the lab who is much more savvy with R than I am who will help implement the code, and we'll report back after trying the "combined" solution discussed above.

BTW if there's interest in others getting the mock data to play with I could ask Jed if he'd mind us sharing... In any case, thanks everyone for all the really useful info and suggestions!

hhollandmoritz commented 2 years ago

So, time to report back. Like @JacobRPrice I ran the three options, but I also added a 4th trial based on @jonalim's method. The samples are from arctic soils, so it's a pretty high-diversity environment. I'm not quite sure I understand the differences between the errors generated from the pre-learning step and from after dada2, but I'd be happy to provide those two if anyone thinks they'd be helpful.

Option 1: Alter the weights and span in loess, and enforce monotonicity

library(magrittr) # for %>%
library(dplyr)    # for mutate_all()/funs()

loessErrfun_mod1 <- function(trans) {
  qq <- as.numeric(colnames(trans))
  est <- matrix(0, nrow=0, ncol=length(qq))
  for(nti in c("A","C","G","T")) {
    for(ntj in c("A","C","G","T")) {
      if(nti != ntj) {
        errs <- trans[paste0(nti,"2",ntj),]
        tot <- colSums(trans[paste0(nti,"2",c("A","C","G","T")),])
        rlogp <- log10((errs+1)/tot)  # 1 pseudocount for each err, but if tot=0 will give NA
        rlogp[is.infinite(rlogp)] <- NA
        df <- data.frame(q=qq, errs=errs, tot=tot, rlogp=rlogp)

        # original
        # ###! mod.lo <- loess(rlogp ~ q, df, weights=errs) ###!
        # mod.lo <- loess(rlogp ~ q, df, weights=tot) ###!
        # #        mod.lo <- loess(rlogp ~ q, df)

        # Guillem Salazar's solution
        # https://github.com/benjjneb/dada2/issues/938
        mod.lo <- loess(rlogp ~ q, df, weights = log10(tot), span = 2)

        pred <- predict(mod.lo, qq)
        maxrli <- max(which(!is.na(pred)))
        minrli <- min(which(!is.na(pred)))
        pred[seq_along(pred)>maxrli] <- pred[[maxrli]]
        pred[seq_along(pred)<minrli] <- pred[[minrli]]
        est <- rbind(est, 10^pred)
      } # if(nti != ntj)
    } # for(ntj in c("A","C","G","T"))
  } # for(nti in c("A","C","G","T"))

  # HACKY
  MAX_ERROR_RATE <- 0.25
  MIN_ERROR_RATE <- 1e-7
  est[est>MAX_ERROR_RATE] <- MAX_ERROR_RATE
  est[est<MIN_ERROR_RATE] <- MIN_ERROR_RATE

  # enforce monotonicity
  # https://github.com/benjjneb/dada2/issues/791
  estorig <- est
  est <- est %>%
    data.frame() %>%
    mutate_all(funs(case_when(. < X40 ~ X40,
                              . >= X40 ~ .))) %>% as.matrix()
  rownames(est) <- rownames(estorig)
  colnames(est) <- colnames(estorig)

  # Expand the err matrix with the self-transition probs
  err <- rbind(1-colSums(est[1:3,]), est[1:3,],
               est[4,], 1-colSums(est[4:6,]), est[5:6,],
               est[7:8,], 1-colSums(est[7:9,]), est[9,],
               est[10:12,], 1-colSums(est[10:12,]))
  rownames(err) <- paste0(rep(c("A","C","G","T"), each=4), "2", c("A","C","G","T"))
  colnames(err) <- colnames(trans)
  # Return
  return(err)
}

# check what this looks like
errF_1 <- learnErrors(
  filtFs,
  multithread = TRUE,
  nbases = 1e10,
  errorEstimationFunction = loessErrfun_mod1,
  verbose = TRUE
)
## 287310600 total bases in 1276936 reads from 7 samples will be used for learning the error rates.
## Initializing error rates to maximum possible estimate.
## selfConsist step 1 .......
##    selfConsist step 2
##    selfConsist step 3
##    selfConsist step 4
##    selfConsist step 5
##    selfConsist step 6
##    selfConsist step 7
##    selfConsist step 8
##    selfConsist step 9
##    selfConsist step 10
errR_1 <- learnErrors(
  filtRs,
  multithread = TRUE,
  nbases = 1e10,
  errorEstimationFunction = loessErrfun_mod1,
  verbose = TRUE
)
## 280925920 total bases in 1276936 reads from 7 samples will be used for learning the error rates.
## Initializing error rates to maximum possible estimate.
## selfConsist step 1 .......
##    selfConsist step 2
##    selfConsist step 3
##    selfConsist step 4
##    selfConsist step 5
##    selfConsist step 6
##    selfConsist step 7
##    selfConsist step 8
##    selfConsist step 9
##    selfConsist step 10

Option 2: Only enforce monotonicity

loessErrfun_mod2 <- function(trans) {
  qq <- as.numeric(colnames(trans))
  est <- matrix(0, nrow=0, ncol=length(qq))
  for(nti in c("A","C","G","T")) {
    for(ntj in c("A","C","G","T")) {
      if(nti != ntj) {
        errs <- trans[paste0(nti,"2",ntj),]
        tot <- colSums(trans[paste0(nti,"2",c("A","C","G","T")),])
        rlogp <- log10((errs+1)/tot)  # 1 pseudocount for each err, but if tot=0 will give NA
        rlogp[is.infinite(rlogp)] <- NA
        df <- data.frame(q=qq, errs=errs, tot=tot, rlogp=rlogp)

        # original
        # ###! mod.lo <- loess(rlogp ~ q, df, weights=errs) ###!
        mod.lo <- loess(rlogp ~ q, df, weights=tot) ###!
        # #        mod.lo <- loess(rlogp ~ q, df)

        # Guillem Salazar's solution
        # https://github.com/benjjneb/dada2/issues/938
        # mod.lo <- loess(rlogp ~ q, df, weights = log10(tot), span = 2)

        pred <- predict(mod.lo, qq)
        maxrli <- max(which(!is.na(pred)))
        minrli <- min(which(!is.na(pred)))
        pred[seq_along(pred)>maxrli] <- pred[[maxrli]]
        pred[seq_along(pred)<minrli] <- pred[[minrli]]
        est <- rbind(est, 10^pred)
      } # if(nti != ntj)
    } # for(ntj in c("A","C","G","T"))
  } # for(nti in c("A","C","G","T"))

  # HACKY
  MAX_ERROR_RATE <- 0.25
  MIN_ERROR_RATE <- 1e-7
  est[est>MAX_ERROR_RATE] <- MAX_ERROR_RATE
  est[est<MIN_ERROR_RATE] <- MIN_ERROR_RATE

  # enforce monotonicity
  # https://github.com/benjjneb/dada2/issues/791
  estorig <- est
  est <- est %>%
    data.frame() %>%
    mutate_all(funs(case_when(. < X40 ~ X40,
                              . >= X40 ~ .))) %>% as.matrix()
  rownames(est) <- rownames(estorig)
  colnames(est) <- colnames(estorig)

  # Expand the err matrix with the self-transition probs
  err <- rbind(1-colSums(est[1:3,]), est[1:3,],
               est[4,], 1-colSums(est[4:6,]), est[5:6,],
               est[7:8,], 1-colSums(est[7:9,]), est[9,],
               est[10:12,], 1-colSums(est[10:12,]))
  rownames(err) <- paste0(rep(c("A","C","G","T"), each=4), "2", c("A","C","G","T"))
  colnames(err) <- colnames(trans)
  # Return
  return(err)
}

# check what this looks like
errF_2 <- learnErrors(
  filtFs,
  multithread = TRUE,
  nbases = 1e10,
  errorEstimationFunction = loessErrfun_mod2,
  verbose = TRUE
)
## 287310600 total bases in 1276936 reads from 7 samples will be used for learning the error rates.
## Initializing error rates to maximum possible estimate.
## selfConsist step 1 .......
##    selfConsist step 2
##    selfConsist step 3
##    selfConsist step 4
##    selfConsist step 5
##    selfConsist step 6
##    selfConsist step 7
##    selfConsist step 8
## Convergence after  8  rounds.

errR_2 <- learnErrors(
  filtRs,
  multithread = TRUE,
  nbases = 1e10,
  errorEstimationFunction = loessErrfun_mod2,
  verbose = TRUE
)
## 280925920 total bases in 1276936 reads from 7 samples will be used for learning the error rates.
## Initializing error rates to maximum possible estimate.
## selfConsist step 1 .......
##    selfConsist step 2
##    selfConsist step 3
##    selfConsist step 4
##    selfConsist step 5
##    selfConsist step 6
##    selfConsist step 7
## Convergence after  7  rounds.

Option 3: Only alter loess weights, and enforce monotonicity

loessErrfun_mod3 <- function(trans) {
  qq <- as.numeric(colnames(trans))
  est <- matrix(0, nrow=0, ncol=length(qq))
  for(nti in c("A","C","G","T")) {
    for(ntj in c("A","C","G","T")) {
      if(nti != ntj) {
        errs <- trans[paste0(nti,"2",ntj),]
        tot <- colSums(trans[paste0(nti,"2",c("A","C","G","T")),])
        rlogp <- log10((errs+1)/tot)  # 1 pseudocount for each err, but if tot=0 will give NA
        rlogp[is.infinite(rlogp)] <- NA
        df <- data.frame(q=qq, errs=errs, tot=tot, rlogp=rlogp)

        # original
        # ###! mod.lo <- loess(rlogp ~ q, df, weights=errs) ###!
        # mod.lo <- loess(rlogp ~ q, df, weights=tot) ###!
        # #        mod.lo <- loess(rlogp ~ q, df)

        # Guillem Salazar's solution
        # https://github.com/benjjneb/dada2/issues/938
        # mod.lo <- loess(rlogp ~ q, df, weights = log10(tot), span = 2)

        # only change the weights
        mod.lo <- loess(rlogp ~ q, df, weights = log10(tot))

        pred <- predict(mod.lo, qq)
        maxrli <- max(which(!is.na(pred)))
        minrli <- min(which(!is.na(pred)))
        pred[seq_along(pred)>maxrli] <- pred[[maxrli]]
        pred[seq_along(pred)<minrli] <- pred[[minrli]]
        est <- rbind(est, 10^pred)
      } # if(nti != ntj)
    } # for(ntj in c("A","C","G","T"))
  } # for(nti in c("A","C","G","T"))

  # HACKY
  MAX_ERROR_RATE <- 0.25
  MIN_ERROR_RATE <- 1e-7
  est[est>MAX_ERROR_RATE] <- MAX_ERROR_RATE
  est[est<MIN_ERROR_RATE] <- MIN_ERROR_RATE

  # enforce monotonicity
  # https://github.com/benjjneb/dada2/issues/791
  estorig <- est
  est <- est %>%
    data.frame() %>%
    mutate_all(funs(case_when(. < X40 ~ X40,
                              . >= X40 ~ .))) %>% as.matrix()
  rownames(est) <- rownames(estorig)
  colnames(est) <- colnames(estorig)

  # Expand the err matrix with the self-transition probs
  err <- rbind(1-colSums(est[1:3,]), est[1:3,],
               est[4,], 1-colSums(est[4:6,]), est[5:6,],
               est[7:8,], 1-colSums(est[7:9,]), est[9,],
               est[10:12,], 1-colSums(est[10:12,]))
  rownames(err) <- paste0(rep(c("A","C","G","T"), each=4), "2", c("A","C","G","T"))
  colnames(err) <- colnames(trans)
  # Return
  return(err)
}

# check what this looks like
errF_3 <- learnErrors(
  filtFs,
  multithread = TRUE,
  nbases = 1e10,
  errorEstimationFunction = loessErrfun_mod3,
  verbose = TRUE
)
## 287310600 total bases in 1276936 reads from 7 samples will be used for learning the error rates.
## Initializing error rates to maximum possible estimate.
## selfConsist step 1 .......
##    selfConsist step 2
##    selfConsist step 3
##    selfConsist step 4
##    selfConsist step 5
##    selfConsist step 6
##    selfConsist step 7
##    selfConsist step 8
##    selfConsist step 9
## Convergence after  9  rounds.

# check what this looks like
errR_3 <- learnErrors(
  filtRs,
  multithread = TRUE,
  nbases = 1e10,
  errorEstimationFunction = loessErrfun_mod3,
  verbose = TRUE
)
## 280925920 total bases in 1276936 reads from 7 samples will be used for learning the error rates.
## Initializing error rates to maximum possible estimate.
## selfConsist step 1 .......
##    selfConsist step 2
##    selfConsist step 3
##    selfConsist step 4
##    selfConsist step 5
##    selfConsist step 6
##    selfConsist step 7
##    selfConsist step 8
##    selfConsist step 9
##    selfConsist step 10

Option 4: Alter loess function arguments (weights, span, and degree), and enforce monotonicity.

loessErrfun_mod4 <- function(trans) {
  qq <- as.numeric(colnames(trans))
  est <- matrix(0, nrow=0, ncol=length(qq))
  for(nti in c("A","C","G","T")) {
    for(ntj in c("A","C","G","T")) {
      if(nti != ntj) {
        errs <- trans[paste0(nti,"2",ntj),]
        tot <- colSums(trans[paste0(nti,"2",c("A","C","G","T")),])
        rlogp <- log10((errs+1)/tot)  # 1 pseudocount for each err, but if tot=0 will give NA
        rlogp[is.infinite(rlogp)] <- NA
        df <- data.frame(q=qq, errs=errs, tot=tot, rlogp=rlogp)

        # original
        # ###! mod.lo <- loess(rlogp ~ q, df, weights=errs) ###!
        # mod.lo <- loess(rlogp ~ q, df, weights=tot) ###!
        # #        mod.lo <- loess(rlogp ~ q, df)

        # jonalim's solution (from his comment above in this issue, #1307)
        # https://github.com/benjjneb/dada2/issues/1307
        mod.lo <- loess(rlogp ~ q, df, weights = log10(tot), degree = 1, span = 0.95)

        pred <- predict(mod.lo, qq)
        maxrli <- max(which(!is.na(pred)))
        minrli <- min(which(!is.na(pred)))
        pred[seq_along(pred)>maxrli] <- pred[[maxrli]]
        pred[seq_along(pred)<minrli] <- pred[[minrli]]
        est <- rbind(est, 10^pred)
      } # if(nti != ntj)
    } # for(ntj in c("A","C","G","T"))
  } # for(nti in c("A","C","G","T"))

  # HACKY
  MAX_ERROR_RATE <- 0.25
  MIN_ERROR_RATE <- 1e-7
  est[est>MAX_ERROR_RATE] <- MAX_ERROR_RATE
  est[est<MIN_ERROR_RATE] <- MIN_ERROR_RATE

  # enforce monotonicity
  # https://github.com/benjjneb/dada2/issues/791
  estorig <- est
  est <- est %>%
    data.frame() %>%
    mutate_all(funs(case_when(. < X40 ~ X40,
                              . >= X40 ~ .))) %>% as.matrix()
  rownames(est) <- rownames(estorig)
  colnames(est) <- colnames(estorig)

  # Expand the err matrix with the self-transition probs
  err <- rbind(1-colSums(est[1:3,]), est[1:3,],
               est[4,], 1-colSums(est[4:6,]), est[5:6,],
               est[7:8,], 1-colSums(est[7:9,]), est[9,],
               est[10:12,], 1-colSums(est[10:12,]))
  rownames(err) <- paste0(rep(c("A","C","G","T"), each=4), "2", c("A","C","G","T"))
  colnames(err) <- colnames(trans)
  # Return
  return(err)
}

# check what this looks like
errF_4 <- learnErrors(
  filtFs,
  multithread = TRUE,
  nbases = 1e10,
  errorEstimationFunction = loessErrfun_mod4,
  verbose = TRUE
)
## 287310600 total bases in 1276936 reads from 7 samples will be used for learning the error rates.
## Initializing error rates to maximum possible estimate.
## selfConsist step 1 .......
##    selfConsist step 2
##    selfConsist step 3
##    selfConsist step 4
##    selfConsist step 5
##    selfConsist step 6
##    selfConsist step 7
##    selfConsist step 8
##    selfConsist step 9
##    selfConsist step 10
errR_4 <- learnErrors(
  filtRs,
  multithread = TRUE,
  nbases = 1e10,
  errorEstimationFunction = loessErrfun_mod4,
  verbose = TRUE
)
## 280925920 total bases in 1276936 reads from 7 samples will be used for learning the error rates.
## Initializing error rates to maximum possible estimate.
## selfConsist step 1 .......
##    selfConsist step 2
##    selfConsist step 3
##    selfConsist step 4
##    selfConsist step 5
##    selfConsist step 6
##    selfConsist step 7
##    selfConsist step 8
## Convergence after  8  rounds.

Now for the side-by-side plots:

Forward Reads:

Option 1: Alter the weights and span in loess, and enforce monotonicity [figure: errF_plot1]

Option 2: Monotonicity only [figure: errF_plot2]

Option 3: Only alter loess weights, and enforce monotonicity [figure: errF_plot3]

Option 4: Alter loess arguments (weights, span, and degree), and enforce monotonicity [figure: errF_plot4]

Reverse Reads:

Option 1: Alter the weights and span in loess, and enforce monotonicity [figure: errR_plot1]

Option 2: Monotonicity only [figure: errR_plot2]

Option 3: Only alter loess weights, and enforce monotonicity [figure: errR_plot3]

Option 4: Alter loess arguments (weights, span, and degree), and enforce monotonicity [figure: errR_plot4]

Conclusions:

So far @jonalim's solution seems the best for this data. All methods are still not doing great with the A2G and T2C transitions, but there also just doesn't seem to be a lot of data for those transitions. I'd welcome anyone else's input on whether they think the Option 4 plot is acceptable or still not trustworthy enough to draw conclusions from.

We're trying to develop a lab pipeline, and seeing as it looks like we'll be dealing with NovaSeq data going forward, my current recommendation for users of NovaSeq data is going to be to try out all four methods of error learning and choose the model whose plots look best for the data.
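For example, something like this (a sketch reusing the four functions defined above):

# learn forward-read error rates under each candidate error function,
# then compare the fits visually
mods <- list(opt1 = loessErrfun_mod1, opt2 = loessErrfun_mod2,
             opt3 = loessErrfun_mod3, opt4 = loessErrfun_mod4)
errF_all <- lapply(mods, function(f) {
  learnErrors(filtFs, multithread = TRUE, nbases = 1e10,
              errorEstimationFunction = f, verbose = TRUE)
})
for (nm in names(errF_all)) print(plotErrors(errF_all[[nm]], nominalQ = TRUE))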

benjjneb commented 2 years ago

BTW if there's interest in others getting the mock data to play with I could ask Jed if he'd mind us sharing

Yes, if you have mock community data from NovaSeq, I'd love to be able to look at it. @jcmcnch

cjfields commented 2 years ago

We're trying to develop a lab pipeline and seeing as it looks like we'll be dealing with NovaSeq data going forward, my current recommendation for users of NovaSeq data is going to be to try out all four methods of error learning and choose the plots that look best for the data.

That's amazing work @hhollandmoritz ! Yeah we've set up a 'NovaSeq' fix in our Nextflow implementation that essentially does option 1, but I agree it's worth making this more flexible based on the above options.

jcmcnch commented 2 years ago

Hi everybody,

An update from me and Liv (@livycy), who has done some testing of the improved error model from @JacobRPrice on our mock communities (16S and 18S; clone sequences of nearly full-length rRNA) for which we know the true result and have reliably gotten good denoising from q2-dada2 in previous runs (HiSeq, MiSeq, etc.).

TL;DR - while the error model looks better, we still see little change in the abundance of some artifactual sequences (defined here as 1-mismatches to the true sequence), in particular from A=>G and T=>C transitions, which had very "flat" error model profiles.

Some more detail:

For our mocks, we normally get < 0.5% of total sequences falling outside the exact expected mock sequences after denoising with DADA2 (consistent between q2-dada2 and the native R version), which we've been very happy with. In our past few NovaSeq runs, we're getting up to 10% of reads falling outside the true mocks, which is alarming and something we'd really like to resolve. For the sequences that differ from the reference, we know from BLASTing them that a significant fraction are 1 base away from the true sequences and therefore likely denoising artifacts.

We see the same type of error dips and issues, so Liv ran the new error model from @JacobRPrice, with similar results:

Here's the original error model:

[figure: original error model]

And now the improved one:

[figure: improved error model]

So everything looks nicer. No more dips etc.

However, it doesn't appreciably change the percentage of total ASV counts that differ from the known sequences (3.7% originally vs 3% with the new error model).

I then looked more carefully at the 1-mismatch artifacts from the most abundant member of our 16S mock and did some quick counting by eye (looking at alignments in a viewer; I'm sure there's a smarter way using biopython or something). The locations of the variants are scattered across the ASVs, though they are definitely more abundant in the reverse read.
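(For anyone wanting to script that count: a rough sketch in R, assuming the ShortRead package; asvs and mock_ref are placeholder objects holding the ASV sequences and the known reference.)

library(ShortRead)

# edit distance of each ASV to the known mock reference sequence;
# distance 1 corresponds to the likely 1-mismatch denoising artifacts
d <- srdistance(DNAStringSet(asvs), mock_ref)[[1]]
table(d)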

From this I can conclude a couple of things (so far this is specific to this particular member of the mock community which happens to be a SAR11):

When looking at the error profiles I do see why DADA2 might not be doing well with A=>G and T=>C transitions (the error profile is pretty flat), but the rest is quite mysterious to me... @benjjneb @hhollandmoritz @jonalim @Y-Q-Si - any suggestions about the next step we could take to try to resolve the poor performance on our mocks?

Thanks, Jesse

jcmcnch commented 2 years ago

Hi all,

I just wanted to send a brief update on this. We dug into some of our previous datasets and found that while our two most recent NovaSeq runs had issues with DADA2 denoising (i.e. many 1-mismatches to the mock, suggesting denoising issues) that were not resolved by improving the error model as described above, a previous NovaSeq run had excellent performance with DADA2 and worked nearly as well as a HiSeq rapid run (which does not have binned quality scores). The reason our two most recent runs have had such comparatively poor performance is mysterious, but it could be some property of the sequencing pools we've prepared and/or the ratio of amplicon to metagenome reads in those pools. We have noticed similar situations in the past where DADA2 failed for inexplicable reasons that we were not able to figure out even after consulting with @benjjneb.

So I would conclude from this that NovaSeq binned quality scores are not inherently a problem for DADA2, at least for our simple mock communities (FYI @cjfields). If anyone would like access to these mock community datasets for testing, I would be happy to provide them. @benjjneb would your group be interested in taking a closer look? This would include a number of mocks derived from the same exact sequencing libraries run on HiSeq and then on NovaSeq, both of which worked well in terms of denoising, and another NovaSeq run (in addition to the one we already shared) which had quite serious issues with the same mocks.

-Jesse

nejcstopno commented 2 years ago

Thanks all for this resourceful thread! I too am working on a large dataset produced with NovaSeq, and I am about to test out these models. I am considering doing the testing on a randomly subsetted dataset (maybe 1/10) but don't know if this is sufficient to decide which model to use for the full dataset. Any comments? I have 970 samples with 200k reads each, and running the error-learning steps for all models on the full dataset would take a couple of weeks (changing nbases to 1e10 alone took nearly 3 days to complete for the forward reads only).

JacobRPrice commented 2 years ago

@nejcstopno , IMHO testing on a 10% subset should give you a pretty good idea of what the outcome would be for the full dataset.

As for your choice of nbases = 1e10: the default is 1e8 and is generally sufficient for most purposes, and 1e10 may be larger than the number of bases offered by the (970*0.1 =) 97 samples you're considering training on. Assuming you have 200k reads per sample and each read is 200 bp long post-trimming, it would take 250 samples to reach 1e10:

> 1e10 / (2e5*200)
[1] 250

bjmarzul commented 2 years ago

I'm not sure how I got here, so I'm hesitant to comment, but I'm pretty sure setting span = 2 in loess isn't correct. My understanding is that span is the fraction of the data used for smoothing; it's often called alpha and defaults to 0.75 (75%). Reducing the span (adjusting it closer to 0) makes the line "more wiggly" and less straight. There's a tutorial somewhere on optimizing the smoothing parameter of loess, which could be a good place to gather information, as well as the geom_smooth() documentation for ggplot2, which is where I probably first heard of loess regression.

cmgautier commented 1 year ago

Hello! I have analysed metagenomic data with dada2 in the past, but this time I have the same issue as you, probably binned quality scores, but with an Illumina MiSeq. I read the whole discussion, and thank you for the information provided. I have some questions for a better understanding:

csmiguel commented 1 year ago

I used Option 4 (see above), with the function loessErrfun_mod4 described there, and it worked great with my data.

cmgautier commented 1 year ago

I used Option 4 (see above), with the function loessErrfun_mod4 described there, and it worked great with my data.

Hi @csmiguel, thanks for your answer. I tried option 4, but I get this error message: "Error rates could not be estimated (this is usually because of very few reads). Error in getErrors(err, enforce = TRUE) : Error matrix is NULL." Do you have any idea what's happening?

I am not sure if I need to do the filterAndTrim step when I look at my quality profile plot... maybe that's the problem? [figure: plotqualityprofile]

Thank you!

csmiguel commented 1 year ago

I think your quality profiles are OK. You should expect those grey blocks from the binned qualities, with the darkest on top. Try doing filterAndTrim following the standard tutorial from dada2. Maybe the function learnErrors is expecting clean and filtered reads (e.g. no ambiguous nucleotides, Ns).

cmgautier commented 1 year ago

I think your quality profiles are OK. You should expect those grey blocks from the binned qualities, with the darkest on top. Try doing filterAndTrim following the standard tutorial from dada2. Maybe the function learnErrors is expecting clean and filtered reads (e.g. no ambiguous nucleotides, Ns).

I do this:

out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs, truncLen=c(240,240), maxN=0, maxEE=c(2,5), truncQ=2, rm.phix=FALSE, compress=TRUE, multithread=FALSE)

And after that, I tried option 4 as you suggested and got the error message I sent you. When I only do "errF <- learnErrors(filtFs, multithread=TRUE)" (following the standard dada2 tutorial), I get this plot: [figure: ploterrors] The result is not very nice...

cmgautier commented 1 year ago

Hi @benjjneb, I tried to use option 4, proposed by @hhollandmoritz based on @jonalim's method, for my binned quality scores. But I get this error message: "105972963 total bases in 476016 reads from 14 samples will be used for learning the error rates. Initializing error rates to maximum possible estimate. Error rates could not be estimated (this is usually because of very few reads). Error in getErrors(err, enforce = TRUE) : Error matrix is NULL."

Can you help me with this problem please?

Thank you so much.

benjjneb commented 1 year ago

@cmgautier Based on the above, I think I may need to reproduce your error on my machine to move forward.

Could you try to make a minimal example dataset (ideally just one sample) that creates this error, and then share the data and code so I can recreate it on my own computer?

vfonti commented 1 year ago

Sorry for the deleted message. I was experiencing the same issues as @jcmcnch with the improved learnErrors functions. I explored my reads in search of problems with one or more of my samples, but everything looked OK. It worked only after restarting RStudio. I have no idea why!

jcmcnch commented 1 year ago

Hi all, I just wanted to drop in a note here that the issue we were experiencing with NovaSeq was most likely due to index hopping rather than any inherent problem with DADA2's error model. For those who don't already know (as we did not!): it is quite important to have Unique Dual Indices (a.k.a. UDIs) when using the new "patterned flow cells" (i.e. the NovaSeq and other newer Illumina instruments), which have much bigger issues with sample-to-sample bleedthrough.

Because of this, we had to discard a whole sequencing run and re-ran it on the older version of the HiSeq that still uses "bridge amplification" (NB: there are 2 versions of the HiSeq; we used the old one, which is no longer supported as of early 2023). If your libraries do not have UDIs (i.e. single/combinatorial dual indexing), it's still safe to use the MiSeq, which will continue to use "bridge amplification" going forward but has really low sequencing depth compared to the newer instruments. Please check with your sequencing facility if you're not sure, so you don't generate useless data like we did...

pdhrati02 commented 1 year ago

Hi all, thank you for the detailed discussion. I have 16S v3-v4 data, and I have binned quality scores too. [figure]

Trying the normal DADA2 pipeline, I realized that the error plots did not show a good fit, with dips. [figure]

This is when I tried option 4 as mentioned by @hhollandmoritz (first performing the normal filter and trim):

out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs, truncLen=c(280,220), trimLeft=c(17,21), maxN=0, maxEE=c(2,3), truncQ=2, rm.phix=TRUE, compress=TRUE, multithread=TRUE)

and then option 4. However, I get the following error message:

10022445291 total bases in 38108157 reads from 80 samples will be used for learning the error rates. Initializing error rates to maximum possible estimate. Error rates could not be estimated (this is usually because of very few reads). Error in getErrors(err, enforce = TRUE) : Error matrix is NULL. In addition: Warning message: In rval[, 1:ncol(tt)] + tt : NAs produced by integer overflow

Can someone help me as to why this is happening?

Also, I did read a lot of issues regarding this and wanted to make sure I understand correctly that there is no single solution to this yet, @benjjneb?

Thanks in advance, DP

hhollandmoritz commented 1 year ago

Hi @pdhrati02! I'm not sure what's going on exactly, but @cmgautier had the same error as you earlier last fall and maybe has since found a solution?

cmgautier commented 1 year ago

@cmgautier Based on the above, I think I may need to reproduce your error on my machine to move forward.

Could you try to make a minimal example dataset (ideally just one sample) that creates this error, and then share the data and code so I can recreate it on my own computer?

Hi @benjjneb, I am sorry for my late answer (long hiatus). I tried it again and I still have this problem. Can I send you an example dataset, @benjjneb, to try to find the problem as you proposed?

My problem: I try to use option 4, proposed by @hhollandmoritz based on @jonalim's method, for my binned quality scores. But I get this error message: "105972963 total bases in 476016 reads from 14 samples will be used for learning the error rates. Initializing error rates to maximum possible estimate. Error rates could not be estimated (this is usually because of very few reads). Error in getErrors(err, enforce = TRUE) : Error matrix is NULL."

Thank you so much for your help!

NJOram commented 1 year ago

I have the same error as @cmgautier using the four options proposed by @hhollandmoritz and @jonalim (16S V4-V5 data from an Illumina NextSeq). Happy to provide data if that helps, @benjjneb!

Many thanks!

pdhrati02 commented 1 year ago

Hi @cmgautier @benjjneb @hhollandmoritz I was wondering if anyone has any suggestions on how to tackle the issue we are facing with option 4 suggested earlier? I will be happy to share an example dataset if anyone wishes to recreate the error on their end. (This is NextSeq data.)

Any help would be appreciated. Thank you in advance.

hhollandmoritz commented 1 year ago

@pdhrati02 I don't know what might cause the error, so I'll be of limited help in troubleshooting, but I could try running your data on our server with our code and let you know if I get the same error. That should at least give you a place to start in understanding whether it is a data thing or a code thing. (I've never used option 4 on NextSeq data, only NovaSeq; while I don't think it would be an issue, it wouldn't hurt to check.)

pdhrati02 commented 1 year ago

Hi @hhollandmoritz, thank you for your response; this would be a great help. Can you share your email address, and then I can share 2-4 samples' sequence data with you and we can try to solve this issue? Also, do you think this might be happening because it is NextSeq data?

This is what we know about NextSeq and binned scores: The NextSeq 2000 uses Real Time Analysis (RTA) software v3 for image analysis and basecalling, including Q score assignment. RTA3 assigns 1 of 4 Q scores to each base call: 0, 12, 26, or 34 for runs of 300 cycles or less; 0, 9, 20, or 34 for runs of 600 cycles. (See slide 17 of the attached 'NextSeq 1000/2000 Run Monitoring' presentation.) The binning of Q scores reduces the data footprint of the analysis, allowing more data to be handled simultaneously and higher run output. This is in contrast to earlier versions of RTA on, e.g., the MiSeq, where each basecall was assigned a specific Q score.

With dada2: because RTA3 assigns only 4 discrete scores, you don't get a smooth continuous line with an inverse relationship between estimated error rate and Q score; instead you see a step-wise pattern, because the input is not continuous Q score data but rather 4 discrete Q scores. If anything, the Q score binning model underestimates the Q score: any base call with a Q score of 11 to 24 will have a Q score of 11 assigned. This might mean that in NextSeq 1000/2000 or NovaSeq data you see a lower estimated error rate per Q score than in MiSeq data.
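(For context, the community fixes referenced in this thread, such as the option-4 function, typically deal with those step-wise fits by forcing the fitted error rates to never increase with Q. The exact option-4 code is not reproduced here, but the monotonicity step looks roughly like this sketch, where est is assumed to be a matrix of fitted transition rates with columns ordered by increasing Q:)

# a minimal sketch: force each transition's error rate to be
# non-increasing as the quality score increases
enforce_monotone <- function(est) {
  # cummin() along each row makes every entry <= all entries to its left
  t(apply(est, 1, cummin))
}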

Any help would be appreciated. Thank you

cmgautier commented 1 year ago

Hi @hhollandmoritz and @pdhrati02. I also still have this problem with my Illumina dataset. I am interested in your findings, and I can send you some data to try if that would help. Let me know. Thank you so much for your help.

benjjneb commented 1 year ago

Thanks for the ongoing input here. I've blocked out some time on my calendar in two weeks to try to make some progress towards a more officially supported solution here.

hhollandmoritz commented 1 year ago

@pdhrati02, you can send it to hannah.holland-moritz@unh.edu. @cmgautier, if you have a similar number of samples, I can try those as well on our server.

I'll try to run these as soon as possible, but just a warning that it might take a little longer, given it's the end of the semester here and we're preparing for field season.

@benjjneb Something to consider as well as you're working on this (which I'm sure you already are thinking of). We find lots of polyGs in our NovaSeq data, leading to mid-read dips in coverage followed by (false) high quality scores. Our workflow right now uses cutadapt's polyG removal option to trim these back before we use dada2's filter-and-trim function. However, I noticed in https://github.com/benjjneb/dada2/issues/1730#issuecomment-1518201613 that you thought the error model may be tripped up by this practice. We haven't seen any evidence of poorer error model fits with samples with lots of polyGs that have been trimmed, but if there's going to be an officially supported solution, it might be nice to also consider polyG trimming at the same time.
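(For reference, the trimming step described above can be done with cutadapt's 2-color-aware quality trimming. A minimal sketch invoked from R; the file names and quality cutoff are placeholders, and cutadapt is assumed to be on the PATH:)

# a minimal sketch: cutadapt's --nextseq-trim treats trailing G's on
# 2-color instruments as low quality and trims them with low-quality ends
system2("cutadapt", args = c(
  "--nextseq-trim=20",              # polyG-aware quality trimming; cutoff is illustrative
  "-o", "trimmed_R1.fastq.gz",      # placeholder output names
  "-p", "trimmed_R2.fastq.gz",
  "raw_R1.fastq.gz", "raw_R2.fastq.gz"   # placeholder input names
))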

cjfields commented 1 year ago

> @benjjneb Something to consider as well as you're working on this (which I'm sure you already are thinking of). We find lots of polyGs in our NovaSeq data, leading to mid-read dips in coverage followed by (false) high quality scores.

I can also confirm this, though they tended to be pretty rare unless the amplicon size is super-short or samples are low input (high background).

EDIT: If I'm reading the issue correctly, I think the problem isn't necessarily cutadapt's trimming per se, but the use of quality-based trimming, which leaves variable-length reads ('ragged ends'). So pretty much any trimming tool that can do this (trimmomatic, fastp, cutadapt) has the same issue.
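(If ragged ends are indeed the problem, one mitigation for fixed-length amplicons, though not ITS, is to skip quality-based end trimming and instead truncate every read to a fixed length in dada2's filterAndTrim. A sketch; the lengths are illustrative assumptions:)

# a minimal sketch: fixed-length truncation avoids variable-length
# ('ragged') reads going into error learning; lengths are assumptions
out <- filterAndTrim(
  rawFs, filtFs, rawRs, filtRs,
  truncLen = c(220, 180),   # every surviving read has exactly these lengths
  maxEE = c(2, 2),
  rm.phix = TRUE,
  multithread = TRUE
)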

Also, would low complexity filtering catch those with polyG?

hhollandmoritz commented 1 year ago

@cjfields That's really interesting information. We see them quite frequently and haven't connected them to either low input or short amplicons (so far, it's seemingly random, so I just assumed it was an artifact of the other samples that were sequenced on the same run).

Your edit aligns with my interpretation of issue #1730, which is that the variable length messes with the error learning. So, in theory, fungal ITS amplicons, which commonly have variable lengths, could also pose this issue regardless of the sequencing method used. However, in practice, I've not noticed any difference between error rate estimations from runs with lots of trimmed polyGs and runs with very few. I'm not clear whether this is simply because the polyG reads are subsequently removed (they typically have error profiles that lead to their removal in the filter-and-trim step), or because the error models are not as sensitive to the truncated sequences as they are to the binned quality scores.

I've only learned about the low-complexity filtering option in reading the issue a couple days ago, so I don't know how good it is at catching them, but I'm eager to try.
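(For anyone else wanting to try it: dada2's filterAndTrim has an rm.lowcomplex argument that drops reads whose kmer complexity falls below a threshold, which should catch near-pure-G reads. A minimal sketch; the threshold is an illustrative assumption:)

# a minimal sketch: rm.lowcomplex removes reads with kmer complexity
# below the given value; a polyG read has near-minimal complexity
out <- filterAndTrim(
  rawFs, filtFs,
  rm.lowcomplex = 8,    # threshold is an assumption; see ?seqComplexity for the scale
  multithread = TRUE
)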

And if anyone wants a couple of NovaSeq samples with lots of polyGs to play with, I'm happy to share some. (ETA: we have had some particularly egregious examples lately; see the attached forward-read quality plots.)

[Attached image: fwd_qual_plots]

cjfields commented 1 year ago

@hhollandmoritz very interesting! I agree re: ITS, those in particular are thorny for a number of reasons, including the potential for no overlap. Those are definitely worth a look; I'm wondering whether the merging step (which I think has the overhang trimming) might get rid of these.

hhollandmoritz commented 1 year ago

@cjfields The merging step is also a possibility; however, if a sufficiently large number of G's are left on the end of a sequence, I'd hypothesize that the merging would go fine and create artificial ASVs with long runs of G's in the center. That was part of our original reasoning behind removing them, once we finally figured out what the dips were from. (I certainly panicked the first time I saw a student with a plot like this.)
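(As a sanity check for that failure mode, one can screen the final sequence table for ASVs containing long G homopolymers. A minimal sketch; the run-length threshold is an arbitrary assumption:)

# a minimal sketch: flag ASVs containing a long run of G's
seqs <- dada2::getSequences(seqtab)
has.polyG <- grepl("G{15,}", seqs)   # 15+ consecutive G's; threshold is illustrative
table(has.polyG)                     # how many ASVs look suspicious
seqtab.screened <- seqtab[, !has.polyG, drop = FALSE]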

pdhrati02 commented 1 year ago

@hhollandmoritz thank you very much for offering to help. I have shared a few files for a trial run with you now.

cmgautier commented 1 year ago

@pdhrati02, with @hhollandmoritz we may have found the problem (it works for my dataset) with learning error rates from NovaSeq data: it's necessary to load library(dplyr); packageVersion("dplyr") in addition to the dada2 package. I hope it will work for you too.
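(In other words, a minimal sketch of the fix, presumably needed because the modified error function calls dplyr functions without the dplyr:: prefix; loessErrfun_mod4 is assumed to be the option-4 function shared earlier in this thread:)

library(dada2); packageVersion("dada2")
library(dplyr); packageVersion("dplyr")   # must be attached before error learning runs

errF <- learnErrors(
  filtFs,
  errorEstimationFunction = loessErrfun_mod4,  # assumed option-4 function from this thread
  multithread = TRUE,
  verbose = TRUE
)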

cjfields commented 1 year ago

> @cjfields The merging step is also a possibility; however, if a sufficiently large number of G's are left on the end of a sequence, I'd hypothesize that the merging would go fine and create artificial ASVs with long runs of G's in the center. That was part of our original reasoning behind removing them, once we finally figured out what the dips were from. (I certainly panicked the first time I saw a student with a plot like this.)

Ah I see what you mean, that the polyG would be longer than the amplified fragment:

5' ---------GGGGGGGGGGGGGGGG           3'
3'          GGGGGGGGGGGGGGGG---------  5'

Probably a number of ways to handle that, but I agree the easiest is to simply remove it.

davised commented 1 year ago

Hi all. Figured I'd chime in re: the polyGs, since we have had to deal with those and had lengthy talks with Illumina about this topic.

  1. In the NextSeq control software v1.4.1 and earlier, there is a bug where Q score 0 (N) base calls are incorrectly assigned a base in the output. The quality score is correct (0), but the base should be assigned N, not something else.
  2. Additionally, on short inserts with read-through, the NextSeq and other two-laser machines output strings of G's when there is no base to call. So if you have an unreasonably short insert, you can get those long strings of G's, even with good Q scores.

In general, I'm usually not too worried about things like these getting included in the outputs, because I typically expect them to fail at the merge step, or, if they somehow get through that, to be discarded when I remove ASVs with low-quality annotations (nothing classified below kingdom/domain, etc.).
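(That last filter might look roughly like this sketch, assuming seqtab is the sequence table and taxa comes from assignTaxonomy(); the column name depends on the reference database used:)

# a minimal sketch: drop ASVs that fail to classify even at the
# kingdom/domain level
keep <- !is.na(taxa[, "Kingdom"])
seqtab.classified <- seqtab[, keep, drop = FALSE]
taxa.classified <- taxa[keep, , drop = FALSE]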

Maybe it's a bigger issue on the NextSeq, though.