benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0

learn Error issue #665

Closed FloraVincent closed 5 years ago

FloraVincent commented 5 years ago

Hi everyone, I have paired end data from Illumina MiSeq, 150 bp (115 after trimming). I just started with one sample to check all was good (not my first time with DADA2). I use dada2 to trim the primers using trimLeft with the following command:

out <- filterAndTrim(fnFs[9], filtFs[9], fnRs[9], filtRs[9], truncLen=c(145,145), maxN=0, maxEE=c(2,2), truncQ=2, trimLeft = c(30,29), rm.phix=TRUE, compress=TRUE, multithread=TRUE, matchIDs = TRUE)

Below is the output of "errF <- learnErrors(filtFs[9], multithread=TRUE)" (filtFs[9] is just one of my fastQ R1):

Initializing error rates to maximum possible estimate.
Sample 1 - 154185 reads in 79013 unique sequences.
selfConsist step 2
selfConsist step 3
selfConsist step 4

It is taking a LOT of time compared to my previous runs, around 3h just to output the previous line. It's running on a big server, so I wonder if something is wrong, either in the data or in the commands. Several questions:

- Is it expected to take so much time? In which case I just let it run.
- I suspect human contamination in the data set (that will reasonably increase the fastq file); would it help to get rid of the contaminants directly in the fastq files before all the dada processing?

Thanks for the help and technical support on dada2! Flora

benjjneb commented 5 years ago

Is it expected to take so much time? In which case I just let it run

Time will increase with the number of unique sequences. But no, 79k is not a lot of unique sequences, and I wouldn't expect this to take that long, especially on a server with multiple cores available. One thing to check: is the process using multiple threads? It is possible it is only getting access to a single thread. For example, many common submission systems (e.g. SLURM) require you to specify the number of threads or cores you want; otherwise you'll get the default, which is usually 1 or 2.
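For example (a minimal sketch; the value 8 is purely illustrative, not from this thread), you can pass an explicit thread count to the dada2 functions instead of TRUE:

library(dada2)
# Request a specific number of threads rather than relying on multithread=TRUE
# auto-detecting cores (which may only see the 1-2 threads your job was allocated)
errF <- learnErrors(filtFs[9], multithread = 8)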

I suspect human contamination in the data set (that will reasonably increase the fastq file); would it help to get rid of the contaminants directly in the fastq files before all the dada processing ?

Yes, removing human contamination first is highly recommended. This will also speed everything up later.

FloraVincent commented 5 years ago

Hi Ben,

Thanks a lot for the quick response. Digging into the data, it looks like a good 30% of the run is contaminants, with sequences that diverge up to 50% from the one I'm looking for. I will obviously get rid of those. Could this be the origin of the long computation time? The fact that the sequences diverge so much? All the best, Flora. P.S. For info, learnErrors just finished and the result looks really bad... [error-rate plot attached]

benjjneb commented 5 years ago

Could this be the origin of the long computation time?

Yes, such a high fraction of contaminants will both greatly increase computation time and lead to poor error-model estimation.

FloraVincent commented 5 years ago

Hi,

Once the removal of contaminants was done I was able to run learnErrors much faster, but I (of course) encountered a new problem: I realised that R1 and R2 both contained Forward and Reverse primer sequences. This never happened in my MiniSeq runs before (but I did not prepare the library personally); apparently there are a few threads about this, but I couldn't really find a way through. To check it out I borrowed your code from https://benjjneb.github.io/dada2/ITS_workflow.html, using my 5'-3' sequences for REV and FWD on the original fastq files, and got the following table. Very weird. [table of primer-orientation counts attached]
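For reference, the orientation check I borrowed looks roughly like this (a sketch adapted from that workflow; FWD and REV are my primer sequences, and fnFs[1]/fnRs[1] one pair of the original fastq files):

library(ShortRead)
library(Biostrings)

# All orientations (forward, complement, reverse, reverse-complement) of a primer
allOrients <- function(primer) {
    dna <- DNAString(primer)
    orients <- c(Forward = dna, Complement = Biostrings::complement(dna),
                 Reverse = Biostrings::reverse(dna), RevComp = reverseComplement(dna))
    return(sapply(orients, toString))
}

# Count the reads in a fastq file that contain a given primer orientation
primerHits <- function(primer, fn) {
    nhits <- vcountPattern(primer, sread(readFastq(fn)), fixed = FALSE)
    return(sum(nhits > 0))
}

FWD.orients <- allOrients(FWD)
REV.orients <- allOrients(REV)
rbind(FWD.ForwardReads = sapply(FWD.orients, primerHits, fn = fnFs[1]),
      FWD.ReverseReads = sapply(FWD.orients, primerHits, fn = fnRs[1]),
      REV.ForwardReads = sapply(REV.orients, primerHits, fn = fnFs[1]),
      REV.ReverseReads = sapply(REV.orients, primerHits, fn = fnRs[1]))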

In order to continue with the DADA2 pipeline, it means I need to transfer all the FWD.ReverseReads_Forward from R2 to R1 and all the REV.ForwardReads_Forward from R1 to R2, so that all my FWD primers are in R1 and all my REV primers are in R2.

So I just tried that, creating a new R1 with only FWD primers and a new R2 with only REV primers, and got an error at filterAndTrim.

out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs, truncLen=c(145,145),
                     maxN=0, maxEE=c(2,2), truncQ=2, rm.phix=FALSE,
                     compress=TRUE, multithread=TRUE, matchIDs = TRUE)

Error: subscript is a logical vector with out-of-bounds TRUE values
In addition: Warning message:
In width(fqF) >= startF & width(fqR) >= startR :
  longer object length is not a multiple of shorter object length

I thought that the matchIDs = TRUE would overcome the different lengths of R1 (35729 reads) and R2 (35758 reads) and discard those that don't have matching coordinates.

Thanks, Flora

benjjneb commented 5 years ago

I thought that the matchIDs = TRUE would overcome the different lengths of R1 (35729 reads) and R2 (35758 reads) and discard those that don't have matching coordinates.

Me too! Not sure what is going on there. Does MiniSeq data keep the same read ID line format as MiSeq/HiSeq?

The easiest workaround would be to just filter R1 and R2 by there being a FWD primer on the R1 and a REV primer on the R2.

If you could share a (potentially cut-down) sample which throws that error, I'd be happy to take a look too. Could be a bug.

FloraVincent commented 5 years ago

Yes, sure. Here's the script and the link. Let me know when you download it so I can delete that link :)

# Extract reads of interest by loose matching

# Read in original fastq
R1_reads = readFastq("/B3-15_S90_L001_R1_001.fastq.gz") # gives details about your file
R2_reads = readFastq("/B3-15_S90_L001_R2_001.fastq.gz") # gives details about your file

# Grab the primers (including RC)
FWD_R1_reads = R1_reads[Gene_Fwd_filter(R1_reads)]
FWD_R2_reads = R2_reads[Gene_Fwd_filter(R2_reads)]
REV_R1_reads = R1_reads[Gene_Rev_filter(R1_reads)]
REV_R2_reads = R2_reads[Gene_Rev_filter(R2_reads)]

# Rewrite R1 with all the FWD  # might be the source of the problem in ID formatting
writeFastq(FWD_R1_reads, "/Gene/B3-15_S90_L001_R1_001.fastq.gz", mode="w", full=TRUE, compress=TRUE)
writeFastq(FWD_R2_reads, "/Gene/B3-15_S90_L001_R1_001.fastq.gz", mode="a", full=TRUE, compress=TRUE)

# Rewrite R2 with all the REV
writeFastq(REV_R1_reads, "/Gene/B3-15_S90_L001_R2_001.fastq.gz", mode="w", full=TRUE, compress=TRUE)
writeFastq(REV_R2_reads, "/Gene/B3-15_S90_L001_R2_001.fastq.gz", mode="a", full=TRUE, compress=TRUE)

# [ Perform classic first steps of DADA2: fnFs, fnRs, etc. ]

out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs, truncLen=c(150,150),
                     maxN=0, maxEE=c(2,2), truncQ=2, trimLeft = c(28,28), rm.phix=FALSE,
                     compress=TRUE, multithread=TRUE, matchIDs = TRUE)

===> Error message below
Error: subscript is a logical vector with out-of-bounds TRUE values
In addition: Warning message:
In width(fqF) >= startF & width(fqR) >= startR :
  longer object length is not a multiple of shorter object length

I do indeed think that the writeFastq step creates fastq files that don't completely conserve the formatting of the IDs. match(id(R1), id(R2)) gives a vector full of FALSE. I checked: the "filtered" R1 and R2 fastq files (just based on the primers) both contain the ID 11105:24584:1830 (and probably many others), but for the moment it's just not recognised as such by filterAndTrim.
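For reference, that kind of ID comparison looks roughly like this (a sketch, assuming the identifier is the first whitespace-delimited field of the header line; the file paths are placeholders for my rewritten files):

library(ShortRead)
# Compare read identifiers between the rewritten R1 and R2 files,
# keeping only the first whitespace-delimited field of each header
idF <- sub(" .*", "", as.character(id(readFastq("new_R1.fastq.gz"))))
idR <- sub(" .*", "", as.character(id(readFastq("new_R2.fastq.gz"))))
table(idF %in% idR)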

Thanks for your (previous) help!! Flora

benjjneb commented 5 years ago

The error is arising because there is at least one read that is being written to both the forward and reverse files, leading to multi-matching between the F/R IDs. This is unexpected behavior and would never occur in standard fastq files. This happens because this read has a lot of Ns, and thus matches both primers and then gets written to both new fastq files.

You should be able to solve the filtering bug by excluding reads that would be written to both the F and R files in your writeFastq code.
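Something along these lines might work (a sketch reusing the filter functions from your script; untested):

# Flag reads that match both the FWD and REV primer (e.g. because of many Ns)
# and exclude them, so no read is written to both the new R1 and the new R2
both_R1 <- Gene_Fwd_filter(R1_reads) & Gene_Rev_filter(R1_reads)
both_R2 <- Gene_Fwd_filter(R2_reads) & Gene_Rev_filter(R2_reads)
FWD_R1_reads <- R1_reads[Gene_Fwd_filter(R1_reads) & !both_R1]
FWD_R2_reads <- R2_reads[Gene_Fwd_filter(R2_reads) & !both_R2]
REV_R1_reads <- R1_reads[Gene_Rev_filter(R1_reads) & !both_R1]
REV_R2_reads <- R2_reads[Gene_Rev_filter(R2_reads) & !both_R2]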

Edit: These 6 reads are getting written to both the forward and reverse files...

table(Gene_Fwd_filter(R2_reads) & Gene_Rev_filter(R2_reads))

 FALSE   TRUE
183814      6

FloraVincent commented 5 years ago

Hi Benjamin,

Thanks for the help! I managed to get rid of the learnErrors problem. The whole step is now quite efficient and leads to decent error plots, but I bumped into a new issue with the variants. [screenshot attached]

The merging step is behaving strangely. I'm supposed to have at least a 4 bp overlap between the Fwd and Rev reads, yet running the following line of code: mergers <- mergePairs(dadaFs, derepFs, dadaRs, derepRs, minOverlap = 3, maxMismatch = 0, verbose = TRUE)

Produced this:

0 paired-reads (in 0 unique pairings) successfully merged out of 10237 (in 6 pairings) input.
0 paired-reads (in 0 unique pairings) successfully merged out of 22333 (in 6 pairings) input.
0 paired-reads (in 0 unique pairings) successfully merged out of 36576 (in 3 pairings) input.
etc...

I worked around it with concatenate = TRUE. Aligning the sequences confirms that they can overlap, yet DADA2 doesn't seem to want to merge them. [screenshot of alignment attached]

I decided to move on and look at the abundance table. Here something quite weird appears: [screenshot of abundance table attached] The first column is the samples. The first sequence is obviously the most abundant. Fair enough. But sequences 2 and 3 have suspiciously similar abundances, as do 6 and 7. It looks more like an error than two real nucleotide variants. Unless I have two copies of the same gene in a single organism, which would explain that the read counts of the two variants are similar?

All the best, thanks for your help! It's priceless.

Flora

benjjneb commented 5 years ago

That is surprising that the merging is failing...

Can you post the output of the following?

mm <- mergePairs(dadaFs, derepFs, dadaRs, derepRs, minOverlap = 3, maxMismatch = 0, returnRejects=TRUE)
mm[[1]][1:10, -1]

But sequences 2 and 3 have suspiciously similar abundances, as do 6 and 7. It looks more like an error than two real nucleotide variants. Unless I have two copies of the same gene in a single organism, which would explain that the read counts of the two variants are similar?

Yep. It is almost certainly two different alleles from the same organism. Errors would never appear in such constant ratios sample-to-sample, but if you have two copies of your marker-gene that are differentiable in the sequenced region, you will get exactly this pattern.
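If you want to double-check, you can count how many differences separate the two variants (a sketch; seqtab is assumed to be your sequence table from makeSequenceTable):

library(dada2)
# Order ASVs by total abundance and compare the 2nd and 3rd most abundant ones
sqs <- colnames(seqtab)[order(colSums(seqtab), decreasing = TRUE)]
nwhamming(sqs[2], sqs[3])  # number of differences in the pairwise alignment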

FloraVincent commented 5 years ago

Hi Benjamin,

You just made my day with your last sentence.

For the output:

mm[[1]][1:10, -1]

     abundance forward reverse nmatch nmismatch nindel prefer accept
1         9507       1       1     44        17     12      2  FALSE
2          359       1       2     44        17     12      1  FALSE
3          222       2       1     47        19     11      2  FALSE
4          137       3       1     47        19     11      2  FALSE
5            7       2       2     47        19     11      2  FALSE
6            5       3       2     47        19     11      2  FALSE
NA          NA      NA      NA     NA        NA     NA     NA     NA
NA.1        NA      NA      NA     NA        NA     NA     NA     NA
NA.2        NA      NA      NA     NA        NA     NA     NA     NA
NA.3        NA      NA      NA     NA        NA     NA     NA     NA

For info, this is what I use for the filter and trim:

out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs, truncLen=c(150,150),
                     maxN=0, maxEE=c(2,2), truncQ=2, trimLeft = c(28,28), rm.phix=FALSE,
                     compress=TRUE, multithread=TRUE, matchIDs = TRUE) # On Windows set multithread=FALSE

I have 28 bp primers on both the Fwd and Rev reads.

I don't know how to thank you for your help; incredibly useful and precious. Acknowledgements?

Best Flora

benjjneb commented 5 years ago

Which version of the dada2 R package are you using? I suspect it might be a pre-1.8 version; we made a slight change to the default alignment parameters in version 1.8 that I think will fix the issue you are seeing, i.e. a long imperfect alignment being preferred over a perfect but very short alignment between the sequences.
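You can check the installed version with packageVersion:

# Version 1.8 or later includes the changed alignment defaults mentioned above
packageVersion("dada2")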

I don't know how to thank you for your help; incredibly useful and precious. Acknowledgements?

Glad it is helpful!

FloraVincent commented 5 years ago

Hi,

Once again following your advice was useful. Just reran everything with updated versions of R and DADA2. Here's the output of:

mm <- mergePairs(dadaFs, derepFs, dadaRs, derepRs, minOverlap = 3, maxMismatch = 0, returnRejects=TRUE, verbose = TRUE)
mm[[1]][1:10, -1]

     abundance forward reverse nmatch nmismatch nindel prefer accept
1         7465       1       1      4         0      0      2   TRUE
2         2031       2       1      0         0      0      2  FALSE
3          272       1       2      4         0      0      1   TRUE
4          221       3       1      4         0      0      2   TRUE
5          137       4       1      4         0      0      2   TRUE
6           90       2       2      0         0      0      1  FALSE
7            9       3       2      4         0      0      2   TRUE
8            4       4       2      4         0      0      2   TRUE
NA          NA      NA      NA     NA        NA     NA     NA     NA
NA.1        NA      NA      NA     NA        NA     NA     NA     NA

Perfect merging and coherent bimera detection.

The new table looks like this: [screenshot of abundance table attached] It seems that the duplicated gene is still there in columns H-I, but I guess that in K-L it's no longer the same, right? (In the older version of DADA2, I had another pair of ASVs with similar abundances.)

Moreover, I'm not sure how to interpret the low-abundance ASVs... Considering the way DADA2 is built they should be biologically real, but it seems quite surprising to have such low-abundance variants.

Best, Flora

benjjneb commented 5 years ago

Moreover, I'm not sure how to interpret the low-abundance ASVs... Considering the way DADA2 is built they should be biologically real, but it seems quite surprising to have such low-abundance variants.

There is no one-size-fits-all answer for the best way to deal with low-abundance variants. DADA2 typically does a good job of correcting sequencing and PCR errors, but other types of artefacts can also come in at low frequencies, especially contamination. However, most microbial communities are made up of more than a few species and will include low-abundance variants. I'd ask yourself whether low-abundance variants are important for the analysis you want to do, and then proceed accordingly.
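If you do conclude that very low-abundance variants aren't needed for your analysis, one simple option (a sketch, not part of the standard pipeline; seqtab.nochim is the usual chimera-removed sequence table, and the threshold of 10 total reads is arbitrary) is to prune them from the sequence table:

# Drop ASVs whose total count across all samples falls below an arbitrary threshold
min.total.reads <- 10
seqtab.pruned <- seqtab.nochim[, colSums(seqtab.nochim) >= min.total.reads, drop = FALSE]
dim(seqtab.nochim); dim(seqtab.pruned)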

SWambua commented 5 years ago

I'm working with 6 samples sequenced single-end on Ion Torrent. All seems to work well in DADA2 until the error-learning step, where: 1) there is a notification that a given number of sequences and reads from 4 (not 6) samples are being used to learn the errors. Is it OK that dada2 isn't using all 6 samples?

2) the step takes too long. I quit after 2 hrs or so. I'm using Windows and did set multithread = FALSE. Is this normal? I mean, the 20 samples from the tutorial ran in less than 2 minutes.

Thanks for all your help.

S.

benjjneb commented 5 years ago

there is a notification that a given number of sequences and reads from 4 (not 6) samples are being used to learn the errors. Is it OK that dada2 isn't using all 6 samples?

Yep, that is normal. learnErrors uses a subset of the data that is large enough to learn the error model, in order to cut down on computation time.
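If you'd rather have it use more (or all) of your data, you can raise the number of bases it samples before stopping (a sketch; in recent versions the nbases argument defaults to 1e8, and filts stands in for your filtered fastq files):

# Use a larger subset of reads for error learning, at the cost of more compute time
errF <- learnErrors(filts, nbases = 1e9, multithread = FALSE)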

the step takes too long. I quit after 2 hrs or so. I'm using Windows and did set multithread = FALSE. Is this normal? I mean, the 20 samples from the tutorial ran in less than 2 minutes.

learnErrors is going to be the most computationally costly step in your workflow, given what you've described about your samples. It's not unexpected for it to take a while, especially because you aren't using multithreading. I'd recommend setting multithread=TRUE if you can; it will speed things up a fair amount.

The tutorial intentionally uses smaller datasets to let people run through it quickly; most people's real data will take longer. You can speed things up a bit by filtering more stringently earlier on, but I think you mostly need to turn multithreading on and give it some time to process.
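For example, a stricter up-front filter might look something like this (a sketch; the parameter values are purely illustrative, not a recommendation for your data, and reads/filt.reads stand in for your input and output paths):

# Tighter quality filtering reduces the number of noisy unique sequences
# that learnErrors has to process
out <- filterAndTrim(reads, filt.reads, maxLen = 300, maxN = 0, maxEE = 2,
                     truncQ = 2, rm.phix = TRUE, compress = TRUE, multithread = FALSE)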

SWambua commented 5 years ago

Many thanks @benjjneb for your prompt and helpful response, as always. I finished the learnErrors step in 3 hrs; I am not sure if setting multithread = TRUE helped. I am told multithreading isn't supported on Windows.

I have a new issue. The error plots look strange, and I am not sure what this pattern is consistent with: [error-rate plots attached]

As I had mentioned, mine is single-end 16S sequencing done with Ion Torrent, where the fastq files came to me with primers (V2-4-8) already removed. Contamination is therefore very unlikely, but perhaps I set the wrong filtering/trimming parameters?

out <- filterAndTrim(reads, filt.reads, maxLen = 300, maxN=0, rm.phix=TRUE, compress=TRUE, multithread=FALSE)

benjjneb commented 5 years ago

@SWambua In previous experience with 454 pyrosequencing, these sorts of "flat" error models were actually pretty normal: because pyrosequencing makes few substitution errors, the quality scores mostly reflected homopolymer indels, and thus the substitution error rates were driven by PCR and were mostly independent of the quality score.

I have almost no experience with Ion Torrent, so take that into account, but while the flatness doesn't concern me, the high levels of errors being detected do somewhat. Perhaps this is normal for Ion Torrent, but those are pretty high error rates (>1%). You can still use it; a too-high error-rate model won't lead to wrong results, but it will reduce sensitivity to rare variants in your samples.
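If it helps, you can inspect the fitted rates directly rather than just eyeballing the plots (a sketch; errF stands in for your learnErrors output):

# Observed vs. fitted error rates as a function of quality score
plotErrors(errF, nominalQ = TRUE)
# The fitted error-rate matrix itself: rows are transition types (A2A, A2C, ...),
# columns are quality scores
getErrors(errF)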

SWambua commented 5 years ago

Many thanks @benjjneb. This makes sense... yes, the error rate is high, but I think this is consistent with Ion Torrent sequences; in most publications I have seen Phred scores of 20, with a few above this.

Again thanks,

Sammy