benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0

removeBimera never ending #1609

Closed ChloePZS closed 3 months ago

ChloePZS commented 1 year ago

Hello, I have a V3-V4 16S rDNA metabarcoding dataset of 68 samples with a read depth of 400-900K reads/sample. Samples were sequenced on a NovaSeq 6000 (PE250). Primers were removed with Cutadapt, all sequences were filtered, and I followed the big data workflow. After the merging step, I end up with 1,298,725 ASVs.

I tried to run removeBimeraDenovo with default values and multithreading (12-core processor), but as the estimated running time was over 18 days (and still increasing), I stopped the process. So how could we optimize the processing time of the function?

- Could the multithreading actually saturate the machine? Would it be better to use only 1 core or more?
- What method would you advise: consensus vs per-sample? Or other function arguments?
- Would it be better to remove all short sequences beforehand?
- Is my dataset that big?!

Many thanks for your help!

Best, Chloe

benjjneb commented 1 year ago

Although this is not true in most cases, chimera removal can become the most computationally costly step in some datasets. There are ways to address this, though, and it is also possible to better estimate how long the process will take.

Could the multithreading actually saturate the machine? Would it be better to use only 1 core or more?

Not sure what this means exactly. One thing I would check is that you have enough memory, and that you aren't running into issues with "swapping" between memory and the filesystem, which will dramatically slow computation. If submitting on a cluster, make sure you are requesting sufficient memory, and check how much memory is being used in an example run of that dataset. Memory requirements are not meaningfully affected by multithreading, so you definitely will want multithreading turned on.

What method would you advise: consensus vs per-sample? Or other function arguments?

pool=TRUE for chimera identification is only recommended if using the pool=TRUE mode during dada denoising. Based on what you've said about following the Big Data workflow, you should use the default "consensus" mode of chimera removal.

Would it be better to remove all short sequences beforehand?

What is the length distribution in your data? There is a possibility that the data processing you did with cutadapt introduced artefactual length variation into the dataset. In general it is totally valid to enforce a length distribution (e.g., min/max sequence lengths) as a filtering strategy. Removing ASVs will reduce the size, and therefore the running time, of the sequence table.
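A minimal sketch of that kind of length filter on the sequence table (assuming seqtab is your merged sequence table; the 400-430 window is purely illustrative and should be adjusted to your amplicon):

```r
# Inspect the distribution of merged sequence lengths
table(nchar(getSequences(seqtab)))

# Keep only ASVs within an expected length window (example range only)
seqtab.len <- seqtab[, nchar(colnames(seqtab)) %in% 400:430]
```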

Is my dataset that big?!

~1M ASVs should be computationally tractable. It is a bit surprising to see that many unique ASVs from just 68 samples of ~500k reads per sample. What environment is being sampled here?

Some general ideas to try: First, make sure that primer removal is being handled correctly. If you are using a standard V3-V4 approach like the "Illumina" library design, in which fixed-length primers appear at the start of the reads, you will get better results by removing them with filterAndTrim(..., trimLeft=c(FWD_PRIMER_LEN, REV_PRIMER_LEN)) than with cutadapt, which could be introducing issues into the processing depending on exactly which flags you are using.
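A minimal sketch of that primer-trimming approach (rawFs/filtFs/rawRs/filtRs are placeholder file-path vectors, and the primer lengths shown are just those of the 341F/805R pair):

```r
library(dada2)

# Primer lengths for the 341F/805R pair (17 and 21 nt); adjust to your primers
FWD_PRIMER_LEN <- 17
REV_PRIMER_LEN <- 21

out <- filterAndTrim(
  fwd = rawFs, filt = filtFs,        # placeholder paths to raw/filtered forward reads
  rev = rawRs, filt.rev = filtRs,    # placeholder paths to raw/filtered reverse reads
  trimLeft = c(FWD_PRIMER_LEN, REV_PRIMER_LEN),  # clip the primers off the read starts
  maxN = 0, maxEE = c(2, 2), truncQ = 2,
  multithread = TRUE
)
```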

You can enforce a minParentAbundance threshold in removeBimeraDenovo. This will reduce the number of ASVs considered as possible chimeric "parents" and thereby reduce running time. For your dataset, setting minParentAbundance=10, for example, might cut down running time quite a bit without much negative effect on chimera detection.
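A minimal sketch of that call (assuming seqtab is the merged sequence table; minParentAbundance=10 is the example value from above, not a general recommendation):

```r
# Consensus chimera removal with a minimum parent-abundance threshold
seqtab.nochim <- removeBimeraDenovo(
  seqtab,
  method = "consensus",
  minParentAbundance = 10,  # only ASVs with >= 10 reads in a sample are considered as parents
  multithread = TRUE,
  verbose = TRUE
)
```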

You could also look at running time as a function of including 1, 2, 4, 8 ... samples to get a sense of how time scales with dataset size, and thereby a better idea of how long 68 samples should take.
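A rough sketch of that timing experiment (assuming samples are the rows of seqtab; the subset sizes are illustrative):

```r
# Time chimera removal on growing subsets of samples to see how it scales
for (n in c(1, 2, 4, 8)) {
  sub <- seqtab[seq_len(n), , drop = FALSE]
  sub <- sub[, colSums(sub) > 0, drop = FALSE]   # drop ASVs absent from this subset
  t <- system.time(
    removeBimeraDenovo(sub, method = "consensus", multithread = TRUE)
  )
  cat(n, "samples:", round(t[["elapsed"]]), "seconds\n")
}
```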

ChloePZS commented 1 year ago

Thanks a lot for your quick reply!

The process doesn't do any swapping, and we haven't used a cluster either. The computer I am using has 12 cores (2.4 GHz) and 16 GB of RAM. I did a test with two samples (40,159 ASVs), and it took ~2 h using multithreading on 11 cores with the 'consensus' method.

After checking the ASV distribution across samples, I noticed that ~1M ASVs were detected in only 1 sample.

So you were right, I shouldn't have that many unique ASVs! Most of those ASVs must be erroneous. Samples are from coral reef water filters (5 L), with 12 replicates for each treatment level plus blanks.

The ASV length distribution is indeed wide, with quite a lot of short sequences.

The sequencing platform used 4 different primer sequences that vary by 1-3 nt. Those extra bases were present before the FWD & REV primers. Hence, I couldn't use trimLeft, as my primers were of different lengths. So I used Cutadapt, after having removed the N bases, with the primers (341F - 805R):

FWD <- "CCTACGGGNGGCWGCAG"
REV <- "GACTACHVGGGTATCTAATCC"

But Cutadapt is definitely producing reads of different lengths, which I wasn't sure how to deal with.

In filterAndTrim(...) I used truncLen = c(226, 217) to remove the last 10 nt of the FWD reads & 5 nt of the REV reads, considering the minimum length the reads would have. But it seems that wasn't the right way to go. As you suggested, it would have been better to set a min & max length instead? The raw reads are of quite good quality, so maybe there is no need to truncate the ends?

Actually when running plotQualityProfile on my cutFs & cutRs reads, I had the error "Error in density.default(qscore) : 'x' contains missing values". So there is definitely something wrong with my cutadapt reads...

Many thanks again, Chloé

benjjneb commented 1 year ago

The sequencing platform used 4 different primer sequences that vary by 1-3 nt. Those extra bases were present before the FWD & REV primers. Hence, I couldn't use trimLeft, as my primers were of different lengths. So I used Cutadapt, after having removed the N bases, with the primers (341F - 805R): FWD <- "CCTACGGGNGGCWGCAG" REV <- "GACTACHVGGGTATCTAATCC" But Cutadapt is definitely producing reads of different lengths, which I wasn't sure how to deal with.

So this is almost certainly the source of your problems here. The "heterogeneity spacers" approach was developed to create heterogeneity in the sequenced bases of amplicon libraries, as a way to help the Illumina base-calling calibration. But it requires careful and accurate removal of those spacers and primers in order to work with DADA2. Basically, DADA2 makes a pretty strong assumption that every read at least starts at the same position, and if that isn't the case, it creates a variety of problems downstream.

We do not have a pre-built solution for length-varying primer designs. I bet cutadapt can manage this, but cutadapt has a lot of parameters and, from experience, it can behave differently than expected sometimes. Do you have any resources/contacts with the folks who performed this sequencing? Ideally they have already developed a solution for trimming off these variable-length primers.

ChloePZS commented 1 year ago

I checked for the presence of primers + additional bases before & after running Cutadapt, for example in the FWD reads of one sample.

Cutadapt removes the primers and everything before them, though I noticed some errors in the primers (e.g. read 3 above). Reads show length variation of 1-3 b as expected. Some primers are still present, but it's very minor compared to before running Cutadapt.

Despite length variation, do you think DADA2 is still suitable in my case?

I tried using minLen = 200 in filterAndTrim on two of my samples, and the number of ASVs decreased twofold compared to using truncLen. But I still obtained quite a high number of ASVs, with 7,453 non-chimeric sequences from those two samples.

Many thanks again for your insights

benjjneb commented 1 year ago

DADA2 can still work, but the more of those unremoved primers there are, the more issues are going to crop up. My concern is that despite what the primerHits function is saying, there clearly are unremoved primers on e.g. sq3. And as a very rough estimate, 1/10th of your reads may retain unremoved primers.

That said, sq3 points out another issue: that sequence has a large polyG tail, which is a common error type of two-color Illumina chemistries. There may be a significant amount of low-complexity sequence "contamination" in your read set.

Try plotComplexity("path/to/reads.fastq") to look at this more closely. Is there a significant low complexity mode (probably dominated by reads with large polyG tails)? Removing that set of reads might help as well.
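A minimal sketch of that check, plus how low-complexity filtering could be added at the filterAndTrim step (the rm.lowcomplex threshold of 8 is only an example value; fwd/filt/rev/filt.rev are placeholder file-path vectors):

```r
# Look at the complexity distribution of the primer-trimmed reads
plotComplexity("path/to/reads.fastq")

# Optionally drop low-complexity reads (e.g. long polyG tails) during filtering;
# reads whose kmer complexity falls below rm.lowcomplex are removed
out <- filterAndTrim(fwd, filt, rev, filt.rev,
                     rm.lowcomplex = 8,   # example threshold, not a recommendation
                     multithread = TRUE)
```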

ChloePZS commented 1 year ago

Thanks a lot for coming back to my issue. plotComplexity (after running Cutadapt & filtering) gives me something quite good, I think.

Would you still add rm.lowcomplex during filterAndTrim?

I did some extra tests: I added maxLen = 250 to filterAndTrim so I could get rid of those 251 bp reads with unremoved primers. They represented only a small proportion, as I still kept ~80-90% of the reads.

However, after that, most of the ASVs were still detected in only a single sample.

I realized my NovaSeq data could be the problem, due to the binned quality scores. I found issue #1307... My error rate estimates were indeed pretty bad, and characteristic of what other users have obtained with NovaSeq data.

Do you think this could explain the inflated number of singletons in my data?

I will try to run the different modified error rate estimation functions and see what gives me the best estimates, and then try again on my test samples.

Cheers, Chloe

benjjneb commented 1 year ago

I don't see any obvious explanation for having such a large number of single-sample ASVs from the diagnostics you've posted so far.

One thing I would look at is the relative abundance that is accounted for by these single-sample ASVs. Are they a bunch of really rare things? Or do they make up a decent fraction of the total reads in some or many samples?
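A minimal sketch of that check (assuming seqtab is the sample-by-ASV count table):

```r
# ASVs detected in exactly one sample
single.sample <- colSums(seqtab > 0) == 1

# Overall fraction of reads they account for
sum(seqtab[, single.sample]) / sum(seqtab)

# Per-sample fraction of reads assigned to single-sample ASVs
rowSums(seqtab[, single.sample, drop = FALSE]) / rowSums(seqtab)
```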

I would probably also do some exploratory BLAST-ing of representative single-sample ASVs (and, as a comparison group, cosmopolitan ASVs). Does this suggest anything?

ChloePZS commented 1 year ago

Hi Benjamin,

I ran the different error rate estimation functions, and the best plots were obtained when altering the loess arguments (weights and span) & enforcing monotonicity (checked on the FWD reads).
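For reference, the general shape of that step (a minimal sketch; loessErrfun_mod stands for a user-modified variant of dada2's loessErrfun along the lines discussed in issue #1307, not a built-in function):

```r
# learnErrors accepts a custom error-estimation function, so the modified
# loess fit (altered weights/span, enforced monotonicity) can be plugged in
errF <- learnErrors(filtFs, errorEstimationFunction = loessErrfun_mod,
                    multithread = TRUE, verbose = TRUE)
errR <- learnErrors(filtRs, errorEstimationFunction = loessErrfun_mod,
                    multithread = TRUE, verbose = TRUE)

# Check the fits: error rates should decrease monotonically with quality score
plotErrors(errF, nominalQ = TRUE)
```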

I processed all 68 samples, and after merging I obtained ~700K ASVs, which is more reasonable. But ~90% of those are still sample-specific.

However, those sample-specific ASVs don't account for an important fraction of the total reads per sample.

Along the pipeline, the read loss is totally acceptable I think, and I obtained ~200K non-chimeric ASVs.

I haven't been able to assign taxonomy to all of the 200K ASVs (our server crashes systematically), but I did a test with 2 samples (~16K ASVs). The ASVs that couldn't be assigned to a Phylum were all sample-specific, and 36% of the Phyla were found only in sample-specific ASVs. I did a bit of BLASTing and it matched.

At this stage, as you said, I don't see any further explanation for this ASV distribution pattern across samples. Could it be due to the very high sequencing depth of the NovaSeq technology, which may simply detect rare taxa?

Would it be alright to move forward by getting rid of the sample-specific ASVs? That would give me 18,235 ASVs (present in at least 2 samples) to work with across my 68 samples.

Many thanks again! Chloe

benjjneb commented 1 year ago

Would it be alright to move forward by getting rid of the sample-specific ASVs? That would give me 18,235 ASVs (present in at least 2 samples) to work with across my 68 samples.

Yes, that kind of filtering is common and acceptable, especially since you've established that these ASVs account for a low fraction of the total reads, and you've ruled out obvious computational mistakes as a cause. That said, remember to report that step in your eventual publication.
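A minimal sketch of that prevalence filter (assuming seqtab.nochim is the chimera-removed sequence table):

```r
# Keep only ASVs detected in at least 2 samples
keep <- colSums(seqtab.nochim > 0) >= 2
seqtab.prev <- seqtab.nochim[, keep]

# Numbers to report in the methods
dim(seqtab.prev)                          # samples x retained ASVs
sum(seqtab.prev) / sum(seqtab.nochim)     # fraction of reads retained
```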

ChloePZS commented 1 year ago

I'll go ahead and do that then! Many thanks again for all your insights, it's been very helpful.