benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0

removeBimera never ending #1609

Closed ChloePZS closed 6 months ago

ChloePZS commented 2 years ago

Hello, I have a V3-V4 16S rDNA metabarcoding dataset of 68 samples with a read depth of 400-900K reads/sample. Samples were sequenced on a NovaSeq 6000 (PE250). Primers were removed with Cutadapt, all sequences were filtered, and I followed the workflow for big data. After the merging step, I ended up with 1 298 725 ASVs.

I tried to run removeBimeraDenovo with default values and multithreading (12-core processor). But as the estimated running time was over 18 days (and still increasing), I stopped the process. How could we optimize the processing time of the function?

Many thanks for your help!

Best, Chloe

benjjneb commented 2 years ago

Although this is not true in most cases, chimera removal can become the most computationally costly step in some datasets. There are tools to address this though, and it is also possible to better estimate how long the process will take.

> Could the multithreading actually saturate the machine? Would it be better to use only 1 core or more?

Not sure what this means exactly. One thing I would check is that you have enough memory, and that you aren't running into issues with "swapping" between memory and the filesystem, which will dramatically slow computation. If submitting on a cluster, make sure you are requesting sufficient memory, and check how much memory is being used in an example run of that dataset. Memory requirements are not meaningfully affected by multithreading, so you definitely will want multithreading turned on.

> What method would you advise: consensus vs per-sample? Or other function arguments?

pool=TRUE for chimera identification is only recommended if using the pool=TRUE mode during dada denoising. Based on what you've said about following the Big Data workflow, you should use the default "consensus" mode of chimera removal.

> Would it be better to remove all short sequences beforehand?

What is the length distribution in your data? There is a possibility that the data processing you did with cutadapt introduced artefactual length variation into the dataset. In general it is totally valid to enforce a length distribution (e.g., min/max sequence lengths) as a filtering strategy. Removing ASVs will reduce the size of the sequence table and therefore the running time.
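A minimal sketch of enforcing a length window directly on the sequence table, assuming `seqtab` is the usual samples-by-ASVs matrix with the ASV sequences as column names. The toy matrix and the 5 to 8 nt window are made up for illustration; for merged V3-V4 amplicons the window would be chosen from your own length histogram.

```r
# Toy sequence table: rows are samples, columns are ASV sequences.
seqtab <- matrix(
  c(10, 0, 3, 1,
     5, 2, 0, 0),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("S1", "S2"),
                  c("ACGTACGT", "ACG", "ACGTAC", "ACGTACGTACGT"))
)

# Inspect the length distribution of the ASVs.
asv_len <- nchar(colnames(seqtab))
print(table(asv_len))

# Keep only ASVs whose length falls inside the chosen window
# (5 to 8 nt is an arbitrary toy range, not a recommendation).
keep <- asv_len >= 5 & asv_len <= 8
seqtab_len <- seqtab[, keep, drop = FALSE]
print(dim(seqtab_len))
```

Filtering at this stage shrinks the table before chimera removal, which is where the running time was a problem in this thread.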

> Is my dataset that big?!

~1M ASVs should be computationally tractable. It is a bit surprising to see that many unique ASVs from just 68 samples of ~500k reads per sample. What environment is being sampled here?

Some general ideas to try:

Make sure that primer removal is being handled correctly. If you are using a standard V3V4 approach like the "Illumina" library approach, in which the fixed-length primers appear at the start of the reads, you will get better results by removing them using `filterAndTrim(..., trimLeft=c(FWD_PRIMER_LEN, REV_PRIMER_LEN))` than by using cutadapt, which could be introducing issues into the processing depending on exactly what flags you are using.

You can enforce a minParentAbundance threshold in removeBimeraDenovo. This will reduce the number of ASVs considered as possible chimeric "parents" and thereby reduce running time. For your dataset, setting minParentAbundance=10, for example, might cut down running time quite a bit without much negative effect on chimera detection.

You could look at running time as a function of including 1, 2, 4, 8 ... samples to get a sense of how time scales with dataset size, and so get a better idea of how long 68 samples should take.
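The scaling check suggested above could be sketched roughly as follows. `time_chimera_scaling` is a hypothetical helper, not part of dada2; it times any function you hand it on nested subsets of samples. The toy table and the `identity` stand-in are there only so the sketch runs without dada2 installed.

```r
# Time a chimera-removal function on growing subsets of samples.
# `chimera_fun` is any function that takes a sequence table.
time_chimera_scaling <- function(seqtab, sizes, chimera_fun) {
  sizes <- sizes[sizes <= nrow(seqtab)]
  res <- lapply(sizes, function(n) {
    sub <- seqtab[seq_len(n), , drop = FALSE]
    # Drop ASVs absent from this subset so the table reflects n samples.
    sub <- sub[, colSums(sub) > 0, drop = FALSE]
    elapsed <- system.time(chimera_fun(sub))[["elapsed"]]
    data.frame(n_samples = n, n_asvs = ncol(sub), seconds = elapsed)
  })
  do.call(rbind, res)
}

# Toy demonstration with a stand-in function (identity).
set.seed(1)
toy <- matrix(rpois(8 * 50, lambda = 0.5), nrow = 8,
              dimnames = list(paste0("S", 1:8), paste0("asv", 1:50)))
timing <- time_chimera_scaling(toy, c(1, 2, 4, 8), identity)
print(timing)

# With dada2 loaded, the real call could look like this (not run here):
# timing <- time_chimera_scaling(seqtab, c(1, 2, 4, 8), function(st)
#   dada2::removeBimeraDenovo(st, method = "consensus",
#                             minParentAbundance = 10, multithread = TRUE))
```

Plotting `seconds` against `n_asvs` gives a rough basis for extrapolating to the full 68 samples before committing to a multi-day run.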

ChloePZS commented 2 years ago

Thanks a lot for your quick reply!

The process doesn't do any swapping and we haven't used a cluster either. The computer I am using has 12 cores (2.4 GHz) and 16 GB of RAM. I did a test with two samples (40 159 ASVs), and it took ~2h using multithreading on 11 cores and the 'consensus' method.

After checking the distribution of ASVs across samples, I noticed that ~1M ASVs were detected in only 1 sample... [image]

So you were right, I shouldn't have that many unique ASVs! Most of those ASVs must be erroneous. Samples are from coral reef water filters (5 L), with 12 replicates for each treatment level, plus blanks.

The ASV length distribution is indeed wide, with quite a lot of short sequences. [image]

The sequencing platform used 4 different primer sequences that vary in length by 1-3 nt. Those extra bases were present before the FWD & REV primers. Hence, I couldn't use trimLeft, as my primers were of different lengths. So I used Cutadapt, after having removed N bases, with the primers (341F - 805R):

FWD <- "CCTACGGGNGGCWGCAG"
REV <- "GACTACHVGGGTATCTAATCC"

But Cutadapt is definitely producing reads of different lengths, which I wasn't sure how to deal with.

In filterAndTrim(...), I used truncLen = c(226,217) to remove the last 10 nt of the FWD reads and 5 nt of the REV reads, considering the minimum length the reads would have. But it seems that wasn't the right way to go. As you suggested, would it have been better to set a min & max length instead? Reads are of quite good quality (raw reads shown below), so maybe there is no need to truncate the ends? [image]

Actually when running plotQualityProfile on my cutFs & cutRs reads, I had the error "Error in density.default(qscore) : 'x' contains missing values". So there is definitely something wrong with my cutadapt reads...

Many thanks again, Chloé

benjjneb commented 2 years ago

> The sequencing platform used 4 different primer sequences that vary from 1-3nt. Those extra bases were present before the FWD & REV primers. Hence, I couldn't use trimLeft as my primers were of different lengths. So I used Cutadapt after having removed N bases with the primers (341F - 805R): FWD <- "CCTACGGGNGGCWGCAG" REV <- "GACTACHVGGGTATCTAATCC" But Cutadapt is definitely producing reads of different length, which I wasn't sure how to deal with.

So this is almost certainly the source of your problems here. The "heterogeneity spacers" approach was developed to create heterogeneity in the sequenced bases in amplicon libraries as a way to help the Illumina base-calling calibration. But it requires careful and accurate removal of those primers in order to work with DADA2. Basically, DADA2 makes a pretty strong assumption that every read is at least starting in the same position, and if that isn't the case, it creates a variety of problems downstream.

We do not have a pre-built solution for length-varying primer designs. I bet cutadapt can manage this, but cutadapt has a lot of parameters and, from experience, it can sometimes behave differently than expected. Do you have any resources/contacts with the folks who performed this sequencing? Ideally they have already developed a solution for trimming off these variable-length primers.
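To make the "every read starts at the same position" requirement concrete, here is a rough base-R illustration of stripping a 0 to 3 nt spacer plus the primer from the start of a read. It is not a dada2 or cutadapt feature: `iupac_to_regex` and `trim_spacer_primer` are hypothetical helpers, the reads are invented, and matching is exact (no mismatches or indels tolerated), so it is a way to inspect the problem rather than a production trimmer.

```r
# Translate an IUPAC primer into a regular expression.
iupac_to_regex <- function(primer) {
  codes <- c(A = "A", C = "C", G = "G", T = "T",
             R = "[AG]", Y = "[CT]", S = "[CG]", W = "[AT]",
             K = "[GT]", M = "[AC]", B = "[CGT]", D = "[AGT]",
             H = "[ACT]", V = "[ACG]", N = "[ACGT]")
  paste(codes[strsplit(primer, "")[[1]]], collapse = "")
}

# Remove an optional spacer of up to `max_spacer` bases plus the primer
# from the start of each read. Reads without an exact match become NA.
trim_spacer_primer <- function(reads, primer, max_spacer = 3) {
  pattern <- paste0("^[ACGTN]{0,", max_spacer, "}", iupac_to_regex(primer))
  hit <- regexpr(pattern, reads, perl = TRUE)
  out <- rep(NA_character_, length(reads))
  found <- hit == 1
  out[found] <- substring(reads[found],
                          attr(hit, "match.length")[found] + 1)
  out
}

FWD <- "CCTACGGGNGGCWGCAG"
reads <- c(
  "CCTACGGGAGGCAGCAGTGGGGAATATT",     # no spacer
  "TCCTACGGGTGGCTGCAGTGGGGAATATT",    # 1 nt spacer
  "ATGCCTACGGGCGGCAGCAGTGGGGAATATT",  # 3 nt spacer
  "GGGGGGGGGGGGGGGGGGGGGGGGGGGG"      # no primer at all
)
trimmed <- trim_spacer_primer(reads, FWD)
print(trimmed)
```

After trimming, the three primer-bearing toy reads all begin at the same biological position, which is the property DADA2 relies on; the read with no primer is flagged rather than kept.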

ChloePZS commented 2 years ago

I checked for the presence of primers + additional bases before & after Cutadapt. Example for one sample (FWD reads): [image]

Cutadapt removes the primers and everything before them, though I noticed some errors in the primers (e.g. read 3 above). Reads vary in length by 1-3 bp, as expected. Some primers are still present, but they are very rare compared to before running cutadapt. [image] [image]

Despite length variation, do you think DADA2 is still suitable in my case?

I tried using minLen = 200 in filterAndTrim on two of my samples, and the number of ASVs decreased 2-fold compared to when using truncLen. But I still obtained quite a high number of ASVs: 7453 non-chimeric sequences from those two samples.

Many thanks again for your insights

benjjneb commented 2 years ago

DADA2 can still work, but the more of those unremoved primers there are, the more issues are going to crop up. My concern is that despite what the primerHits function is saying, there clearly are unremoved primers on e.g. sq3. And as a very rough estimate, 1/10th of your reads may retain unremoved primers.

That said, sq3 points out another issue: That sequence has a large polyG tail that is a common error-type of two-color Illumina chemistries. There may be a significant amount of low complexity sequence "contamination" of your read set.

Try plotComplexity("path/to/reads.fastq") to look at this more closely. Is there a significant low complexity mode (probably dominated by reads with large polyG tails)? Removing that set of reads might help as well.
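As a quick cross-check alongside plotComplexity, polyG tails can also be flagged with a plain regular expression. This is only a crude proxy written for illustration, not dada2's complexity measure; `has_polyg_tail`, the 20 nt threshold and the toy reads are all assumptions.

```r
# Flag reads that end in a long run of G (a crude proxy for polyG tails).
has_polyg_tail <- function(seqs, min_run = 20) {
  grepl(paste0("G{", min_run, ",}$"), seqs)
}

reads <- c(
  paste0("ACGTTGCAAC", strrep("G", 30)),  # long polyG tail
  "ACGTTGCAACGGTCAGGTTACCAGT",            # ordinary read
  paste0("TTGACCGGGG", strrep("G", 5))    # short G run, below threshold
)

flagged <- has_polyg_tail(reads)
print(flagged)

# Fraction of reads flagged.
print(mean(flagged))
```

If the flagged fraction is substantial, that would be consistent with a low-complexity mode in the plotComplexity histogram.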

ChloePZS commented 2 years ago

Thanks a lot for coming back to my issue. plotComplexity (after running cutadapt & filtering) gives me something quite good, I think: [image] [image]

Would you still add "rm.lowcomplex" during filterAndTrim ?

I did some extra tests: I added maxLen = 250 to filterAndTrim so I could get rid of those 251 bp reads with unremoved primers. They represented only a small proportion, as I still kept ~80-90% of the reads.

However, after that, most of the ASVs were still detected in a single sample. [image]

I realized my NovaSeq data could be the problem due to the binned quality scores. I found issue #1307. My error rate estimates were indeed pretty bad, and characteristic of what other users obtained with NovaSeq data. [image]

Do you think this could explain the inflated number of singletons in my data?

I will try to run the different modified error rate estimation functions and see what gives me the best estimates, and then try again on my test samples.

Cheers, Chloe

benjjneb commented 2 years ago

I don't see any obvious explanation for having such a large number of single-sample ASVs from the diagnostics you've posted so far.

One thing I would look at is the relative abundance that is accounted for by these single-sample ASVs. Are they a bunch of really rare things? Or do they make up a decent fraction of the total reads in some or many samples?
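The check described above can be sketched in a few lines of base R, assuming the usual samples-by-ASVs count matrix. The toy table is invented; only the arithmetic is the point.

```r
# Toy sequence table: 3 samples, 5 ASVs (counts are made up).
seqtab <- matrix(
  c(90, 5,  0, 5,  0,
    80, 0, 20, 0,  0,
    70, 0,  0, 0, 30),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("S1", "S2", "S3"), paste0("asv", 1:5))
)

# Prevalence: number of samples in which each ASV is present.
prevalence <- colSums(seqtab > 0)
single <- prevalence == 1

# Share of ASVs that are sample-specific.
asv_frac <- mean(single)

# Share of all reads carried by sample-specific ASVs.
total_frac <- sum(seqtab[, single]) / sum(seqtab)

# The same share, computed within each sample.
per_sample_frac <- rowSums(seqtab[, single, drop = FALSE]) / rowSums(seqtab)

print(asv_frac)
print(total_frac)
print(per_sample_frac)
```

A high `asv_frac` combined with low `total_frac` and `per_sample_frac` means the single-sample ASVs are numerous but rare, which is the pattern reported later in this thread.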

I would probably also do some exploratory BLAST-ing of representative single-sample ASVs (and, as a comparative group, cosmopolitan ASVs). Does this suggest anything?

ChloePZS commented 2 years ago

Hi Benjamin,

I ran the different functions for error rate estimation, and the best plots were obtained when altering the loess arguments (weights and span) and enforcing monotonicity (shown below for FWD reads). [image]
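For readers unfamiliar with the "enforcing monotonicity" step, the idea alone can be illustrated as below. This is not the modified error function from issue #1307 or from any tutorial; `enforce_monotonic` is a hypothetical helper, and the matrix is a made-up stand-in for fitted error rates with transitions in rows and quality scores in increasing order across columns.

```r
# Make each row non-increasing from left (low Q) to right (high Q):
# every entry becomes the maximum of itself and all entries to its right.
enforce_monotonic <- function(err) {
  out <- t(apply(err, 1, function(x) rev(cummax(rev(x)))))
  dimnames(out) <- dimnames(err)
  out
}

# Toy fitted error rates with dips that break monotonicity.
err <- matrix(
  c(0.10, 0.02, 0.05, 0.01,
    0.20, 0.10, 0.05, 0.06),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("A2C", "A2G"), c("Q2", "Q11", "Q25", "Q37"))
)

err_mono <- enforce_monotonic(err)
print(err_mono)
```

The rationale is that an error rate should not be estimated as lower at a worse quality score than at a better one, an artefact that the few binned NovaSeq quality values make more likely in the loess fit.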

I ran all 68 samples and, after merging, I obtained ~700K ASVs, which is more reasonable. But ~90% of those are still sample-specific. [image]

However, those sample-specific ASVs didn't account for a large fraction of the total reads per sample. [image] [image]

Along the pipeline, read loss is totally acceptable, I think, and I obtained ~200K non-chimeric ASVs. [image]

I haven't been able to assign taxonomy to all of the 200K ASVs (our server crashes systematically), but in a test with 2 samples (~16K ASVs), the ASVs that couldn't be assigned to a Phylum were all sample-specific, and 36% of the phyla were found only in sample-specific ASVs. I BLASTed a few of them and the results matched.

At this stage, as you said, I don't see any further explanation for the observed distribution of ASVs across samples. Could it be due to the very high sequencing depth of NovaSeq technology, which may simply detect rare taxa?

Would it be alright to move forward by getting rid of the sample-specific ASVs? That would give me 18 235 ASVs (present in at least 2 samples) to work with across my 68 samples.

Many thanks again! Chloe

benjjneb commented 2 years ago

> Would it be alright to move forward by getting rid of the sample-specific ASVs? That would give me 18 235 ASVs (present in at least 2 samples) to work with across my 68 samples.

Yes, that kind of filtering is common and acceptable, especially since you've established that these account for a low fraction of total reads, and ruled out obvious computational mistakes as a cause. That said, remember to report that step in your eventual publication.
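One way such a prevalence filter might look in base R, applied to a chimera-free sequence table. This is a sketch and not code posted by either participant; `seqtab_nochim`, the counts and the threshold of 2 samples are illustrative.

```r
# Toy chimera-free sequence table: 3 samples, 4 ASVs.
seqtab_nochim <- matrix(
  c(12, 0, 7, 0,
     9, 4, 0, 0,
    15, 0, 0, 3),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("S1", "S2", "S3"), c("asvA", "asvB", "asvC", "asvD"))
)

# Keep ASVs detected in at least `min_samples` samples.
min_samples <- 2
prevalence <- colSums(seqtab_nochim > 0)
seqtab_filt <- seqtab_nochim[, prevalence >= min_samples, drop = FALSE]

# Record what was removed so the step can be reported in the methods.
removed_asvs <- sum(prevalence < min_samples)
removed_read_frac <- 1 - sum(seqtab_filt) / sum(seqtab_nochim)
cat(removed_asvs, "ASVs removed, carrying",
    round(100 * removed_read_frac, 1), "% of reads\n")
```

Keeping the removed counts alongside the filtered table makes it straightforward to state in a publication how many ASVs and what share of reads the filter discarded.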

ChloePZS commented 2 years ago

I'll go ahead and do that then ! Many thanks again for all your insights, it's been very helpful.

Guillermouceda commented 2 weeks ago

Hello @ChloePZS ,

I seem to have the same issue as you. I also had sequences that contained polyG tails, which I finally got rid of with cutadapt by adding the --discard-untrimmed flag and setting a minimum length.

Before the cutadapt step the complexity of my sequences was: Complexity_pre_cutadapt.pdf

After cutadapt and quality filtering: Complexity_post_cutadapt_post_filtering.pdf

My error models look like this:

Do you think I should do as you did? Should I alter the error function to get a better fit, especially for the Rev sequences?

My seqtab has the following dimensions:

> dim(seqtab)
[1]    162 121909

However, when I check their distribution by sample, it looks like most of them belong to just one sample:

> ASV_sample

     1      2      3      4      5      6      7      8      9     10     11     12     13     14     15     16     17     18     20     21     22     23     24     25     26     27     28     32 
118645   2496    384    146     72     48     22     13     11      8      8      8      3      4      2      3      5      3      2      1      5      1      2      3      1      1      2      1 
    37     38     41     42     47     49     52     97 
     2      1      1      1      1      1      1      1

I have tracked how many sequences were retained after each step and it looks like this:

> track
                      input filtered denoised  merged  tabled nonchim
LM10_R1.fastq.gz        357      278      275     252     252     252
LM100_R1.fastq.gz     11159     8575     8567    2563    2563    2563
LM102_R1.fastq.gz     10369     8722     8718    8696    8696    8612
LM103_R1.fastq.gz         1        1        1       1       1       1
LM103A_R1.fastq.gz 11886403  9226387  9206861 8419004 8419004 7106650
LM105_R1.fastq.gz    108963    87709    87340   75997   75997   73781
LM110_R1.fastq.gz       861      505      491     488     488     488
LM111_R1.fastq.gz     61133    37988    37676   33882   33882   32618
LM114_R1.fastq.gz     18342    11801    11796    8651    8651    8364
LM119_R1.fastq.gz     21097    15793    15651   15083   15083   14987
LM12_R1.fastq.gz     117912    69481    68922   65163   65163   64165
LM122_R1.fastq.gz    255725   187868   186570  173729  173729  168057
LM123_R1.fastq.gz       994      689      680     662     662     662
LM124_R1.fastq.gz      2136     1203     1202    1202    1202    1202
LM13_R1.fastq.gz        340      267      262     262     262     262
LM130_R1.fastq.gz       229      164      164     164     164     164
LM132_R1.fastq.gz    293561    41777    41776   30553   30553   30553
LM133_R1.fastq.gz     94190    72306    72060   71240   71240   71227
LM136_R1.fastq.gz      7423     5237     5213    3869    3869    3784
LM14_R1.fastq.gz       1983     1368     1368    1305    1305    1285
LM17_R1.fastq.gz      44319    34807    34676   33220   33220   32983
LM19_R1.fastq.gz     105516    17724    17375   15454   15454   14813
LM20_R1.fastq.gz     132439     4140     4074    3945    3945    3684
LM22_R1.fastq.gz     109995    69969    69838   69186   69186   69050
LM23_R1.fastq.gz     831582   545849   544730  362330  362330  323708
LM28_R1.fastq.gz     161875   117334   116824  113881  113881  112708
LM29_R1.fastq.gz     122056    93009    92985   90403   90403   89919
LM30_R1.fastq.gz    3044201  1665273  1657296 1454081 1454081 1154577
LM32_R1.fastq.gz       6541     4926     4803    4756    4756    4685
LM33_R1.fastq.gz     478981   388803   388437  373981  373981  368669
LM35_R1.fastq.gz      44474    35023    34997   34570   34570   34406
LM36_R1.fastq.gz      83485    66319    65805   63966   63966   59481
LM37_R1.fastq.gz       2115     1498     1471    1418    1418    1418
LM38_R1.fastq.gz        281      228      224     200     200     200
LM4_R1.fastq.gz       95891    72364    71950   70436   70436   66008
LM41_R1.fastq.gz         20       10       10      10      10      10
LM42_R1.fastq.gz   13948140  8821305  8796837 6979161 6979161 5810209
LM43_R1.fastq.gz      21672    16735    16601   16334   16334   16284
LM44_R1.fastq.gz      43683    31881    31608   29850   29850   29080
LM46_R1.fastq.gz          5        5        5       5       5       5
LM48_R1.fastq.gz      12915     8290     8224    7327    7327    7267
LM5_R1.fastq.gz        5158     3311     3276    3231    3231    2946
LM51_R1.fastq.gz       2392     1946     1915    1909    1909    1868
LM6_R1.fastq.gz       88478    59506    59380   42754   42754   40205
LM60_R1.fastq.gz        268      201      186     181     181     181
LM63_R1.fastq.gz      67672    47892    47459   45424   45424   44947
LM65_R1.fastq.gz     120337    98308    98280   85052   85052   80995
LM68_R1.fastq.gz         73       54       51      51      51      51
LM69_R1.fastq.gz      43707    34123    34085   33421   33421   33234
LM70_R1.fastq.gz     107388    75402    75004   72211   72211   70194
LM71_R1.fastq.gz      26549    17496    17422    8393    8393    8393
LM76_R1.fastq.gz        176      132      131     131     131     131
LM77_R1.fastq.gz    2491429  1670228  1661434 1526054 1526054 1308260
LM78_R1.fastq.gz      59838    33631    33549   19429   19429   19316
LM79_R1.fastq.gz         85       64       56      56      56       0
LM80_R1.fastq.gz      21505    14011    13817   13228   13228   13199
LM82_R1.fastq.gz     123635    90617    89892   85788   85788   84506
LM85_R1.fastq.gz      90983    67158    67104   65658   65658   64152
LM88_R1.fastq.gz      77925    59478    59256   56938   56938   54833
LM9_R1.fastq.gz       71149    56104    56016   55275   55275   55061
LM90_R1.fastq.gz     283033    57017    57013   43948   43948   43839
LM91_R1.fastq.gz      65094    47236    47104   44406   44406   44132
LM93_R1.fastq.gz        492      322      321     321     321     321
LM94_R1.fastq.gz      13349     9455     9413    7096    7096    7048
LM94A_R1.fastq.gz     83064    67845    67679   65063   65063   61663
LM98_R1.fastq.gz      76494    47908    47796   45207   45207   40951
LM99_R1.fastq.gz     110524    76566    76014   69505   69505   67534
LR1_R1.fastq.gz      339834    56085    56061   30794   30794   30794
LR10_R1.fastq.gz      28725    22921    22856   21107   21107   21075
LR101_R1.fastq.gz      1738     1358     1357    1331    1331    1331
LR103_R1.fastq.gz    532168   401712   400763  353847  353847  341575
LR106_R1.fastq.gz        48       34       32      32      32      32
LR108_R1.fastq.gz        86       72       72      53      53      53
LR109_R1.fastq.gz        27       19       19      19      19      19
LR112_R1.fastq.gz        17       14       10      10      10      10
LR113_R1.fastq.gz        65       51       49      49      49      49
LR114_R1.fastq.gz     30126    23068    23042   22503   22503   22500
LR115_R1.fastq.gz    958615   163307   162955  121643  121643   93273
LR116_R1.fastq.gz        25       16       10      10      10      10
LR117_R1.fastq.gz        27       20        6       6       6       6
LR119_R1.fastq.gz      2033     1456     1447    1217    1217    1217
LR121_R1.fastq.gz         7        4        3       0       0       0
LR127_R1.fastq.gz        97       74       71      68      68      68
LR129_R1.fastq.gz    190231    23602    23486   17633   17633   16806
LR13_R1.fastq.gz     303538   242208   241204  235019  235019  232592
LR130_R1.fastq.gz   1554363   219528   219281  149140  149140  111642
LR131_R1.fastq.gz    124397    88501    87956   72937   72937   71613
LR134_R1.fastq.gz     52051     8910     8905    6397    6397    5515
LR135_R1.fastq.gz     10533     6925     6863    6541    6541    6541
LR137_R1.fastq.gz       489      356      349     319     319     319
LR144_R1.fastq.gz    143259     6641     6498    5749    5749    5702
LR146_R1.fastq.gz       247      178      174     174     174     174
LR149_R1.fastq.gz       211      147      134     134     134     134
LR150_R1.fastq.gz     84671    11279    11276    6914    6914    6914
LR152_R1.fastq.gz    276831    37160    37140   32677   32677   32663
LR159_R1.fastq.gz        59       47       40      40      40      40
LR161_R1.fastq.gz    193625    30153    30148   14054   14054   14048
LR164_R1.fastq.gz        51       42       42      42      42      42
LR167_R1.fastq.gz      2743     1742     1695    1554    1554    1554
LR170_R1.fastq.gz      1500     1172     1172    1172    1172    1172
LR172_R1.fastq.gz      1752     1258     1251     444     444     444
LR173_R1.fastq.gz      3098     2397     2392    2164    2164    2164
LR174_R1.fastq.gz    121408    88411    88051   57769   57769   55322
LR175_R1.fastq.gz     56907    14476    14394   12316   12316   12291
LR178_R1.fastq.gz    164542    29879    29879   20121   20121   20121
LR18_R1.fastq.gz      40980    33717    33660   33481   33481   33360
LR180_R1.fastq.gz      5181     3995     3962    3771    3771    3730
LR181_R1.fastq.gz       396       96       94      87      87      87
LR182_R1.fastq.gz    123226    26010    26008   24282   24282   22154
LR184_R1.fastq.gz    231557    38981    38979    9730    9730    9730
LR185_R1.fastq.gz    263451    54144    54142   37174   37174   37150
LR188_R1.fastq.gz       628      411      404     170     170     170
LR189_R1.fastq.gz        56       29       24      17      17      17
LR19_R1.fastq.gz     213066    34021    34021   32659   32659   32588
LR190_R1.fastq.gz       147      109      103     103     103     103
LR191_R1.fastq.gz     22530    15144    15061   14205   14205   14198
LR193_R1.fastq.gz    174812   130680   129665  115371  115371  111797
LR197_R1.fastq.gz    296581    66017    66012   52334   52334   52052
LR20_R1.fastq.gz      77429    58220    58166   57353   57353   56950
LR201_R1.fastq.gz     55370    10049    10048    7731    7731    7591
LR23_R1.fastq.gz        366      290      282     279     279     279
LR25_R1.fastq.gz      22152    11928    11894   11683   11683   11588
LR27_R1.fastq.gz        529      410      408     376     376     376
LR3_R1.fastq.gz      231469    43031    42967   21313   21313   21307
LR30_R1.fastq.gz     315135    86456    86392   67340   67340   67107
LR33_R1.fastq.gz        830      614      604     604     604     568
LR36_R1.fastq.gz          2        2        2       2       2       2
LR37_R1.fastq.gz     251712    40892    40784   27720   27720   27408
LR38_R1.fastq.gz     107698    21720    21580   15762   15762   15582
LR39_R1.fastq.gz      37793    26378    26108   19881   19881   19689
LR40_R1.fastq.gz    2322366   351949   351675  255123  255123  198909
LR41_R1.fastq.gz       4662     3052     3028    2470    2470    2470
LR42_R1.fastq.gz     435529    53002    52670   33892   33892   31869
LR43_R1.fastq.gz         28        3        3       0       0       0
LR44_R1.fastq.gz         17        2        2       2       2       2
LR45_R1.fastq.gz        164      123      121      83      83      83
LR48_R1.fastq.gz     168176   138884   138777  136875  136875  134974
LR55_R1.fastq.gz         53       38       36      36      36      36
LR6_R1.fastq.gz          18       14       14      14      14      14
LR61_R1.fastq.gz     296161    72951    72891   60809   60809   60809
LR65_R1.fastq.gz     143595    24782    24618    7346    7346    7177
LR74_R1.fastq.gz     233096    23980    23638   11140   11140    9956
LR75_R1.fastq.gz       7258     5569     5540    5339    5339    5218
LR77_R1.fastq.gz     102832    73100    73075   70931   70931   70851
LR82_R1.fastq.gz     247289    52037    52033   34494   34494   34494
LR87_R1.fastq.gz     311674    51498    51484   43702   43702   43673
LR89_R1.fastq.gz       2840     1884     1819    1722    1722    1722
LR90_R1.fastq.gz      89830    69988    69945   69238   69238   68914
LR92_R1.fastq.gz     273343    24547    24533   18671   18671   18308
LR94_R1.fastq.gz     221774    14264    14254    6363    6363    6358
LR97_R1.fastq.gz        722      515      498     462     462     462
LR99_R1.fastq.gz     377573    75811    75810   50439   50439   50224
NK2_R1.fastq.gz           1        1        1       0       0       0
VK1_R1.fastq.gz       90959    64602    64331   62733   62733   61656
VK11_R1.fastq.gz      98570    52215    51814   46388   46388   45180
VK13_R1.fastq.gz      15964    11888    11878    8296    8296    8296
VK17_R1.fastq.gz     137504   112343   112117  109205  109205  108529
VK2_R1.fastq.gz       82691    57915    57731   50589   50589   49043
VK20_R1.fastq.gz      11326     8467     8372    8171    8171    8161
VK21_R1.fastq.gz     898693   699138   697027  661916  661916  639725
VK5_R1.fastq.gz         190      151      151     151     151     151
VK7_R1.fastq.gz      118274    88943    88657   76874   76874   75863

On average I retained 50% of the sequences, which looks similar to the figure you showed in the plot.
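A small sketch of how retention could be summarised from a `track` matrix like the one above, using its first four rows as input. The 1000-read cutoff for flagging shallow samples is an arbitrary assumption, not a dada2 recommendation.

```r
# First four rows of the track table shown above.
track <- matrix(
  c(  357,  278,  275,  252,  252,  252,
    11159, 8575, 8567, 2563, 2563, 2563,
    10369, 8722, 8718, 8696, 8696, 8612,
        1,    1,    1,    1,    1,    1),
  ncol = 6, byrow = TRUE,
  dimnames = list(c("LM10", "LM100", "LM102", "LM103"),
                  c("input", "filtered", "denoised",
                    "merged", "tabled", "nonchim"))
)

# Fraction of input reads retained per sample.
retained <- track[, "nonchim"] / track[, "input"]
print(round(retained, 3))

# Unweighted mean across samples versus read-weighted overall retention;
# the latter is dominated by the deepest samples.
mean_retained <- mean(retained)
overall <- sum(track[, "nonchim"]) / sum(track[, "input"])

# Samples too shallow to be informative (cutoff is arbitrary).
shallow <- rownames(track)[track[, "input"] < 1000]
print(shallow)
```

With library sizes ranging from 1 read to millions, the per-sample mean and the read-weighted figure can differ a lot, so it helps to say which one an "average" refers to.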

I was wondering how you filtered out those ASVs, and at what stage. Did you do the filtering before or after the chimera removal?

Thank you for your help!

ChloePZS commented 2 weeks ago

Hello @Guillermouceda, is your data from NovaSeq? If so, I would definitely recommend trying out the modified error rate functions. There is a great tutorial here: https://github.com/ErnakovichLab/dada2_ernakovichlab

I am just very surprised to see such variability in your library sizes; how can some samples have only 1 read?

For the singletons, I remove them after chimeric sequences have been filtered out, but only once I am certain that the pipeline is functioning correctly (e.g., proper read length, successful merging, etc.). However, I always verify that none of these ASVs represent an important fraction of the data, both in terms of % total reads and % reads within each sample.

Hope this helps !

Cheers, Chloé

Guillermouceda commented 2 weeks ago

Hello @ChloePZS,

Thank you for your suggestions and for answering so quickly. I am new to dada2. Would you be willing to share the chunk of code with which you filtered out the singletons?

I will definitely check that tutorial.

Cheers,

Guillermo