Closed ChloePZS closed 6 months ago
Although not true in most cases, chimera removal can become the most computationally costly step in some datasets. There are ways to address this, and it is also possible to better estimate how long the process will take.
Could the multithreading actually saturate the machine? Would it be better to use only 1 core or more?
Not sure what this means exactly. One thing I would check is that you have enough memory, and that you aren't running into issues with "swapping" between memory and the filesystem, which will dramatically slow computation. If submitting on a cluster, make sure you are requesting sufficient memory, and check how much memory is being used in an example run of that dataset. Memory requirements are not meaningfully affected by multithreading, so you definitely will want multithreading turned on.
What method would you advise: consensus vs per-sample ? Or other function arguments?
Pooled chimera identification (`method="pooled"`) is only recommended if using the `pool=TRUE` mode during `dada` denoising. Based on what you've said about following the Big Data workflow, you should use the default `"consensus"` mode of chimera removal.
Would it be better to remove all short sequences beforehand?
What is the length distribution in your data? There is a possibility that the data processing you did with cutadapt introduced artefactual length variation into the dataset. In general it is totally valid to enforce a length distribution (e.g. min/max sequence lengths) as a filtering strategy. Removing ASVs will reduce the size of the sequence table, and therefore the running time.
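A minimal sketch of that kind of length-based filtering on the sequence table; the 400-428 bp window is purely illustrative, so pick bounds from your own merged-length distribution:

```r
library(dada2)

# Inspect the distribution of merged sequence lengths
table(nchar(getSequences(seqtab)))

# Keep only ASVs within an expected length window (example bounds; adjust
# to the amplicon length you actually expect after primer removal)
seqtab.len <- seqtab[, nchar(colnames(seqtab)) %in% seq(400, 428)]
```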
Is my dataset that big ?!
~1M ASVs should be computationally tractable. It is a bit surprising to see that many unique ASVs from just 68 samples of ~500k reads per sample. What environment is being sampled here?
Some general ideas to try: Try to make sure that primer removal is being handled correctly. If you are using a standard V3V4 approach like the "Illumina" library approach in which the fixed-length primers appear at the start of the reads, you will get better results by removing them using `filterAndTrim(..., trimLeft=c(FWD_PRIMER_LEN, REV_PRIMER_LEN))` than using cutadapt, which could be introducing issues into the processing depending on exactly what flags you are using.
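A sketch of that positional trimming; the file paths are placeholders, and the primer lengths shown (17 nt for 341F, 21 nt for 805R) match the primer sequences quoted later in this thread:

```r
library(dada2)

# Hypothetical paths; substitute your own forward/reverse fastq files.
# 341F (CCTACGGGNGGCWGCAG) is 17 nt and 805R (GACTACHVGGGTATCTAATCC) is
# 21 nt, so fixed-position primers can be trimmed off the read starts:
out <- filterAndTrim(fwd = "fwd.fastq.gz", filt = "filt_fwd.fastq.gz",
                     rev = "rev.fastq.gz", filt.rev = "filt_rev.fastq.gz",
                     trimLeft = c(17, 21),
                     maxEE = c(2, 2), truncQ = 2, multithread = TRUE)
```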
You can enforce a `minParentAbundance` threshold in `removeBimeraDenovo`. This will reduce the number of ASVs considered as possible chimeric "parents" and thereby reduce running time. For your dataset, setting `minParentAbundance=10`, for example, might cut down running time quite a bit without much negative effect in terms of chimera detection.
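The suggestion above might look like this in practice (assuming `seqtab` is your merged sequence table):

```r
library(dada2)

# Restrict which ASVs can serve as chimera "parents" to those with
# abundance >= 10, which shrinks the search space considerably.
seqtab.nochim <- removeBimeraDenovo(seqtab, method = "consensus",
                                    minParentAbundance = 10,
                                    multithread = TRUE, verbose = TRUE)
```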
You could look at running time as a function of including 1, 2, 4, 8 ... samples to get a sense of how time is scaling with dataset size, to get a better idea how long 68 samples should take.
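The scaling experiment above can be sketched with `system.time` on growing subsets of the sequence table (`seqtab` assumed to have samples in rows):

```r
library(dada2)

# Time chimera removal on tables built from 1, 2, 4, 8 samples to
# extrapolate the cost of the full 68-sample run.
for (n in c(1, 2, 4, 8)) {
  sub <- seqtab[seq_len(n), , drop = FALSE]
  sub <- sub[, colSums(sub) > 0, drop = FALSE]  # drop ASVs absent here
  t <- system.time(
    removeBimeraDenovo(sub, method = "consensus", multithread = TRUE)
  )
  cat(n, "samples:", t[["elapsed"]], "seconds\n")
}
```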
Thanks a lot for your quick reply!
The process doesn't do any swapping, and we haven't used a cluster either. The computer I am using has 12 cores (2.4 GHz) and 16 GB of RAM. I did a test with two samples (40,159 ASVs), and it took ~2h using multithreading on 11 cores and the 'consensus' method.
After checking the ASV distribution across samples, I noticed that ~1M ASVs were only detected in 1 sample...
So you were right, I shouldn't have that many unique ASVs! Most of those ASVs must be erroneous. Samples are from coral reef water filters (5 L), 12 replicates for each level of treatment + blanks.
ASV length distribution is indeed wide, with quite a lot of short sequences.
The sequencing platform used 4 different primer sequences that vary by 1-3 nt. Those extra bases were present before the FWD & REV primers. Hence, I couldn't use `trimLeft` as my primers were of different lengths. So I used Cutadapt after having removed N bases, with the primers (341F - 805R):
FWD <- "CCTACGGGNGGCWGCAG"
REV <- "GACTACHVGGGTATCTAATCC"
But Cutadapt is definitely producing reads of different lengths, which I wasn't sure how to deal with.
In `filterAndTrim(...)`, I used `truncLen = c(226,217)` to remove the last 10 nt of FWD & 5 nt of REV reads, considering the minimum length the reads would have. But it seems that wasn't the right way to go. As you suggested, it would have been better to set a min & max length then? Reads are of quite good quality (see the raw reads below), so maybe there is no need to truncate the ends?
Actually, when running `plotQualityProfile` on my cutFs & cutRs reads, I got the error "Error in density.default(qscore) : 'x' contains missing values". So there is definitely something wrong with my cutadapt reads...
Many thanks again, Chloé
The sequencing platform used 4 different primer sequences that vary by 1-3 nt. Those extra bases were present before the FWD & REV primers. Hence, I couldn't use `trimLeft` as my primers were of different lengths. So I used Cutadapt after having removed N bases, with the primers (341F - 805R): FWD <- "CCTACGGGNGGCWGCAG" REV <- "GACTACHVGGGTATCTAATCC" But Cutadapt is definitely producing reads of different lengths, which I wasn't sure how to deal with.
So this is almost certainly the source of your problems here. The "heterogeneity spacers" approach was developed to create heterogeneity in the sequenced bases in amplicon libraries as a way to help the Illumina base-calling calibration. But it requires careful and accurate removal of those primers in order to work with DADA2. Basically, DADA2 makes a pretty strong assumption that every read is at least starting in the same position, and if that isn't the case, it creates a variety of problems downstream.
We do not have a pre-built solution for length-varying primer designs. I bet cutadapt can manage this, but cutadapt has a lot of parameters, and from experience it can behave differently than expected sometimes. Do you have any resources/contacts with the folks who performed this sequencing? Ideally they have already developed a solution for trimming off these variable-length primers.
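For reference, one way cutadapt can handle variable-length spacers before the primer: a non-anchored 5' adapter (`-g`/`-G`) removes the matched primer and everything preceding it. A sketch invoking it from R, with hypothetical file paths and an illustrative minimum-length cutoff:

```r
# Sketch only; verify flags against your cutadapt version.
# --discard-untrimmed drops reads where no primer was found.
FWD <- "CCTACGGGNGGCWGCAG"
REV <- "GACTACHVGGGTATCTAATCC"
cutadapt <- "/usr/local/bin/cutadapt"  # adjust to your install

system2(cutadapt, args = c(
  "-g", FWD, "-G", REV,
  "--discard-untrimmed",
  "-m", "50",                        # drop very short leftovers
  "-o", "cut_F.fastq.gz", "-p", "cut_R.fastq.gz",
  "raw_F.fastq.gz", "raw_R.fastq.gz"))
```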
I checked the presence of primers + additional bases before & after Cutadapt. Example for a sample in FWD:
Cutadapt removes the primers and everything before them, though I noticed some errors in the primers (e.g. read 3 above). Reads have length variation of 1-3 b as expected. There are some primers still present, but it's very minor compared to before cutadapting.
Despite length variation, do you think DADA2 is still suitable in my case?
I tried using `minLen = 200` in `filterAndTrim` on two of my samples, and the number of ASVs decreased 2-fold compared to when using `truncLen`. But I still obtained quite a high number of ASVs: 7453 non-chimeric sequences with those two samples.
Many thanks again for your insights
DADA2 can still work, but the more of those unremoved primers there are, the more issues are going to crop up. My concern is that despite what the `primerHits` function is saying, there clearly are unremoved primers on e.g. sq3. And as a very rough estimate, 1/10th of your reads may retain unremoved primers.
That said, sq3 points out another issue: That sequence has a large polyG tail that is a common error-type of two-color Illumina chemistries. There may be a significant amount of low complexity sequence "contamination" of your read set.
Try `plotComplexity("path/to/reads.fastq")` to look at this more closely. Is there a significant low-complexity mode (probably dominated by reads with large polyG tails)? Removing that set of reads might help as well.
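A sketch of both the diagnostic and the fix; the path is a placeholder, and the `rm.lowcomplex` threshold of 8 is illustrative rather than a recommendation (complexity is measured as an effective number of kmers):

```r
library(dada2)

# Visualize sequence complexity for one sample (hypothetical path)
plotComplexity("path/to/cut_F.fastq.gz")

# Drop low-complexity reads (e.g. polyG-dominated ones) during filtering
out <- filterAndTrim(fwd = "cut_F.fastq.gz", filt = "filt_F.fastq.gz",
                     rm.lowcomplex = 8, multithread = TRUE)
```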
Thanks a lot for coming back to my issue. plotComplexity (after cutadapting & filtering) gives me something quite good I think :
Would you still add `rm.lowcomplex` during `filterAndTrim`?
I did some extra tests: I added `maxLen = 250` to `filterAndTrim` so I could get rid of those 251 bp reads with unremoved primers. They represented only a small proportion, as I still kept ~80-90% of the reads.
However, after that, I still obtained most ASVs detected in a single sample.
I realized my NovaSeq data could be the problem due to the binned quality scores. I found issue #1307... My error rate estimates were indeed pretty bad, and characteristic of what other users obtained with NovaSeq data.
Do you think this could explain the inflated number of singletons in my data?
I will try to run the different modified error rate estimation functions and see what gives me the best estimates, and then try again on my test samples.
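One commonly shared workaround for NovaSeq binned quality scores (adapted from community discussion such as the Ernakovich lab tutorial linked later in this thread) is a modified error-estimation function; the sketch below only enforces monotonicity on top of the stock fit, while the full tutorial versions also alter the loess weights and span. `filtFs` is a placeholder for your filtered forward reads:

```r
library(dada2)

# Force error rates to be non-increasing as quality score increases,
# starting from the stock loess fit.
loessErrfun_mod <- function(trans) {
  err <- loessErrfun(trans)
  t(apply(err, 1, function(x) rev(cummax(rev(x)))))
}

errF <- learnErrors(filtFs, errorEstimationFunction = loessErrfun_mod,
                    multithread = TRUE)
plotErrors(errF, nominalQ = TRUE)  # check the fit visually
```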
Cheers, Chloe
I don't see any obvious explanation for having such a large number of single-sample ASVs from the diagnostics you've posted so far.
One thing I would look at is the relative abundance that is accounted for by these single-sample ASVs. Are they a bunch of really rare things? Or do they make up a decent fraction of the total reads in some or many samples?
I would probably also do some exploratory BLAST-ing of representative single-sample ASVs (and, as a comparative group, cosmopolitan ASVs). Does this suggest anything?
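The first check above can be sketched as follows, assuming `seqtab.nochim` is your chimera-filtered table with samples in rows:

```r
# Fraction of reads accounted for by single-sample ASVs
prevalence <- colSums(seqtab.nochim > 0)
single <- prevalence == 1

# Overall fraction of reads in single-sample ASVs
sum(seqtab.nochim[, single]) / sum(seqtab.nochim)

# Per-sample fraction
rowSums(seqtab.nochim[, single, drop = FALSE]) / rowSums(seqtab.nochim)
```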
Hi Benjamin,
I ran the different functions for error rates estimation and the best plots were obtained when altering loess arguments (weights and span) & enforcing monotonicity (below for FWD reads).
I proceeded over my 68 samples, and after merging I obtained ~700K ASVs, which is more reasonable. But I still have ~90% of those that are sample-specific.
However, those sample-specific ASVs didn't account for an important fraction of the total reads/sample.
Along the pipeline, read loss is totally acceptable I think, and I obtained ~200K non-chimeric ASVs.
I haven't been able to assign taxonomy over all of the 200K ASVs (our server crashes systematically), but I ran a test with 2 samples (~16K ASVs). ASVs that couldn't be assigned to a Phylum were all sample-specific, and 36% of the Phyla were found only in sample-specific ASVs. I did a bit of BLASTing and it matched.
At this stage, as you said, I don't see any further explanation for this observed ASVs distribution pattern across samples. Could it be due to the very high sequencing depth of NovaSeq technology, which may just detect rare taxa?
Would it be alright to move forward by getting rid of the sample-specific ASVs? That would give me 18,235 ASVs (present in at least 2 samples) to work with across my 68 samples.
Many thanks again! Chloe
Would it be alright to move forward by getting rid of the sample-specific ASVs? That would give me 18,235 ASVs (present in at least 2 samples) to work with across my 68 samples.
Yes, that kind of filtering is common and acceptable, especially since you've established that these account for a low fraction of total reads, and ruled out obvious computational mistakes as a cause. That said, remember to report that step in your eventual publication.
I'll go ahead and do that then ! Many thanks again for all your insights, it's been very helpful.
Hello @ChloePZS ,
I seem to have the same issue as you. I also had sequences containing poly-G tails, which I finally got rid of with cutadapt by adding the `--discard-untrimmed` flag and setting a minimum length.
Before the cutadapt step the complexity of my sequences was: Complexity_pre_cutadapt.pdf
After cutadapt and quality filtering: Complexity_post_cutadapt_post_filtering.pdf
My error models look like this:
Forward sequences:
Rev sequences:
Do you think I should be doing as you did? Should I alter the error function to get a better fit, especially for the Rev sequences?
My seqtab has the following dimensions:
> dim(seqtab)
[1] 162 121909
However, when I check the per-sample distribution of them, it looks like most of them belong to just one sample:
> ASV_sample
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 20 21 22 23 24 25 26 27 28 32
118645 2496 384 146 72 48 22 13 11 8 8 8 3 4 2 3 5 3 2 1 5 1 2 3 1 1 2 1
37 38 41 42 47 49 52 97
2 1 1 1 1 1 1 1
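A tabulation like the one above (number of ASVs found in exactly k samples) can be produced with, assuming `seqtab` has samples in rows:

```r
# Names are the number of samples an ASV occurs in;
# values are how many ASVs have that prevalence.
ASV_sample <- table(colSums(seqtab > 0))
ASV_sample
```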
I have tracked how many sequences were retained after each step and it looks like this:
> track
input filtered denoised merged tabled nonchim
LM10_R1.fastq.gz 357 278 275 252 252 252
LM100_R1.fastq.gz 11159 8575 8567 2563 2563 2563
LM102_R1.fastq.gz 10369 8722 8718 8696 8696 8612
LM103_R1.fastq.gz 1 1 1 1 1 1
LM103A_R1.fastq.gz 11886403 9226387 9206861 8419004 8419004 7106650
LM105_R1.fastq.gz 108963 87709 87340 75997 75997 73781
LM110_R1.fastq.gz 861 505 491 488 488 488
LM111_R1.fastq.gz 61133 37988 37676 33882 33882 32618
LM114_R1.fastq.gz 18342 11801 11796 8651 8651 8364
LM119_R1.fastq.gz 21097 15793 15651 15083 15083 14987
LM12_R1.fastq.gz 117912 69481 68922 65163 65163 64165
LM122_R1.fastq.gz 255725 187868 186570 173729 173729 168057
LM123_R1.fastq.gz 994 689 680 662 662 662
LM124_R1.fastq.gz 2136 1203 1202 1202 1202 1202
LM13_R1.fastq.gz 340 267 262 262 262 262
LM130_R1.fastq.gz 229 164 164 164 164 164
LM132_R1.fastq.gz 293561 41777 41776 30553 30553 30553
LM133_R1.fastq.gz 94190 72306 72060 71240 71240 71227
LM136_R1.fastq.gz 7423 5237 5213 3869 3869 3784
LM14_R1.fastq.gz 1983 1368 1368 1305 1305 1285
LM17_R1.fastq.gz 44319 34807 34676 33220 33220 32983
LM19_R1.fastq.gz 105516 17724 17375 15454 15454 14813
LM20_R1.fastq.gz 132439 4140 4074 3945 3945 3684
LM22_R1.fastq.gz 109995 69969 69838 69186 69186 69050
LM23_R1.fastq.gz 831582 545849 544730 362330 362330 323708
LM28_R1.fastq.gz 161875 117334 116824 113881 113881 112708
LM29_R1.fastq.gz 122056 93009 92985 90403 90403 89919
LM30_R1.fastq.gz 3044201 1665273 1657296 1454081 1454081 1154577
LM32_R1.fastq.gz 6541 4926 4803 4756 4756 4685
LM33_R1.fastq.gz 478981 388803 388437 373981 373981 368669
LM35_R1.fastq.gz 44474 35023 34997 34570 34570 34406
LM36_R1.fastq.gz 83485 66319 65805 63966 63966 59481
LM37_R1.fastq.gz 2115 1498 1471 1418 1418 1418
LM38_R1.fastq.gz 281 228 224 200 200 200
LM4_R1.fastq.gz 95891 72364 71950 70436 70436 66008
LM41_R1.fastq.gz 20 10 10 10 10 10
LM42_R1.fastq.gz 13948140 8821305 8796837 6979161 6979161 5810209
LM43_R1.fastq.gz 21672 16735 16601 16334 16334 16284
LM44_R1.fastq.gz 43683 31881 31608 29850 29850 29080
LM46_R1.fastq.gz 5 5 5 5 5 5
LM48_R1.fastq.gz 12915 8290 8224 7327 7327 7267
LM5_R1.fastq.gz 5158 3311 3276 3231 3231 2946
LM51_R1.fastq.gz 2392 1946 1915 1909 1909 1868
LM6_R1.fastq.gz 88478 59506 59380 42754 42754 40205
LM60_R1.fastq.gz 268 201 186 181 181 181
LM63_R1.fastq.gz 67672 47892 47459 45424 45424 44947
LM65_R1.fastq.gz 120337 98308 98280 85052 85052 80995
LM68_R1.fastq.gz 73 54 51 51 51 51
LM69_R1.fastq.gz 43707 34123 34085 33421 33421 33234
LM70_R1.fastq.gz 107388 75402 75004 72211 72211 70194
LM71_R1.fastq.gz 26549 17496 17422 8393 8393 8393
LM76_R1.fastq.gz 176 132 131 131 131 131
LM77_R1.fastq.gz 2491429 1670228 1661434 1526054 1526054 1308260
LM78_R1.fastq.gz 59838 33631 33549 19429 19429 19316
LM79_R1.fastq.gz 85 64 56 56 56 0
LM80_R1.fastq.gz 21505 14011 13817 13228 13228 13199
LM82_R1.fastq.gz 123635 90617 89892 85788 85788 84506
LM85_R1.fastq.gz 90983 67158 67104 65658 65658 64152
LM88_R1.fastq.gz 77925 59478 59256 56938 56938 54833
LM9_R1.fastq.gz 71149 56104 56016 55275 55275 55061
LM90_R1.fastq.gz 283033 57017 57013 43948 43948 43839
LM91_R1.fastq.gz 65094 47236 47104 44406 44406 44132
LM93_R1.fastq.gz 492 322 321 321 321 321
LM94_R1.fastq.gz 13349 9455 9413 7096 7096 7048
LM94A_R1.fastq.gz 83064 67845 67679 65063 65063 61663
LM98_R1.fastq.gz 76494 47908 47796 45207 45207 40951
LM99_R1.fastq.gz 110524 76566 76014 69505 69505 67534
LR1_R1.fastq.gz 339834 56085 56061 30794 30794 30794
LR10_R1.fastq.gz 28725 22921 22856 21107 21107 21075
LR101_R1.fastq.gz 1738 1358 1357 1331 1331 1331
LR103_R1.fastq.gz 532168 401712 400763 353847 353847 341575
LR106_R1.fastq.gz 48 34 32 32 32 32
LR108_R1.fastq.gz 86 72 72 53 53 53
LR109_R1.fastq.gz 27 19 19 19 19 19
LR112_R1.fastq.gz 17 14 10 10 10 10
LR113_R1.fastq.gz 65 51 49 49 49 49
LR114_R1.fastq.gz 30126 23068 23042 22503 22503 22500
LR115_R1.fastq.gz 958615 163307 162955 121643 121643 93273
LR116_R1.fastq.gz 25 16 10 10 10 10
LR117_R1.fastq.gz 27 20 6 6 6 6
LR119_R1.fastq.gz 2033 1456 1447 1217 1217 1217
LR121_R1.fastq.gz 7 4 3 0 0 0
LR127_R1.fastq.gz 97 74 71 68 68 68
LR129_R1.fastq.gz 190231 23602 23486 17633 17633 16806
LR13_R1.fastq.gz 303538 242208 241204 235019 235019 232592
LR130_R1.fastq.gz 1554363 219528 219281 149140 149140 111642
LR131_R1.fastq.gz 124397 88501 87956 72937 72937 71613
LR134_R1.fastq.gz 52051 8910 8905 6397 6397 5515
LR135_R1.fastq.gz 10533 6925 6863 6541 6541 6541
LR137_R1.fastq.gz 489 356 349 319 319 319
LR144_R1.fastq.gz 143259 6641 6498 5749 5749 5702
LR146_R1.fastq.gz 247 178 174 174 174 174
LR149_R1.fastq.gz 211 147 134 134 134 134
LR150_R1.fastq.gz 84671 11279 11276 6914 6914 6914
LR152_R1.fastq.gz 276831 37160 37140 32677 32677 32663
LR159_R1.fastq.gz 59 47 40 40 40 40
LR161_R1.fastq.gz 193625 30153 30148 14054 14054 14048
LR164_R1.fastq.gz 51 42 42 42 42 42
LR167_R1.fastq.gz 2743 1742 1695 1554 1554 1554
LR170_R1.fastq.gz 1500 1172 1172 1172 1172 1172
LR172_R1.fastq.gz 1752 1258 1251 444 444 444
LR173_R1.fastq.gz 3098 2397 2392 2164 2164 2164
LR174_R1.fastq.gz 121408 88411 88051 57769 57769 55322
LR175_R1.fastq.gz 56907 14476 14394 12316 12316 12291
LR178_R1.fastq.gz 164542 29879 29879 20121 20121 20121
LR18_R1.fastq.gz 40980 33717 33660 33481 33481 33360
LR180_R1.fastq.gz 5181 3995 3962 3771 3771 3730
LR181_R1.fastq.gz 396 96 94 87 87 87
LR182_R1.fastq.gz 123226 26010 26008 24282 24282 22154
LR184_R1.fastq.gz 231557 38981 38979 9730 9730 9730
LR185_R1.fastq.gz 263451 54144 54142 37174 37174 37150
LR188_R1.fastq.gz 628 411 404 170 170 170
LR189_R1.fastq.gz 56 29 24 17 17 17
LR19_R1.fastq.gz 213066 34021 34021 32659 32659 32588
LR190_R1.fastq.gz 147 109 103 103 103 103
LR191_R1.fastq.gz 22530 15144 15061 14205 14205 14198
LR193_R1.fastq.gz 174812 130680 129665 115371 115371 111797
LR197_R1.fastq.gz 296581 66017 66012 52334 52334 52052
LR20_R1.fastq.gz 77429 58220 58166 57353 57353 56950
LR201_R1.fastq.gz 55370 10049 10048 7731 7731 7591
LR23_R1.fastq.gz 366 290 282 279 279 279
LR25_R1.fastq.gz 22152 11928 11894 11683 11683 11588
LR27_R1.fastq.gz 529 410 408 376 376 376
LR3_R1.fastq.gz 231469 43031 42967 21313 21313 21307
LR30_R1.fastq.gz 315135 86456 86392 67340 67340 67107
LR33_R1.fastq.gz 830 614 604 604 604 568
LR36_R1.fastq.gz 2 2 2 2 2 2
LR37_R1.fastq.gz 251712 40892 40784 27720 27720 27408
LR38_R1.fastq.gz 107698 21720 21580 15762 15762 15582
LR39_R1.fastq.gz 37793 26378 26108 19881 19881 19689
LR40_R1.fastq.gz 2322366 351949 351675 255123 255123 198909
LR41_R1.fastq.gz 4662 3052 3028 2470 2470 2470
LR42_R1.fastq.gz 435529 53002 52670 33892 33892 31869
LR43_R1.fastq.gz 28 3 3 0 0 0
LR44_R1.fastq.gz 17 2 2 2 2 2
LR45_R1.fastq.gz 164 123 121 83 83 83
LR48_R1.fastq.gz 168176 138884 138777 136875 136875 134974
LR55_R1.fastq.gz 53 38 36 36 36 36
LR6_R1.fastq.gz 18 14 14 14 14 14
LR61_R1.fastq.gz 296161 72951 72891 60809 60809 60809
LR65_R1.fastq.gz 143595 24782 24618 7346 7346 7177
LR74_R1.fastq.gz 233096 23980 23638 11140 11140 9956
LR75_R1.fastq.gz 7258 5569 5540 5339 5339 5218
LR77_R1.fastq.gz 102832 73100 73075 70931 70931 70851
LR82_R1.fastq.gz 247289 52037 52033 34494 34494 34494
LR87_R1.fastq.gz 311674 51498 51484 43702 43702 43673
LR89_R1.fastq.gz 2840 1884 1819 1722 1722 1722
LR90_R1.fastq.gz 89830 69988 69945 69238 69238 68914
LR92_R1.fastq.gz 273343 24547 24533 18671 18671 18308
LR94_R1.fastq.gz 221774 14264 14254 6363 6363 6358
LR97_R1.fastq.gz 722 515 498 462 462 462
LR99_R1.fastq.gz 377573 75811 75810 50439 50439 50224
NK2_R1.fastq.gz 1 1 1 0 0 0
VK1_R1.fastq.gz 90959 64602 64331 62733 62733 61656
VK11_R1.fastq.gz 98570 52215 51814 46388 46388 45180
VK13_R1.fastq.gz 15964 11888 11878 8296 8296 8296
VK17_R1.fastq.gz 137504 112343 112117 109205 109205 108529
VK2_R1.fastq.gz 82691 57915 57731 50589 50589 49043
VK20_R1.fastq.gz 11326 8467 8372 8171 8171 8161
VK21_R1.fastq.gz 898693 699138 697027 661916 661916 639725
VK5_R1.fastq.gz 190 151 151 151 151 151
VK7_R1.fastq.gz 118274 88943 88657 76874 76874 75863
On average I retained 50% of the sequences, which looks similar to the figure you showed in the plot.
I was wondering how you filtered out those ASVs, and when? Did you do the filtering before the chimera removal or after?
Thank you for your help!
Hello @Guillermouceda, Is your data from NovaSeq? Because if that is the case, I would definitely recommend trying out modified error rate functions. There is a great tutorial out there: https://github.com/ErnakovichLab/dada2_ernakovichlab.
I am just very surprised to see such variability in your library sizes; how can some samples have 1 read?
For the singletons, I remove them after chimeric sequences have been filtered out, but only once I am certain that the pipeline is functioning correctly (e.g., proper read length, successful merging, etc.). However, I always verify that none of these ASVs represent an important fraction of the data, both in terms of % total reads and % reads within each sample.
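A sketch of that filtering step, with hypothetical object names (`seqtab.nochim` being the chimera-filtered table, samples in rows):

```r
# Keep only ASVs detected in at least 2 samples, after chimera removal.
prevalence <- colSums(seqtab.nochim > 0)
seqtab.final <- seqtab.nochim[, prevalence >= 2, drop = FALSE]

# Sanity check: what fraction of total reads was removed?
1 - sum(seqtab.final) / sum(seqtab.nochim)
```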
Hope this helps !
Cheers, Chloé
Hello @ChloePZS,
Thank you for your suggestions and for answering so quickly. I am new to dada2. Would you be willing to share the chunk of code with which you filtered out the singletons?
I will definitely check out that tutorial.
Cheers,
Guillermo
Hello, I have a V3-V4 16S rRNA metabarcoding dataset of 68 samples with a read depth of 400-900K reads/sample. Samples were sequenced on a NovaSeq 6000 PE250. Primers were removed with Cutadapt, sequences were all filtered, and I followed the workflow for big data. After the merging step, I end up with 1,298,725 ASVs.
I tried to run `removeBimeraDenovo` with default values and multithreading (12-core processor). But as the time estimate was over 18 days (and still increasing), I stopped the process. How could we optimize the processing time of the function?
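For context, a minimal sketch of the call in question (object names assumed):

```r
library(dada2)

# Default consensus-mode chimera removal, run across all available cores.
seqtab.nochim <- removeBimeraDenovo(seqtab, method = "consensus",
                                    multithread = TRUE, verbose = TRUE)
```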
Many thanks for your help!
Best, Chloe