benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0

Bad alloc/bad alignment #1662

Closed otaviolovison closed 1 year ago

otaviolovison commented 1 year ago

Dear dada2 community,

I am having memory issues while running the "Bioconductor workflow for microbiome data analysis: from raw reads to community analyses" and need some orientation. I have been using this workflow for more than 2 years and this is the first time I am having issues with this amazing workflow. Let me put it in context.

I am working with upper respiratory tract samples. Libraries were prepared following Illumina's protocol (16S V3V4), but sequenced with the MiSeq V2 kit (we know the MiSeq V3 kit is recommended and will be using it from now on, but we have also performed successful analyses with the V2 kit before). Since we had problems with the MiSeq, we have a batch effect on these samples, but I believe this is not the problem. I also believe we had good sequencing quality (plots attached; samples with low quality were removed: M49, M66...). RevQualityPlot.pdf FwdQualityPlot.pdf

Filtering and trimming were performed as follows:

for(i in seq_along(fnFs)) {
  # Filtered reads are written to separate filtFs/filtRs paths rather than
  # over the input files:
  fastqPairedFilter(c(fnFs[[i]], fnRs[[i]]), c(filtFs[[i]], filtRs[[i]]),
                    trimLeft=c(27,31), truncLen=c(240,240),
                    maxN=0, maxEE=2, truncQ=2, compress=TRUE)
}
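For reference, the same filtering can be expressed as a single call with dada2's newer filterAndTrim() interface; a minimal sketch, assuming the filtered outputs go to the filtFs/filtRs paths above:

library(dada2)
# One call over all samples; parameters mirror the loop above:
out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs,
                     trimLeft=c(27,31), truncLen=c(240,240),
                     maxN=0, maxEE=2, truncQ=2,
                     compress=TRUE, multithread=TRUE)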

I performed a pooled inference of sequences, which resulted in around 11 million sequences and more than 1 million unique sequences.

dadaFs <- dada(derepFs, err=ddF[[1]]$err_out, pool=TRUE)
dadaRs <- dada(derepRs, err=ddR[[1]]$err_out, pool=TRUE)
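At this scale (dozens of samples, over a million unique sequences), pseudo-pooling is dada2's documented lower-memory alternative to full pooling; a sketch only, not what was run here:

# pool="pseudo" approximates pooled sensitivity at a much lower memory cost:
dadaFs <- dada(derepFs, err=ddF[[1]]$err_out, pool="pseudo", multithread=TRUE)
dadaRs <- dada(derepRs, err=ddR[[1]]$err_out, pool="pseudo", multithread=TRUE)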

After merging, chimera removal, and trimming of non-target-length sequences, I ended up with what I think is a plausible number of ASVs.

mergers <- mergePairs(dadaFs, derepFs, dadaRs, derepRs)
seqtab.all <- makeSequenceTable(mergers[!grepl("Mock", names(mergers))])
seqtab <- removeBimeraDenovo(seqtab.all)  # chimera-free table, used below
table(nchar(getSequences(seqtab)))

If I am not wrong, my expected amplicon size is ~402 bp. The majority of my sequence lengths were between 398 and 409 bp (402 and 407 accounted for most of them), so I removed the sequences outside this range.

seqtab2 <- seqtab[,nchar(colnames(seqtab)) %in% seq(398,408)]

Reads through the pipeline:

sample  input   filtered  denoisedF  denoisedR  merged  nonchim
M03  213895  177214  176972  176990  174471  146119
M04  214250  177991  176814  177140  167873  63085
M05  111659  76484  76335  76409  75549  72923
M06  205858  144111  142696  143278  133717  66378
M07  221406  183339  182223  182781  175147  83000
M08  202433  171376  170365  170763  162915  100385
M10  260685  212201  211421  211882  206629  157737
M11  168531  120209  119097  119297  111884  51200
M13  225351  186372  185161  185649  177682  95916
M14  238618  193832  192852  193165  186893  121542
M15  255374  176133  175008  175574  167765  69808
M16  210136  139221  138490  138794  132123  50854
M17  173373  123043  122929  122965  121961  110472
M18  234219  161497  159773  159856  149661  90075
M20  205971  137406  136095  136592  129638  66840
M21  169008  105744  105091  104999  100811  72806
M22  173211  128590  127405  127632  119163  50332
M23  222669  176956  175712  175855  168136  121634
M24  253965  189089  187971  188081  179780  109824
M25  271688  197873  196825  196877  190451  110410
M26  214020  176747  175353  175750  165288  66787
M27  169323  121260  120503  120702  115835  77525
M28  98143  73598  73447  73461  72256  45415
M29  168769  114365  113325  113765  106913  57167
M30  79706  47161  46705  46747  43990  32545
M31  216586  174373  173449  173069  166518  115607
M32  245302  198261  195913  196865  180968  71785
M33  252679  174899  174184  174269  169890  114857
M34  165374  115868  115657  115624  113552  90644
M35  295397  241362  239965  240552  231041  117235
M36  179980  123456  122385  122763  116195  64997
M37  137855  88581  87982  88225  84790  55937
M38  221521  173301  171426  171445  157802  71829
M39  237677  165944  164886  164375  154693  68683
M40  138247  107340  106856  106980  103382  59637
M41  74104  57649  56946  57051  53156  22797
M42  250248  216338  215940  215957  212792  126272
M43  201777  165061  162960  163791  150705  76924
M44  194097  159804  158934  159243  153752  76524
M45  154146  115990  114709  115034  107032  67916
M46  98062  74028  73436  73698  69333  34707
M47  151414  114086  113620  113802  110812  68390
M50  160701  124570  124090  124157  121332  93739
M51  172661  125215  124349  124694  118074  78879
M52  171285  110213  109614  109786  104073  68427
M53  78156  57138  56807  56909  54413  30587
M54  207224  157448  157096  157168  155455  124792
M55  117487  89900  89137  89387  84053  39041
M56  150359  115705  115185  115323  112669  68167
M57  185730  147264  146206  146635  140884  83492
M58  168137  130837  130102  130269  124604  61054
M59  128055  103151  102778  102957  101100  60711
M60  115132  85510  84374  84719  76580  41714
M61  268359  219507  218663  218897  211526  120186
M62  152527  112127  111656  111649  107430  66822
M63  189507  143665  140768  141921  126686  74779
M64  164891  126167  125433  125563  120433  71083
M65  214833  135956  135166  135227  129453  93481
M66  67953  19342  19144  19203  17511  12326
M67  24965  19296  19093  19165  18226  14620
M68  368955  283223  281144  281786  268906  175704
M69  221485  167142  165543  165323  149716  70501
M70  237922  166857  165619  165967  155037  67710
M72  247257  198857  197425  198209  192654  140753
M73  241340  195173  194223  194673  188389  123507
M74  259993  199604  198181  198941  191798  124866
M75  447637  365839  361253  362044  336607  139179
M76  92208  59597  59080  59212  55080  30423
M77  261440  205708  203914  204524  191137  98587
M78  269973  222774  222089  222217  217919  164074
M80  232728  185294  182677  183727  169525  92942
M81  221378  177994  175379  176259  161605  88817
M82  87115  37924  37767  37810  36643  23878
M83  258835  193636  193044  193131  189052  139140
M84  100648  71908  71607  71642  69494  30952
M85  402056  309207  307966  308214  298330  186767
M86  72130  25457  25401  25424  25135  20839
M87  228186  180601  179640  179982  173012  90041
M88  135730  58092  57940  57960  56081  31948

I then performed taxonomy assignment with the eHOMD database formatted for dada2, available here: https://github.com/fconstancias/metabaRpipe/tree/master/databases

Then the problem started. I tried to perform sequence alignment as follows:

seqs <- getSequences(seqtab)
names(seqs) <- seqs # This propagates to the tip labels of the tree
mult <- msa(seqs, method="ClustalW", type="dna", order="input")

This resulted in a bad_alloc error.

Then I changed to DECIPHER alignment, as follows:

seqs <- getSequences(seqtab)
names(seqs) <- seqs # This propagates to the tip labels of the tree
alignment <- AlignSeqs(DNAStringSet(seqs), anchor=NA)

This ran perfectly and completed in less than 24 hours. I don't know how to check alignment quality, but I believe the problem is here (any suggestions on how to check alignment quality?).
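One way to at least eyeball the alignment is DECIPHER's own BrowseSeqs(), which renders it in a browser; a minimal sketch, subsetting first because tens of thousands of aligned ASVs are unwieldy to display (the subset size of 200 is arbitrary):

library(DECIPHER)
# Inspect a random subset; highlight=0 colors positions that differ
# from the consensus, making ragged regions easy to spot:
subset <- alignment[sample(length(alignment), 200)]
BrowseSeqs(subset, highlight=0)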

After this, I performed the following:

phang.align <- phyDat(as(alignment, "matrix"), type="DNA")
dm <- dist.ml(phang.align)

This also completed, but the resulting 'dm' object was extremely large (>18 GB). Until now, the objects had reasonable sizes. Compared with previous analyses executed with MSA, this is the first time I have had a 'dm' object of this size. But it is also possible that this is a feature of this specific dataset... I don't know.
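For what it's worth, a pairwise distance object scales quadratically with the number of ASVs, so multi-gigabyte sizes are expected at this scale; a back-of-the-envelope sketch, with the ~37,000 ASVs mentioned later in this thread as the only assumption:

n <- 37000
# 'dist' objects store the lower triangle of the pairwise matrix as doubles:
n * (n - 1) / 2 * 8 / 1e9  # ~5.5 GB for the distances alone
# A dense n-by-n working copy (e.g. during downstream tree building) is
# about twice that:
n * n * 8 / 1e9            # ~11 GB per full copy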

Then, when I tried to execute:

treeNJ <- NJ(dm)

I got the bad_alloc error again.

Then I decided to revisit the alignment. I trimmed the data a little more (398-408), which made the MSA alignment executable within our memory capacity. But the MSA alignment has now been running for 5 days, and I don't know how much longer it will take to finish. I have a close deadline and can't wait for this anymore.

My questions are:

1. Do you detect any preprocessing failures that could be compromising this analysis?
2. Is the 'dm' object really too large, or is this a common size for this object?
3. If this is a common size, do you have any suggestions on how to build a phylogenetic tree with it (on 64 GB of RAM)?
4. If this is not a common size for the 'dm' object, any suggestions on how to solve the DECIPHER alignment problem (if this really is an alignment problem)?

Feel free to ask for any additional information you need. Hope you can help me with this because I'm getting a little desperate here...

Best wishes.

benjjneb commented 1 year ago

Hi Otavio, can you repost this at the F1000 workflow GitHub issues page? https://github.com/spholmes/F1000_workflow/issues

Also, you may find some previous issues there that are relevant. I suspect the issue is that the alignment is too large for these R-based tools to handle at the phylogenetic step. Going outside R at that step, to something like RAxML, may be necessary; there should be some more discussion of that in the previous issues at that repo.
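The hand-off out of R could look like this; a minimal sketch, with an illustrative classic-RAxML invocation (binary name and flags vary by installed version):

library(Biostrings)
# Write the DECIPHER alignment (a DNAStringSet) to FASTA for external tools:
writeXStringSet(alignment, filepath="aligned_ASVs.fasta")
# Then, in the shell (example invocation only):
# raxmlHPC -s aligned_ASVs.fasta -m GTRGAMMA -n ASVtree -p 100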

otaviolovison commented 1 year ago

Of course! Thank you so much!

otaviolovison commented 1 year ago

Dear @benjjneb, I have some questions regarding my preprocessing steps and I'd be really grateful if you could help me with them:

benjjneb commented 1 year ago

> (I believe) an enormous number of sequences was generated from my data. I ended up with more than 37,000 ASVs... what could have caused that? Batch effect?

Could be many reasons. Lots of diversity in the environment. Deep sequencing and/or lots of samples. Varied sample types. In larger studies of diverse environments, 37k ASVs is not that unexpected.

> I have a great number of sequences outside the expected amplicon size. I removed the sequences outside 398-408 bp. Do you believe I should be more radical in this trimming? My expected amplicon size is 402 bp (since we used the MiSeq V2 kit for 16S V3V4...). I was thinking about 399-405 bp... What's your opinion on that?

No, there is biological length variation in the V3V4 locus. Most of the gain from length trimming is from removing the way-out-of-range sequences. Trying to make it really tight usually isn't advisable, as the loss of real ASVs starts to occur.

Also, be careful here, V3V4 has a bimodal length distribution in nature, with the two modes being almost 20 nts different in length. That is, you may (probably?) have a mode of real V3V4 sequences at about 20 nts longer than the window you are enforcing.
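A quick check for that second mode before fixing the retention window; a sketch only, with an intentionally wide, illustrative window:

# Full length distribution of the chimera-free table:
table(nchar(getSequences(seqtab)))
# A window wide enough to keep both natural V3V4 length modes:
seqtab.wide <- seqtab[, nchar(colnames(seqtab)) %in% 380:430]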

otaviolovison commented 1 year ago

Thank you so much!

I took a look at my data and I do have another group of sequences, around 382 bp. I will reprocess the data. Another question, please: I have lots of sequences outside the expected range. What might have caused that? I am thinking about the clean-up process... maybe the beads were not working?

benjjneb commented 1 year ago

> Another question, please: I have lots of sequences outside the expected range. What might have caused that? I am thinking about the clean-up process... maybe the beads were not working?

Unfortunately you will have to find someone with more knowledge on the wet lab side to guide you there. That is not my area of expertise.

otaviolovison commented 1 year ago

No problem! Thank you so much!