benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0

learnErrors memory full #1606

Closed ColdySnow closed 3 months ago

ColdySnow commented 1 year ago

Hello!

I have now tried several times to run "learnErrors" from DADA2 in R. Loading the package and every step before that worked fine. It runs for about an hour and gives the output "5485896480 total bases in 22857902 reads from 1 samples will be used for learning the error rates.". A little later it says "KILLED", and both the process and R are closed. I found out that this is because our server does not have enough memory (it has 62.5 GB).

We don't have any more memory/RAM. So what could we do? Is there any solution you can think of?

I appreciate any kind of help!

Best, Christin - Master Student from University of Cologne

benjjneb commented 1 year ago

Your samples are very deep for amplicon sequencing (~23M reads). Is this expected?

That sample depth is at the edge of what we've targeted, and it will require substantial memory and running time (both scale super-linearly with single-sample depth).

My best practical suggestion is to enforce more stringent filtering, as that is fairly effective at reducing the number of unique sequences in the data and therefore the memory/time requirements.
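
A stricter filtering step might look like the sketch below. It assumes single-end reads; the wrapper name `strict_filter` and the threshold values (`trunc_len = 200`, `max_ee = 1`) are placeholders for illustration, not recommendations from this thread. Suitable values depend on the read length and quality profile of the data.

```r
# Sketch only: the wrapper name and the default thresholds are assumptions.
strict_filter <- function(fq_in, fq_out, trunc_len = 200, max_ee = 1) {
  if (!requireNamespace("dada2", quietly = TRUE)) {
    stop("the dada2 package is required")
  }
  dada2::filterAndTrim(
    fq_in, fq_out,
    truncLen = trunc_len,  # truncate reads to this length, drop shorter ones
    maxEE = max_ee,        # discard reads with more expected errors than this
    truncQ = 2,            # truncate at the first base with quality <= 2
    maxN = 0,              # discard reads containing ambiguous bases
    rm.phix = TRUE,        # discard reads matching the phiX genome
    compress = TRUE,       # write gzipped output
    multithread = TRUE
  )
}

# Hypothetical usage; the file names are placeholders:
# out <- strict_filter("sample.fastq.gz", "filtered/sample_filt.fastq.gz")
```

Lowering `max_ee` or `trunc_len` discards or shortens more reads, which tends to reduce the number of unique sequences that later steps have to hold in memory.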

ColdySnow commented 1 year ago

Ah okay, thank you! I'll try it out.

The thing is that our institution normally uses OTUs. They put all samples together in one, everything marked with a specific barcode so that we can recognize later which sample each read belongs to. That's why we have only one, but therefore very large, amplicon sequencing file.

benjjneb commented 1 year ago

> They put all samples together in one, everything marked with a specific barcode so that we can recognize later which sample each read belongs to. That's why we have only one, but therefore very large, amplicon sequencing file.

Your best solution here is to use that barcode to separate the sequences into per-sample fastq files. There are a variety of solutions for this sort of "demultiplexing" out there, although their applicability depends on the specific barcoding that is being performed.
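
To illustrate the idea only, here is a minimal base-R sketch of demultiplexing. It assumes each read starts with its barcode, that all barcodes have the same length, and that only exact matches count. The function name `assign_samples`, the barcodes, and the reads are made up for the example. Real data also carries quality scores and may need mismatch tolerance, which a dedicated demultiplexing tool would handle.

```r
# Assign each read to a sample by an exact match of its leading barcode.
# Assumption: the barcode is the first bases of the read and all barcodes
# have the same length.
assign_samples <- function(reads, barcodes) {
  bc_len <- unique(nchar(barcodes))
  stopifnot(length(bc_len) == 1)
  prefix <- substr(reads, 1, bc_len)
  # match() returns NA for reads whose prefix is not a known barcode
  names(barcodes)[match(prefix, barcodes)]
}

barcodes <- c(sampleA = "ACGT", sampleB = "TTAG")    # hypothetical barcodes
reads <- c("ACGTGGCCTA", "TTAGCCGATA", "GGGGACGTAC") # hypothetical reads

# Group the reads per sample; reads without a known barcode are dropped
per_sample <- split(reads, assign_samples(reads, barcodes))
```

Each element of `per_sample` would then correspond to one per-sample fastq file.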

ColdySnow commented 1 year ago

Hello again, I did what you told me and demultiplexed our data. This worked so far, but all of the output turned out to be fastq.gz files. Just to be sure: the quality plotting and the learning of error rates only work with uncompressed fastq files, right? So I have to unzip all the fastq.gz files, correct?

Thank you for all your support, it really helps me!

benjjneb commented 1 year ago

All functions in the dada2 R package will read gzipped fastq files natively. No need to gunzip them first.
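
As a sketch of what that looks like in practice, the gzipped files can be listed and passed to dada2 directly. The helper `list_fastq` and the directory name `filtered` are assumptions for this example; `plotQualityProfile` and `learnErrors` are the dada2 functions for the two steps asked about.

```r
# Collect fastq files, gzipped or not, from a directory.
list_fastq <- function(path) {
  sort(list.files(path, pattern = "\\.fastq(\\.gz)?$", full.names = TRUE))
}

# Only runs when dada2 is installed and the (hypothetical) directory has files.
if (requireNamespace("dada2", quietly = TRUE)) {
  fns <- list_fastq("filtered")
  if (length(fns) > 0) {
    # Quality profile of the first file, read straight from the .gz
    print(dada2::plotQualityProfile(fns[1]))
    # Error model learned from the per-sample files, no gunzip step needed
    err <- dada2::learnErrors(fns, multithread = TRUE)
  }
}
```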