benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/

memory leak during learnErrors #1778

Open Adwyness opened 1 year ago

Adwyness commented 1 year ago

Hi Ben,

Using R/RStudio on Windows and looping a dada2 pipeline over several datasets is jamming up the non-paged pool memory. Individual datasets run with <15 GB, but after 4 or 5 datasets this obviously becomes an issue even on a big machine. A bit of testing shows non-paged memory increases during learnErrors() but cannot be released with gc() or rm() of any of the objects involved. A bit of reading shows memory not releasing properly to the OS is a well-known pain in the Rs. Have you any thoughts/experience on this? Will update with any progress.

Cheers! Ad
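For reference, a minimal sketch of the kind of loop I'm running (the `runs` directory layout, filename pattern, and output names are placeholders, not my actual code):

```r
library(dada2)

# Illustrative loop over several datasets; paths/patterns are placeholders.
datasets <- list.dirs("runs", recursive = FALSE)

for (run in datasets) {
  fns <- sort(list.files(run, pattern = "\\.fastq\\.gz$", full.names = TRUE))
  err <- learnErrors(fns, multithread = TRUE)   # non-paged pool grows here
  dd  <- dada(fns, err = err, multithread = TRUE)
  saveRDS(dd, file.path(run, "dada.rds"))

  # Attempting to release memory between datasets; neither call shrinks
  # the non-paged pool usage reported by Windows:
  rm(err, dd)
  gc()
}
```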

Adwyness commented 1 year ago

Update

It occurs during multithreading.

R version 4.3.0
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19045)
dada2 version: 1.28.0
RcppParallel version: 5.1.7

Test data: 52 x 300 bp PE MiSeq samples, 1.95 GB in total as fastq.gz. 11098880 total bases in 48256 reads from 1 sample used for learning the error rates. Seed set before each iteration.

learnErrors( inputfiles, nbases = 1e7, nreads = NULL, errorEstimationFunction = loessErrfun, multithread = x, randomize = FALSE, MAX_CONSIST = 10, OMEGA_C = 0, qualityType = "Auto", verbose = 1)

| Cores | Memory leaked (MB) | Time (s) |
| --- | --- | --- |
| FALSE | 4 | 324 |
| 2 | 42 | 374 |
| 4 | 102 | 401 |
| 8 | 139 | 233 |
| TRUE (12) | 398 | 243 |
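In case it's useful, here is a sketch of how numbers like these can be gathered. It assumes PowerShell and its standard `\Memory\Pool Nonpaged Bytes` performance counter; the `npp_bytes()` wrapper is hypothetical, not part of dada2:

```r
# Hypothetical helper: query the Windows non-paged pool size (bytes)
# through PowerShell's standard performance counter.
npp_bytes <- function() {
  cmd <- "(Get-Counter '\\Memory\\Pool Nonpaged Bytes').CounterSamples.CookedValue"
  out <- system2("powershell", c("-NoProfile", "-Command", shQuote(cmd)),
                 stdout = TRUE)
  as.numeric(out)
}

before <- npp_bytes()
t0 <- Sys.time()
err <- dada2::learnErrors(inputfiles, multithread = 4, verbose = 1)
elapsed_s <- as.numeric(difftime(Sys.time(), t0, units = "secs"))
leaked_mb <- (npp_bytes() - before) / 1024^2
```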

This problem increases substantially when inferring sequences, as that step is more memory-intensive:

dada(inputfiles, err = errors, multithread = x, verbose = 1)

| Cores | Memory leaked (MB) | Time (s) |
| --- | --- | --- |
| TRUE (12) | 3167 | 1480 |

When the non-paged pool (NPP) gets to a certain level (in my case ~4.5 GB on a 16 GB RAM laptop), R isn't crashing, but it takes a lot longer to process each sample. Re-running the above command increased the memory allocated to the NPP at the same rate as the first run (~2 MB/s) until it reached 4.5 GB, where it slowed to a relative trickle (~400 kB/s); per-sample processing went from ~30 s to ~180 s and seems to still be increasing (running it currently). As inferring sequences for the forward reads took 3.1 GB, that means restarting the R session to get the reverse reads done without processing slowing significantly.
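One workaround I'm considering, sketched below: run each dada() call in a throwaway child R process so the OS reclaims the process's non-paged allocations when the child exits. This uses the callr package (my assumption, not something dada2 provides); `filtFs`/`filtRs` and the `.rds` paths are placeholders:

```r
library(callr)

# Sketch: run dada() in a fresh child R process so its non-paged
# allocations are returned to the OS when the child exits.
run_dada <- function(files, err_rds, out_rds) {
  callr::r(
    function(files, err_rds, out_rds) {
      err <- readRDS(err_rds)
      dd  <- dada2::dada(files, err = err, multithread = TRUE, verbose = 1)
      saveRDS(dd, out_rds)
      NULL
    },
    args = list(files = files, err_rds = err_rds, out_rds = out_rds)
  )
}

# Placeholders: filtFs/filtRs are filtered read paths, errF.rds/errR.rds
# are error models previously saved from learnErrors().
run_dada(filtFs, "errF.rds", "dadaFs.rds")
run_dada(filtRs, "errR.rds", "dadaRs.rds")
```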

Apologies if this has now turned into a rehash of other Windows/Rcpp multithreading issues.

Cheers, Ad