benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0

Running DADA2 on clusters with the pool option -- runs out of memory #2061

Open emankhalaf opened 1 day ago

emankhalaf commented 1 day ago

Hello,

I am working on processing a large number of 16S sequence files generated with PacBio sequencing, specifically 647 and 944 files in two separate runs. I would like to use the pool option during the denoising step in DADA2. Currently, I am running the R script on a cluster with 200G of RAM.

As I am relatively new to using HPC and clusters, I started with a smaller memory allocation (64G) and gradually increased it as the script kept failing due to insufficient memory. When I consulted the cluster's technical support, I was advised to explore whether my code can be parallelized to use more cores, so that requesting additional CPUs might speed up the process.

Is parallelization supported in the DADA2 R package? If so, could you kindly guide me on how to implement it?

Below are the parameters I am using in my bash script to run the R script:

```bash
#SBATCH --time=0-48:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=200000M
```
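
For reference, if DADA2 can use multiple cores, I imagine the request would change to something like the following, keeping the total memory the same but spreading it across CPUs (this is just my guess at the right direction, not something I have tested):

```bash
#SBATCH --time=0-48:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
# Request 8 CPUs on one node; SLURM multiplies --mem-per-cpu by
# --cpus-per-task, so 8 x 25000M keeps the total at ~200G.
#SBATCH --cpus-per-task=8
#SBATCH --mem-per-cpu=25000M
```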

Your help is greatly appreciated! Thanks

benjjneb commented 1 day ago

Running with pool=TRUE requires that all samples be loaded into memory at once to construct the "pool" sample that is then analyzed. Thus, there is no way to split the job across nodes in a way that would reduce the memory requirement.
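
Schematically, the pooled step looks like this (a minimal sketch; here filts is the vector of filtered fastq files and err the learned error model from the standard workflow):

```r
library(dada2)

# Dereplicating all samples up front puts every sample in memory at once;
# with many hundreds of PacBio samples this is where memory runs out.
drps <- derepFastq(filts, verbose=TRUE)

# pool=TRUE combines all samples into one "pool" before denoising,
# so the whole dataset must be resident. multithread=TRUE parallelizes
# within a single node, but does not reduce the memory footprint.
dds <- dada(drps, err=err, pool=TRUE, multithread=TRUE)
```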

The pseudo-pooling approach is our path forward when pool=TRUE gets too large for available memory. It is not a perfect match, but our testing shows that it approximates pool=TRUE while only needing to load one sample into memory at a time. More info here: https://benjjneb.github.io/dada2/pseudo.html
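
In code it is a one-argument change; passing the filenames directly also lets dada() read samples one at a time rather than holding a full dereplicated list in memory (sketch, with filts and err as above):

```r
# Pseudo-pooling: two rounds of independent per-sample inference,
# with the second round informed by priors learned in the first.
dds <- dada(filts, err=err, pool="pseudo", multithread=TRUE)
```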