benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0

Running DADA2 on clusters with the pool option -- runs out of memory #2061

Open emankhalaf opened 1 day ago

emankhalaf commented 1 day ago

Hello,

I am working on processing a large number of 16S sequence files generated with PacBio sequencing, specifically 647 and 944 files in two separate runs. I would like to use the pool option during the denoising step in DADA2. Currently, I am running the R script on a cluster with 200G of RAM.

As I am relatively new to using HPC and clusters, I started with a smaller memory allocation (64G) and gradually increased it as the script kept failing due to insufficient memory. When I consulted the cluster's technical support, I was advised to explore whether my code can be parallelized to use more cores, so that requesting additional CPUs might speed up the process.

Is parallelization supported in the DADA2 R package? If so, could you kindly guide me on how to implement it?

Below are the parameters I am using in my bash script to run the R script:

```bash
#SBATCH --time=0-48:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=200000M
```
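
For reference, if DADA2 can use multiple cores, I imagine the request would change to something like the following, keeping the total memory the same but spreading it across CPUs (this is just my guess at the right direction, not something I have tested):

```bash
#SBATCH --time=0-48:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
# Request 8 CPUs on one node; SLURM multiplies --mem-per-cpu by
# --cpus-per-task, so 8 x 25000M keeps the total at ~200G.
#SBATCH --cpus-per-task=8
#SBATCH --mem-per-cpu=25000M
```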

Your help is greatly appreciated! Thanks

benjjneb commented 1 day ago

Running with pool=TRUE requires that all samples be loaded into memory at once to construct the "pool" sample that is then analyzed. Thus, there is no way to split the job across nodes in a way that would reduce the memory requirement.
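
Schematically, the pooled step looks like this (a minimal sketch; here filts is the vector of filtered fastq files and err the learned error model from the standard workflow):

```r
library(dada2)

# Dereplicating all samples up front puts every sample in memory at once;
# with many hundreds of PacBio samples this is where memory runs out.
drps <- derepFastq(filts, verbose=TRUE)

# pool=TRUE combines all samples into one "pool" before denoising,
# so the whole dataset must be resident. multithread=TRUE parallelizes
# within a single node, but does not reduce the memory footprint.
dds <- dada(drps, err=err, pool=TRUE, multithread=TRUE)
```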

The pseudo-pooling approach is our path forward when pool=TRUE gets too large for available memory. It is not a perfect match, but our testing shows that it approximates pool=TRUE while only needing to load one sample into memory at a time. More info here: https://benjjneb.github.io/dada2/pseudo.html
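
In code it is a one-argument change; passing the filenames directly also lets dada() read samples one at a time rather than holding a full dereplicated list in memory (sketch, with filts and err as above):

```r
# Pseudo-pooling: two rounds of independent per-sample inference,
# with the second round informed by priors learned in the first.
dds <- dada(filts, err=err, pool="pseudo", multithread=TRUE)
```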