emankhalaf opened this issue 1 day ago
Running with `pool=TRUE` requires that all samples be loaded into memory at once to construct the "pool" sample that is then analyzed, so there is no way to break the job apart across different nodes in a way that would reduce memory usage.

The pseudo-pooling approach is our path forward when `pool=TRUE` gets too large for available memory. It is not a perfect match, but our testing shows that it approximates `pool=TRUE` while only needing to load one sample into memory at a time. More info here: https://benjjneb.github.io/dada2/pseudo.html
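For concreteness, here is a minimal sketch of that workflow in R, assuming filtered PacBio CCS reads sitting in a hypothetical `filtered/` directory; the `PacBioErrfun` error model and `BAND_SIZE=32` follow the published DADA2 PacBio workflow, and you should adjust paths and patterns for your own data:

```r
library(dada2)

## Filtered PacBio CCS files, one per sample (path/pattern are placeholders)
filts <- list.files("filtered", pattern = "\\.fastq\\.gz$", full.names = TRUE)

## Learn error rates with the PacBio error model, using all allocated cores
err <- learnErrors(filts, errorEstimationFunction = PacBioErrfun,
                   BAND_SIZE = 32, multithread = TRUE)

## pool="pseudo" approximates pool=TRUE but processes samples one at a time,
## so only a single sample's reads need to be held in memory at once
dd <- dada(filts, err = err, pool = "pseudo",
           BAND_SIZE = 32, multithread = TRUE)

seqtab <- makeSequenceTable(dd)
```

Note that `multithread=TRUE` is dada2's built-in form of parallelization: it spreads work across the cores of a single node, so under SLURM it pairs with a larger `--cpus-per-task`, not with additional nodes.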
Hello,
I am working on processing a large number of 16S sequence files generated with PacBio sequencing, specifically 647 and 944 files in two separate runs. I would like to use the pool option during the denoising step in DADA2. Currently, I am running the R script on a cluster with 200 GB of RAM.
As I am relatively new to HPC clusters, I started with a smaller memory allocation (64 GB) and gradually increased it as the script kept failing due to insufficient memory. The cluster's technical support advised me to explore whether my code can be parallelized, so that I could request additional CPUs and potentially speed up the process.
Is parallelization supported in the DADA2 R package? If so, could you kindly guide me on how to implement it?
Below are the parameters I am using in my bash script to run the R script:
```
#SBATCH --time=0-48:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=200000M
```
Your help is greatly appreciated! Thanks.