MorrissyLab / mosaicMPI

mosaicMPI: mosaic multi-resolution program integration
MIT License

slurm array problematic in workflow manager #26

Open whtns opened 1 month ago

whtns commented 1 month ago

Hi, I'm struggling to manage factorization in the snakemake workflow management system. I had been running without parallelization because I was getting errors when running with more than one CPU. I've come to realize that the problem is due to the use of job arrays in the factorization step. Unfortunately, snakemake doesn't work well with slurm arrays: https://github.com/jdblischak/smk-simple-slurm/tree/main/examples/job-array. I'm now running the factorization step via the CLI directly in bash scripts.

I'd like to keep running the whole thing in snakemake. Do you have tips for adapting without use of arrays?

verheytb commented 1 month ago

Just for the sake of clarity, there are two levels of parallelization:

When you factorize using mosaicMPI (via either the command line or the Python API), it will automatically use all available CPUs on the machine, because mosaicMPI relies on scikit-learn/numpy modules which control this behaviour. You can reduce this to a smaller number of threads/CPUs if desired: https://scikit-learn.org/stable/computing/parallelism.html. By default, both interfaces use all CPUs and queue up jobs on a single machine only.
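The scikit-learn parallelism docs linked above describe capping this low-level parallelism via environment variables, which must be set before numpy/scikit-learn are imported. A minimal sketch (the value 4 is an arbitrary example):

```python
import os

# Cap BLAS/OpenMP thread pools BEFORE importing numpy or scikit-learn;
# these variables are read once at library load time.
os.environ["OMP_NUM_THREADS"] = "4"
os.environ["OPENBLAS_NUM_THREADS"] = "4"
os.environ["MKL_NUM_THREADS"] = "4"

import numpy as np  # BLAS operations now use at most 4 threads
```

Equivalently, the same variables can be exported in the shell (or in a Snakemake rule's `shell` directive) before invoking `mosaicmpi`.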

To achieve high-level parallelism, mosaicMPI allows a user to spread the computation across multiple workers. A 'worker' refers to a computer or separate HPC node, each of which will use all CPUs on that computer/node. As a convenience for users submitting directly to SLURM, a script can be used to automatically submit an array job. However, this may not fit everyone's needs, so the fallback is to control directly what work needs to be done, using the total_workers and worker_index parameters.
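To illustrate how two parameters like these can partition work with no coordination between workers, here is a hypothetical round-robin sketch; the `jobs_for_worker` helper and the job tuples are illustrative assumptions, not mosaicMPI's actual internal scheme:

```python
def jobs_for_worker(jobs, total_workers, worker_index):
    """Round-robin shard: worker i takes every total_workers-th job,
    starting at offset i. Workers need only know their own index."""
    return jobs[worker_index::total_workers]

# Hypothetical job list: (rank k, replicate r) factorization tasks.
all_jobs = [(k, r) for k in (2, 3, 4) for r in range(5)]
shards = [jobs_for_worker(all_jobs, 4, i) for i in range(4)]

# Every job lands on exactly one worker, with no overlap.
assert sorted(j for shard in shards for j in shard) == sorted(all_jobs)
```

Because each worker derives its shard purely from `(total_workers, worker_index)`, the four invocations can run on four separate nodes without any shared state.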

For example, for a dataset called test_dataset and the output directory set to output_directory, you could factorize across 4 nodes using the following commands, each submitted as a separate SLURM job (instead of an array job):

mosaicmpi factorize -n test_dataset -o output_directory --total_workers 4 --worker_index 0
mosaicmpi factorize -n test_dataset -o output_directory --total_workers 4 --worker_index 1
mosaicmpi factorize -n test_dataset -o output_directory --total_workers 4 --worker_index 2
mosaicmpi factorize -n test_dataset -o output_directory --total_workers 4 --worker_index 3

You would just need to iterate over the worker_index values in snakemake.
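One way to do that iteration is to make worker_index a wildcard and expand over it. A minimal Snakefile sketch, with the caveat that the `worker_{i}.done` sentinel files are a hypothetical convenience (if mosaicMPI writes predictable per-worker output files, tracking those directly would be more robust):

```python
TOTAL_WORKERS = 4

rule all:
    input:
        expand("output_directory/worker_{i}.done", i=range(TOTAL_WORKERS))

rule factorize_worker:
    output:
        # Hypothetical sentinel marking that worker {i} finished.
        touch("output_directory/worker_{i}.done")
    params:
        total=TOTAL_WORKERS
    shell:
        "mosaicmpi factorize -n test_dataset -o output_directory "
        "--total_workers {params.total} --worker_index {wildcards.i}"
```

With a SLURM executor/profile configured, snakemake then submits each worker_index as an independent job, sidestepping array jobs entirely.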