FNNDSC / pman

A process management system written in python
MIT License

Multi-node parallelism with number_of_workers #193

Open jennydaman opened 2 years ago

jennydaman commented 2 years ago

number_of_workers can be a way to support embarrassingly parallel jobs on multi-node compute environments.

How can a process identify which replica it is? It needs to know so that the workload can be divided, e.g. in plugin code:

if WORKER_NUMBER == 1:
    process('1.png')
elif WORKER_NUMBER == 2:
    process('2.png')
...
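
A more general version of that idea, as a minimal sketch: instead of one branch per worker, each replica takes every n-th input file. The environment variable names `WORKER_NUMBER` (1-based) and `NUMBER_OF_WORKERS` are hypothetical; pman does not currently set them, and the helper names are only for illustration.

```python
import os


def select_shard(items, worker_number, number_of_workers):
    """Return the items that the given 1-based worker should process."""
    if not 1 <= worker_number <= number_of_workers:
        raise ValueError(
            f'worker_number={worker_number} is outside 1..{number_of_workers}'
        )
    # Sort so every replica sees the same order, then take every n-th item
    # starting at this worker's offset.
    return sorted(items)[worker_number - 1::number_of_workers]


def worker_from_env(environ=os.environ):
    """Read the (hypothetical) variables; default to a single worker."""
    # ASSUMPTION: WORKER_NUMBER and NUMBER_OF_WORKERS are made-up names.
    worker_number = int(environ.get('WORKER_NUMBER', '1'))
    number_of_workers = int(environ.get('NUMBER_OF_WORKERS', '1'))
    return worker_number, number_of_workers
```

In a plugin this could be used as `select_shard(os.listdir(inputdir), *worker_from_env())`, so the same code runs unchanged whether there is one replica or many.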

The equivalent concept in SLURM is a job array.

https://slurm.schedmd.com/job_array.html

e.g.

sbatch --array=1-4 job.sh

Four instances of job.sh will be executed, possibly on different compute nodes, and each instance will have the environment variable SLURM_ARRAY_TASK_ID set to 1, 2, 3, or 4.
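
For reference, this is roughly how a script running inside a SLURM job array can find its position. It is a sketch, not pman code: `slurm_array_position` is a hypothetical helper, and it assumes the array was submitted as a contiguous range with step 1. `SLURM_ARRAY_TASK_ID`, `SLURM_ARRAY_TASK_MIN`, and `SLURM_ARRAY_TASK_COUNT` are variables that SLURM sets for array tasks.

```python
import os


def slurm_array_position(environ=os.environ):
    """Return (zero_based_index, task_count) for this array task.

    Falls back to (0, 1) when not running inside a SLURM job array.
    ASSUMPTION: the array is a contiguous range with step 1.
    """
    task_id = environ.get('SLURM_ARRAY_TASK_ID')
    if task_id is None:
        return 0, 1
    # SLURM_ARRAY_TASK_MIN is the lowest index in the array (1 for --array=1-4).
    task_min = int(environ.get('SLURM_ARRAY_TASK_MIN', task_id))
    task_count = int(environ.get('SLURM_ARRAY_TASK_COUNT', '1'))
    return int(task_id) - task_min, task_count
```

If pman exposed an equivalent pair of values to each replica, plugin code could stay the same across SLURM and other compute backends.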

pman should do something similar.