InstituteforDiseaseModeling / idmtools

https://docs.idmod.org/projects/idmtools/en/latest/

slurm platform -- unable to run from compute nodes as expected #2145

Open ckirkman-IDM opened 1 year ago

ckirkman-IDM commented 1 year ago

NYU is unable to use recent idmtools-platform-slurm code while logged into a compute node, e.g. after starting an interactive session with `srun --nodes=1 --ntasks-per-node=1 --time=04:00:00 --partition=a100_dev --pty bash -i`.

They do this to limit the impact that running calibration (and similar workloads) has on their head/login nodes.

Currently, jobs are configured in idmtools.ini to run on a different partition (not the a100_dev partition above):

partition = cpu_short

However, running their jobs AFTER starting such an interactive session leads to the error shown in the image below. The error occurs at the experiment level (in an experiment directory's stderr.txt).

The REALLY funky thing is that ONE simulation runs and completes successfully while the others fail, despite having this in my platform instantiation:

calib_manager.platform = Platform(args.platform, max_running_jobs=1000000, array_batch_size=1000000)
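
For reference, here is a minimal sketch of how that platform is being set up. The block name and job directory below are illustrative, not NYU's actual values, and I'm assuming the usual behavior where keyword arguments passed to Platform() override the matching idmtools.ini block:

```python
from idmtools.core.platform_factory import Platform

# Illustrative sketch only: "NYU_SLURM" stands in for the actual idmtools.ini
# block (type = SLURM, partition = cpu_short), and the job directory is a
# placeholder. calib_manager comes from the calibration script, as above.
platform = Platform(
    "NYU_SLURM",
    job_directory="/scratch/<user>/calibration",  # where experiment/simulation folders are written
    partition="cpu_short",                        # same partition as in idmtools.ini
    max_running_jobs=1000000,
    array_batch_size=1000000,
)
calib_manager.platform = platform
```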

I have verified that their slurm version is 23.02.3

From: https://harvardmed.atlassian.net/wiki/spaces/O2/pages/1586793613/Troubleshooting+Slurm+Jobs

[image: screenshot of the error]

Is there a way around this based on how the platform code is written? Or will there be a limitation requiring users to srun (as above) into the partition they intend to run on?

devclinton commented 1 year ago

I lean toward limiting users to srun for now. I almost think this is more a symptom of calibration needing a rethink to be more stateless and less resource-intensive.

We should also look into whether custom sbatch commands could help here.
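
For example, something along these lines is what I have in mind. The sbatch_custom argument is an assumption on my part; I haven't confirmed the Slurm platform currently exposes such a hook, so treat this as a sketch of the idea rather than working code:

```python
from idmtools.core.platform_factory import Platform

# Sketch of the "custom sbatch options" idea. sbatch_custom is an ASSUMED
# parameter name; if the platform does not already expose something like it,
# this is roughly the shape such an option could take. "--export=NONE" is
# sometimes suggested when submitting from inside another job's allocation,
# so the child job does not inherit the interactive session's SLURM_* environment.
platform = Platform(
    "NYU_SLURM",              # illustrative idmtools.ini block name
    partition="cpu_short",
    sbatch_custom="--export=NONE",
)
```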

Bridenbecker commented 4 months ago

From David Kaftan at NYU

kaftand commented 4 months ago

If someone assigns this issue to me, I'd be happy to work on it! (just don't want to duplicate work)