InstituteforDiseaseModeling / idmtools

https://docs.idmod.org/projects/idmtools/en/latest/

slurm platform -- unable to run from compute nodes as expected #2145

Open ckirkman-IDM opened 1 year ago

ckirkman-IDM commented 1 year ago

NYU is unable to use recent idmtools-platform-slurm code while logged into a compute node, e.g. after starting an interactive session with `srun --nodes=1 --ntasks-per-node=1 --time=04:00:00 --partition=a100_dev --pty bash -i`.

They do this to limit the impact that running calibration (and similar workloads) has on their head/login nodes.

Currently, jobs are configured in idmtools.ini to run on a different partition (not the a100_dev partition above):

partition = cpu_short

However, running their jobs AFTER starting such an interactive session leads to the error shown in the image below. The error occurs at the experiment level (in an experiment directory's stderr.txt).

The REALLY funky thing is that ONE simulation runs and completes successfully while the others fail, despite having this in my platform instantiation:

calib_manager.platform = Platform(args.platform, max_running_jobs=1000000, array_batch_size=1000000)
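
For reference, here is a minimal sketch of how that platform is being set up. The block name and job directory below are illustrative, not NYU's actual values, and I'm assuming the usual behavior where keyword arguments passed to Platform() override the matching idmtools.ini block:

```python
from idmtools.core.platform_factory import Platform

# Illustrative sketch only: "NYU_SLURM" stands in for the actual idmtools.ini
# block (type = SLURM, partition = cpu_short), and the job directory is a
# placeholder. calib_manager comes from the calibration script, as above.
platform = Platform(
    "NYU_SLURM",
    job_directory="/scratch/<user>/calibration",  # where experiment/simulation folders are written
    partition="cpu_short",                        # same partition as in idmtools.ini
    max_running_jobs=1000000,
    array_batch_size=1000000,
)
calib_manager.platform = platform
```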

I have verified that their slurm version is 23.02.3

From: https://harvardmed.atlassian.net/wiki/spaces/O2/pages/1586793613/Troubleshooting+Slurm+Jobs

[image: screenshot of the error]

Is there a way around this based on how the platform code is written? Or will there be a limitation requiring users to srun (as above) into the partition they intend to run on?

devclinton commented 1 year ago

I lean toward limiting users to srun for now. I almost think this is more a symptom of calibration needing a rethink to be more stateless and less resource-intensive.

We should also look into whether custom sbatch commands could help here.
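
For example, something along these lines is what I have in mind. The sbatch_custom argument is an assumption on my part; I haven't confirmed the Slurm platform currently exposes such a hook, so treat this as a sketch of the idea rather than working code:

```python
from idmtools.core.platform_factory import Platform

# Sketch of the "custom sbatch options" idea. sbatch_custom is an ASSUMED
# parameter name; if the platform does not already expose something like it,
# this is roughly the shape such an option could take. "--export=NONE" is
# sometimes suggested when submitting from inside another job's allocation,
# so the child job does not inherit the interactive session's SLURM_* environment.
platform = Platform(
    "NYU_SLURM",              # illustrative idmtools.ini block name
    partition="cpu_short",
    sbatch_custom="--export=NONE",
)
```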

Bridenbecker commented 4 months ago

From David Kaftan at NYU

kaftand commented 4 months ago

If someone assigns this issue to me, I'd be happy to work on it! (just don't want to duplicate work)