ckirkman-IDM opened this issue 1 year ago
I lean toward limiting users to srun for now. I almost think this is more a symptom of calibration needing a rethink to be more stateless and less resource-intensive.
We should also look at whether custom sbatch commands could help here.
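If we do end up steering users toward a particular workflow, one low-cost idea (just a sketch, not existing idmtools behavior; the function name and warning text here are made up) would be to detect when the platform is being instantiated from inside an existing SLURM allocation, e.g. an interactive srun session, and warn if that allocation's partition differs from the one the platform will submit to:

```python
import os
import warnings


def warn_if_nested_allocation(configured_partition: str) -> None:
    """Warn when running inside an existing SLURM allocation (e.g. an
    interactive srun session) whose partition differs from the partition
    the platform is configured to submit to.

    SLURM_JOB_ID and SLURM_JOB_PARTITION are standard variables that SLURM
    exports inside an allocation; this helper itself is only a sketch.
    """
    job_id = os.environ.get("SLURM_JOB_ID")
    if not job_id:
        return  # not inside an allocation (e.g. running on a login node)

    current_partition = os.environ.get("SLURM_JOB_PARTITION", "<unknown>")
    if current_partition != configured_partition:
        warnings.warn(
            f"Running inside SLURM job {job_id} on partition "
            f"'{current_partition}', but the platform is configured to "
            f"submit to '{configured_partition}'. Submitting jobs from "
            "inside an interactive srun session may fail or be limited to "
            "the enclosing allocation's resources."
        )
```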
From David Kaftan at NYU:
If someone assigns this issue to me, I'd be happy to work on it! (just don't want to duplicate work)
NYU is unable to use recent idmtools-platform-slurm code while logged into a compute node, e.g. srun --nodes=1 --ntasks-per-node=1 --time=04:00:00 --partition=a100_dev --pty bash -i
They do this to limit the impact that running calibration, etc., has on their head/login nodes.
Currently, jobs are run on a different partition (not a100_dev, above), as specified in idmtools.ini.
However, running their jobs AFTER starting such an interactive session leads to the error shown in the attached image. The error occurs at the experiment level (in the experiment directory's stderr.txt).
The REALLY funky thing is that I get ONE simulation to run/complete successfully while the others fail, despite having this in my platform instantiation:
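The specific setting referenced here is not shown, but for orientation a file-based SlurmPlatform instantiation generally looks something like the sketch below; the alias, job_directory, partition name, and limits are placeholder assumptions rather than NYU's real configuration, and the same settings can also be supplied through idmtools.ini.

```python
from idmtools.core.platform_factory import Platform

# Placeholder sketch: the alias, path, partition name, and limits below are
# illustrative assumptions, not the configuration referenced in this issue.
platform = Platform(
    "SLURM_LOCAL",
    job_directory="/scratch/example/experiments",  # where experiment/simulation dirs are written
    partition="cpu_medium",                        # submission partition (not a100_dev)
    nodes=1,
    max_running_jobs=10,                           # cap on concurrently running simulations
)
```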
I have verified that their Slurm version is 23.02.3.
From: https://harvardmed.atlassian.net/wiki/spaces/O2/pages/1586793613/Troubleshooting+Slurm+Jobs
Is there a way around this based on how the platform code is written? Or will there be a limitation requiring users to, e.g., srun (as above) into the partition they intend to run on?