Azure / cyclecloud-slurm

Azure CycleCloud project to enable users to create, configure, and use Slurm HPC clusters.
MIT License

Scheduler running out of memory when launching a large (100+) number of nodes simultaneously #260

Open jhrmnn opened 2 weeks ago

jhrmnn commented 2 weeks ago

CycleCloud version: 8.6.2-3276
Slurm version: 22.05.11

AFAIK, CycleCloud's prolog script calls get_acct_info.sh, which in turn calls azslurm accounting_info, and this happens for each launched node. I'm observing that each invocation of azslurm accounting_info takes ~150 MB of memory, so when hundreds of nodes launch simultaneously the scheduler can easily run out of memory (at ~150 MB per invocation, 100 concurrent launches already consume roughly 15 GB).
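
In case anyone wants to reproduce the measurement: GNU time's peak-RSS report is a quick way to check the per-invocation footprint. Just a sketch; azslurm accounting_info is invoked here without arguments, which may differ from how get_acct_info.sh actually calls it.

```bash
# Measure the peak resident set size of one accounting lookup.
# /usr/bin/time -v (GNU time) prints a "Maximum resident set size"
# line on stderr; that is the per-invocation memory cost (~150 MB here).
/usr/bin/time -v azslurm accounting_info 2>&1 | grep 'Maximum resident set size'
```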

Currently I'm mitigating this by commenting out the call to get_acct_info.sh in the prolog script.
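
For reference, a sketch of that workaround. It assumes the prolog is whatever PrologSlurmctld points at in slurm.conf, and that get_acct_info.sh is invoked on an uncommented line inside it; adjust to your actual script layout.

```bash
# Locate the slurmctld prolog from slurm.conf, then comment out the
# get_acct_info.sh call so each job launch no longer spawns an
# ~150 MB azslurm accounting_info process. A .bak backup is kept.
PROLOG=$(awk -F= '/^PrologSlurmctld/{print $2}' /etc/slurm/slurm.conf)
sudo sed -i.bak 's|^\([^#].*get_acct_info\.sh.*\)|#\1|' "$PROLOG"
```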

aditigaur4 commented 2 weeks ago

We are definitely working on improving that prolog script in the next release. But I just wanted to clarify that it runs on every job launch, not every node launch. Also, if you don't rely on the azslurm cost feature, we suggest simply commenting out the PrologSlurmctld line in /etc/slurm/slurm.conf; that way the script won't run at all.
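
In slurm.conf terms that looks something like the following (the prolog path shown is a placeholder; keep whatever value your file already has):

```bash
# /etc/slurm/slurm.conf -- disable the accounting prolog entirely
# (only if you don't use the azslurm cost feature):
#PrologSlurmctld=/path/to/prolog.sh   # placeholder path; keep your own value

# Then tell slurmctld to re-read its config
# (or restart slurmctld if the change doesn't take effect):
sudo scontrol reconfigure
```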

jhrmnn commented 2 weeks ago

Thanks for confirming! Looking forward to the next release.

> But I just wanted to clarify that it runs on every job launch, not every node launch.

That makes sense; I'm running 1-node jobs, so the distinction wasn't clear to me.