Closed gwolski closed 5 months ago
From https://slurm.schedmd.com/sbatch.html
NOTE: Environment variables will override any options set in a batch script, and command line options will override any environment variables.
My intent was to make sure that defaults exist, but not to override variables set in the batch files. I will take those out of the modulefile and see if I can come up with a better way to making sure that defaults are specified. Probably the most important one is the time limit because you don't want to allow users to submit a job and then have it hang and consume resources until someone notices.
I would argue it's not our job to protect the user from these mistakes. They will notice in time and probably only make the mistake once.
Unfortunately I think it's a poor design choice on the part of slurm that environment variables override script variables. I would have designed it so the order of precedence is environment variables, script variables, command line variables. This allows for defaults, a script to have better values, and if someone needs to do a one-off to try something, the command line for sbatch can override things. But then again, they didn't ask me and I've unable to glean their intent for this choice.
You get one vote from me to not have defaults.
According to the slurm sbatch documentation:
NOTE: Environment variables will override any options set in a batch script, and command line options will override any environment variables.
The module load feature sets certain variables that I don't think it should. I would suggest we not set any of the variables - this should be left to the user or their scripts. I use scripts to submit with #SBATCH lines and they are ignored if environment variables are set.
The one that gave me grief today is the SBATCH_MEM_PER_NODE=100M
I would suggest we not set these, rather document their existence for users to modify if they want.):
Here are the set variables I found ( I replaced my cluster name with
SLURM_CPUS_PER_TASK=1 SLURM_TIMELIMIT=1:0:0 SLURM_CPUS_PER_TASK_SET= SLURM_CLUSTER_NAME=
SLURM_TIMELIMIT_SET=
SLURM_CONF=/opt/slurm//etc/slurm.conf
SLURM_MEM_PER_NODE_SET=
SLURM_PARTITION=batch
SLURM_PARTITION_SET=
SBATCH_PARTITION=batch SBATCH_MEM_PER_NODE_SET= SBATCH_TIMELIMIT_SET= SBATCH_PARTITION_SET= SBATCH_REQUEUE_SET= SBATCH_TIMELIMIT=1:0:0 SBATCH_REQUEUE= SBATCH_MEM_PER_NODE=100M
The setting of these variables is not documented on the documentation pages, nor how to to modify them. Can we modify them somehow as part of the source that generates the module files?
I found them set in
/opt/slurm//config/modules/modulefiles/Rocky/8/x86_64//3.8.0
and I believe the source is in:
./source/resources/playbooks/roles/ParallelClusterHeadNode/templates/opt/slurm/modules/modulefiles/slurm/.template