aws-samples / aws-eda-slurm-cluster

AWS Slurm Cluster for EDA Workloads
MIT No Attribution
28 stars 7 forks source link

module load <clusterName> sets environment variables that override values in the sbatch submission script. #223

Closed gwolski closed 5 months ago

gwolski commented 5 months ago

According to the slurm sbatch documentation:

NOTE: Environment variables will override any options set in a batch script, and command line options will override any environment variables.

The module load feature sets certain variables that I don't think it should. I would suggest we not set any of the variables - this should be left to the user or their scripts. I use scripts to submit with #SBATCH lines and they are ignored if environment variables are set.

The one that gave me grief today is the SBATCH_MEM_PER_NODE=100M

I would suggest we not set these, rather document their existence for users to modify if they want.
Here are the set variables I found ( I replaced my cluster name with ):

SLURM_CPUS_PER_TASK=1 SLURM_TIMELIMIT=1:0:0 SLURM_CPUS_PER_TASK_SET= SLURM_CLUSTER_NAME= SLURM_TIMELIMIT_SET= SLURM_CONF=/opt/slurm//etc/slurm.conf SLURM_MEM_PER_NODE_SET= SLURM_PARTITION=batch SLURM_PARTITION_SET=

SBATCH_PARTITION=batch SBATCH_MEM_PER_NODE_SET= SBATCH_TIMELIMIT_SET= SBATCH_PARTITION_SET= SBATCH_REQUEUE_SET= SBATCH_TIMELIMIT=1:0:0 SBATCH_REQUEUE= SBATCH_MEM_PER_NODE=100M

The setting of these variables is not documented on the documentation pages, nor how to to modify them. Can we modify them somehow as part of the source that generates the module files?

I found them set in

/opt/slurm//config/modules/modulefiles/Rocky/8/x86_64//3.8.0

and I believe the source is in:

./source/resources/playbooks/roles/ParallelClusterHeadNode/templates/opt/slurm/modules/modulefiles/slurm/.template

cartalla commented 5 months ago

From https://slurm.schedmd.com/sbatch.html

NOTE: Environment variables will override any options set in a batch script, and command line options will override any environment variables.

My intent was to make sure that defaults exist, but not to override variables set in the batch files. I will take those out of the modulefile and see if I can come up with a better way to making sure that defaults are specified. Probably the most important one is the time limit because you don't want to allow users to submit a job and then have it hang and consume resources until someone notices.

gwolski commented 5 months ago

I would argue it's not our job to protect the user from these mistakes. They will notice in time and probably only make the mistake once.

Unfortunately I think it's a poor design choice on the part of slurm that environment variables override script variables. I would have designed it so the order of precedence is environment variables, script variables, command line variables. This allows for defaults, a script to have better values, and if someone needs to do a one-off to try something, the command line for sbatch can override things. But then again, they didn't ask me and I've unable to glean their intent for this choice.

You get one vote from me to not have defaults.