esm-tools / esm_tools

Simple Infrastructure for Earth System Simulations
https://esm-tools.github.io/
GNU General Public License v2.0

sbatch: error: --nodes is incompatable with --distribution=arbitrary #1148

Open joakimkjellsson opened 8 months ago

joakimkjellsson commented 8 months ago

Good afternoon all

SLURM was recently updated on HLRN (glogin). This happened on 1 Feb 2024, and the update was from v23.02.7 to v23.11.3. Since the update, the following command

esm_runscripts focioifs-pictrl-restart-glogin.yaml

returns the error

sbatch: error: --nodes is incompatable with --distribution=arbitrary 

I've traced the problem to the heterogeneous parallelisation: if omp_num_threads is 1 or unset for all model components, the problem does not occur.
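For illustration, this is the kind of setup I mean, i.e. at least one component requesting more than one OpenMP thread (the component names and thread counts below are purely illustrative, not my actual configuration):

oifs:
    omp_num_threads: 4
xios:
    omp_num_threads: 1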

What is very strange is that esm_runscripts is unable to submit jobs, but running it with the -c option and then submitting the job manually, i.e.

sbatch <script name>.run

works fine.
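In other words, this two-step sequence works (a sketch of what I do, using the -c option to prepare the run without submitting it):

esm_runscripts focioifs-pictrl-restart-glogin.yaml -c
sbatch <script name>.run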

I emailed back and forth with HLRN support and they had an idea of what causes the problem, but no solution. In the end, they asked how our SLURM scripts are generated, which of course is not easy to explain since it depends on several functions split across multiple files in ESM-Tools ;-)

Their full reply is attached, but the TL;DR is:

1) Heterogeneous parallelization forces the creation of a hostfile, which is pointed to by SLURM_HOSTFILE.
2) The existence of SLURM_HOSTFILE forces --distribution=arbitrary:...
3) SBATCH --nodes tries to set the number of nodes, but that conflicts with the SLURM_HOSTFILE, so SLURM has no idea what to do.
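A minimal sketch of that sequence as I understand it (the hostfile path and node count below are made up for illustration; the error line is the one quoted above):

# 1) heterogeneous OpenMP settings make ESM-Tools write a hostfile and point SLURM at it
export SLURM_HOSTFILE=$PWD/hostfile_srun     # illustrative path

# 2) with SLURM_HOSTFILE set at submission time, SLURM implies --distribution=arbitrary

# 3) the generated .run script still requests an explicit node count in its header,
#    e.g. "#SBATCH --nodes=12", and the updated sbatch rejects that combination:
sbatch <script name>.run
# sbatch: error: --nodes is incompatable with --distribution=arbitrary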

As HLRN are unable to solve this problem, I'm wondering if anyone (maybe @mandresm ?) has some idea on how to run heterogeneous parallelisation without the SLURM_HOSTFILE. As far as I know, the problem only applies to glogin, but it could become a problem on other machines in the future. Or maybe it has already been solved on other machines and we can port that solution to glogin?

I don't expect anyone to be able to solve this issue, so I'm posting it here mostly to make people aware of the problem. I'm tagging @seb-wahl and @nwieters so that they are aware as well.

Best wishes,
Joakim

hlrn_reply.txt

pgierz commented 8 months ago

Hi @joakimkjellsson,

well, that the -c works without problems doesn't surprise me. That will just get you all the way to generating the idealised SLURM script and then stop without actually using it. If you then submit by hand, you have full control over what you actually tell SLURM to do beyond the options pre-defined in the header. The "automatic" submission might then do more things, unfortunately without asking you first.

I was unfortunately not involved in implementing the submission for jobs using the heterogeneous options, but that is certainly in the wheelhouse of @JanStreffing, I believe. Maybe he can give some hints.

If it's critical, one of our team will look into it in more detail, just let us know!

Best,
Paul

JanStreffing commented 8 months ago

If you are running a very modern version of SLURM, I would try to turn taskset off and use hetjob:

computer:
    taskset: false

In this case no hostfile is used; instead a joined multi-line command for srun is created and sent off.

joakimkjellsson commented 8 months ago

@JanStreffing do you have an example of how to do this? For me, setting

computer: 
    taskset: true

or false made no difference in how the SLURM runscript was generated. Maybe the geomar_dev branch has diverged too far from release? What branch are you using?

JanStreffing commented 8 months ago

That is surprising. I would have thought that you would get something like:


#!/bin/bash
#SBATCH --time=00:20:00
#SBATCH --output=/work/ab0246/a270092/runtime/awicm3-v3.2//notaskset/log/notaskset_awicm3_compute_20000101-20000101_%j.log --error=/work/ab0246/a270092/runtime/awicm3-v3.2//notaskset/log/notaskset_awicm3_compute_20000101-20000101_%j.log
#SBATCH --job-name=notaskset
#SBATCH --account=ab0246
#SBATCH --mail-type=NONE
#SBATCH --exclusive
#SBATCH --nodes=1
#SBATCH --partition=compute
#SBATCH hetjob
#SBATCH --nodes=1
#SBATCH --partition=compute
#SBATCH hetjob
#SBATCH --nodes=1
#SBATCH --partition=compute
#SBATCH hetjob
#SBATCH --nodes=1
#SBATCH --partition=compute

.
.
.
echo $(date +"%a %b  %e %T %Y") : compute 1 2000-01-01T00:00:00 3485944 - start >> /work/ab0246/a270092/runtime/awicm3-v3.2//notaskset/log//notaskset_awicm3.log

cd /work/ab0246/a270092/runtime/awicm3-v3.2//notaskset/run_20000101-20000101/work/
time srun -l \
--hint=nomultithread  --nodes=1 --ntasks=128 --ntasks-per-node=128 --cpus-per-task=1 --export=ALL,OMP_NUM_THREADS=1 ./fesom \
: --hint=nomultithread  --nodes=1 --ntasks=128 --ntasks-per-node=128 --cpus-per-task=1 --export=ALL,OMP_NUM_THREADS=1 ./oifs -v ecmwf -e awi3 \
: --hint=nomultithread  --nodes=1 --ntasks=1 --ntasks-per-node=1 --cpus-per-task=128 --export=ALL,OMP_NUM_THREADS=128 ./rnfma \
: --hint=nomultithread  --nodes=1 --ntasks=4 --ntasks-per-node=4 --cpus-per-task=32 --export=ALL,OMP_NUM_THREADS=32 ./xios.x  2>&1 &

And no hostfile in the work folder at all.

seb-wahl commented 8 months ago

I merged release into geomar_dev a few weeks ago, so this shouldn't be the problem.

mandresm commented 6 months ago

Hi @joakimkjellsson,

By setting computer.taskset: False either in the runscript or in focioifs.yaml you should get what @JanStreffing showed for the .run file. I am using this on glogin and I get exactly that. Can you give it a second try and let me know how it goes?
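To be explicit, the runscript fragment I mean looks something like this (the general block is just illustrative context):

general:
    setup_name: focioifs
    # ...

computer:
    taskset: false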

Another important point: at the time we implemented the hetjobs support, the hetjob separator was called packjob on some systems and hetjob on others. In the new SLURM versions this should always be hetjob, so I have just updated both glogin and blogin here: #1176. If you want to get these changes in your branch without having to merge, you can simply do git cherry-pick -x fc481884e2454a037cad43cd5436c29a1fa55ae3.
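For example (a sketch; adjust the branch name to whichever branch you are actually running from):

git checkout geomar_dev    # or your own working branch
git fetch origin           # make sure the commit above is available locally
git cherry-pick -x fc481884e2454a037cad43cd5436c29a1fa55ae3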

Please let me know if that solves the problem.