Open joakimkjellsson opened 8 months ago
Hi @joakimkjellsson,
well, it doesn't surprise me that the `-c` option works without problems. That will just get you all the way to generating the idealised SLURM script and then stop without actually using it. If you then submit by hand, you have full control over what you actually tell SLURM to do beyond the options pre-defined in the header. The "automatic" submission might then do more things, unfortunately without asking you first.
I was unfortunately not involved in implementing the submission for jobs using the heterogeneous options, but that is certainly in the wheelhouse of @JanStreffing, I believe. Maybe he can give some hints.
If it's critical, one of our team will look into it in more detail; just let us know!

Best,
Paul
If you are running a very modern version of SLURM, I would try to turn taskset off and use hetjob:

```yaml
computer:
    taskset: false
```

In this case no hostfile is used; instead a joined multi-line command for `srun` is created and sent off.
@JanStreffing do you have an example of how to do this? For me, setting

```yaml
computer:
    taskset: true
```

or `false` made no difference in how the SLURM runscript was generated.
Maybe the `geomar_dev` branch is too far from `release`? What branch are you using?
That is surprising. I would have thought that you get something like:

```shell
#!/bin/bash
#SBATCH --time=00:20:00
#SBATCH --output=/work/ab0246/a270092/runtime/awicm3-v3.2//notaskset/log/notaskset_awicm3_compute_20000101-20000101_%j.log --error=/work/ab0246/a270092/runtime/awicm3-v3.2//notaskset/log/notaskset_awicm3_compute_20000101-20000101_%j.log
#SBATCH --job-name=notaskset
#SBATCH --account=ab0246
#SBATCH --mail-type=NONE
#SBATCH --exclusive
#SBATCH --nodes=1
#SBATCH --partition=compute
#SBATCH hetjob
#SBATCH --nodes=1
#SBATCH --partition=compute
#SBATCH hetjob
#SBATCH --nodes=1
#SBATCH --partition=compute
#SBATCH hetjob
#SBATCH --nodes=1
#SBATCH --partition=compute
...
echo $(date +"%a %b %e %T %Y") : compute 1 2000-01-01T00:00:00 3485944 - start >> /work/ab0246/a270092/runtime/awicm3-v3.2//notaskset/log//notaskset_awicm3.log
cd /work/ab0246/a270092/runtime/awicm3-v3.2//notaskset/run_20000101-20000101/work/
time srun -l \
    --hint=nomultithread --nodes=1 --ntasks=128 --ntasks-per-node=128 --cpus-per-task=1 --export=ALL,OMP_NUM_THREADS=1 ./fesom \
    : --hint=nomultithread --nodes=1 --ntasks=128 --ntasks-per-node=128 --cpus-per-task=1 --export=ALL,OMP_NUM_THREADS=1 ./oifs -v ecmwf -e awi3 \
    : --hint=nomultithread --nodes=1 --ntasks=1 --ntasks-per-node=1 --cpus-per-task=128 --export=ALL,OMP_NUM_THREADS=128 ./rnfma \
    : --hint=nomultithread --nodes=1 --ntasks=4 --ntasks-per-node=4 --cpus-per-task=32 --export=ALL,OMP_NUM_THREADS=32 ./xios.x 2>&1 &
```

And no hostfile in the work folder at all.
I merged `release` into `geomar_dev` a few weeks ago, so this shouldn't be the problem.
Hi @joakimkjellsson,

By setting `computer.taskset: False` either in the runscript or in the `focioifs.yaml`, you should get what @JanStreffing showed for the `.run` file. I am using this in `glogin` and I get it. Can you give it a second try and let me know how it goes?
Another important point: at the time we implemented the hetjobs support, the hetjob separator was called `packjobs` on some systems and `hetjobs` on others. In the new versions this should always be `hetjobs`, so I have just updated both `glogin` and `blogin` here: #1176. If you want to get these changes in your branch without having to merge, you can simply do `git cherry-pick -x fc481884e2454a037cad43cd5436c29a1fa55ae3`.

Please, let me know if that solves the problem.
Good afternoon all,

SLURM was recently updated on HLRN (glogin). This happened 1 Feb 2024, and the update was from v23.02.7 to v23.11.3. Since the update, the following command
returns the error
I've traced the problem to the heterogeneous parallelisation, i.e. if `omp_num_threads` is 1 or unset for all model components, the problem does not happen.

What is very strange is that `esm_runscripts` is unable to submit jobs, but running it with the `-c` option and then submitting the job manually works fine.
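For context, the manual workaround above looks roughly like the following sketch (the runscript and experiment names are placeholders I invented; only the `-c` flag and the manual `sbatch` step come from this thread):

```shell
# Step 1: let esm_runscripts generate the batch script only, without submitting
# (-c stops before submission, as described above; names are hypothetical):
esm_runscripts my_runscript.yaml -e my_experiment -c

# Step 2: submit the generated .run script by hand:
sbatch my_experiment_compute_20000101-20000101.run
```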
I emailed back and forth with HLRN support and they had an idea of what causes the problem, but no solution. In the end, they asked how our SLURM scripts are generated, which of course is not easy to explain since it depends on several functions split across multiple files in ESM-Tools ;-)
Their full reply is attached, but the TL;DR is:

1. Heterogeneous parallelization forces the creation of a hostfile, which is pointed to by `SLURM_HOSTFILE`.
2. The existence of `SLURM_HOSTFILE` forces `--distribution=arbitrary:...`.
3. `SBATCH --nodes` tries to set the number of nodes, but that conflicts with the `SLURM_HOSTFILE`, so SLURM has no idea what to do.

As HLRN are unable to solve this problem, I'm wondering if anyone (maybe @mandresm?) has some idea on how to run heterogeneous parallelisation without the `SLURM_HOSTFILE`. As far as I know, the problem only applies to glogin, but it could become a problem on other machines in the future. Or maybe it has already been solved on other machines and we can port that solution to glogin?

I don't expect anyone to be able to solve this issue, so I'm posting it here mostly to make people aware of the problem. I'm tagging @seb-wahl and @nwieters so that they are aware as well.
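To make HLRN's points (1)-(3) concrete, here is a minimal, hypothetical illustration of the hostfile mechanism (the node names and file name are invented; the actual hostfile ESM-Tools writes may differ):

```shell
# With taskset enabled, a hostfile lists one node name per task, e.g. two
# tasks on each of two nodes (hypothetical node names):
cat > hostfile_srun <<'EOF'
node001
node001
node002
node002
EOF

# Exporting SLURM_HOSTFILE makes srun take task placement from this file,
# which implies --distribution=arbitrary ...
export SLURM_HOSTFILE=$PWD/hostfile_srun
# ... and that explicit placement can then conflict with the node count
# requested via "#SBATCH --nodes" in the batch header, which appears to be
# the clash described above.

echo "hostfile lists $(wc -l < hostfile_srun) task slots"
```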
Best wishes,
Joakim

Attachment: hlrn_reply.txt