Open · joakimkjellsson opened this issue 3 months ago
Hi @joakimkjellsson,
did you try setting computer.launcher to mpirun? You can do that in your runscript. That will swap out your srun <OPTIONS> to use mpirun instead.
I'd need to look more deeply into how to set the actual options. That would need a code change.
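For reference, such an override in a runscript might look roughly like the following minimal sketch; the nesting under a computer block is an assumption, and the two keys are the ones Joakim quotes in his reply below:

computer:
    launcher: mpirun
    launcher_flags: ""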
Hi @pgierz
Sorry, I forgot to mention this. So if I do that (launcher: mpirun and launcher_flags: ""), my launch command becomes:
time mpirun $(cat hostfile_srun) 2>&1 &
so it would use mpirun but give the executables in the format expected by srun.
At the moment, hostfile_srun is:
0-287 ./oifs -e ECE3
288-719 ./oceanx
720-739 ./xios.x
740-740 ./rnfma
but I would need it to be:
-np 288 ./oifs -e ECE3 : -np 432 ./oceanx : -np 20 ./xios.x : -np 1 ./rnfma
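To make the format difference concrete, here is a small stand-alone sketch, purely for illustration, that translates srun-style rank ranges into the colon-separated mpirun form. It is not the code that ends up in ESM-Tools, which builds the string from the config instead:

def srun_hostfile_to_mpirun(path):
    """Translate 'START-END ./exe [args]' lines into an mpirun '-np N ./exe : ...' string."""
    segments = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            # e.g. "0-287 ./oifs -e ECE3" -> rank_range="0-287", command="./oifs -e ECE3"
            rank_range, command = line.split(maxsplit=1)
            start, end = (int(r) for r in rank_range.split("-"))
            segments.append("-np %d %s" % (end - start + 1, command))
    return " : ".join(segments)

# srun_hostfile_to_mpirun("hostfile_srun") would return
# "-np 288 ./oifs -e ECE3 : -np 432 ./oceanx : -np 20 ./xios.x : -np 1 ./rnfma"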
The function write_one_hostfile in mpirun.py seems to do that, but it never gets called. Almost as if someone started working on this but never finished ;-)
I would like to have two functions, write_one_hostfile_srun and write_one_hostfile_mpirun, and have some kind of if statement in slurm.py to choose which one to use.
/J
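A minimal sketch of that kind of dispatch, with hypothetical structure and key lookup (the actual change is in the commit linked further down in this thread):

def write_one_hostfile(self, hostfile, config):
    """Pick the hostfile writer that matches the configured launcher (hypothetical sketch)."""
    # assumption: the launcher name is reachable under the computer section of config
    launcher = config.get("computer", {}).get("launcher", "srun")
    if launcher == "mpirun":
        self.write_one_hostfile_mpirun(hostfile, config)
    else:
        self.write_one_hostfile_srun(hostfile, config)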
@joakimkjellsson What branch are you on? I'll start from that one, should be quick enough to program.
@pgierz no worries. I've already coded it in. My main question was whether someone had already done it or was planning to do it, in which case I would not do it :-)
I renamed the old write_one_hostfile to write_one_hostfile_srun and made a new write_one_hostfile:
def write_one_hostfile(self, hostfile, config):
    """
    Gathers previously prepared requirements
    (batch_system.calculate_requirements) and writes them to ``self.path``.
    Suitable for the mpirun launcher.
    """
    # make an empty string which we will append commands to
    mpirun_options = ""
    for model in config["general"]["valid_model_names"]:
        end_proc = config[model].get("end_proc", None)
        start_proc = config[model].get("start_proc", None)
        print(' model ', model)
        print(' start_proc ', start_proc)
        print(' end_proc ', end_proc)
        # a model component like oasis3mct does not need cores
        # since it is technically a library,
        # so start_proc and end_proc will be None. Skip it.
        if start_proc is None or end_proc is None:
            continue
        # number of cores needed
        no_cpus = end_proc - start_proc + 1
        print(' no_cpus ', no_cpus)
        if "execution_command" in config[model]:
            command = "./" + config[model]["execution_command"]
        elif "executable" in config[model]:
            command = "./" + config[model]["executable"]
        else:
            continue
        # append one "-np N ./exe" segment per component
        mpirun_options += " -np %d %s :" % (no_cpus, command)
    mpirun_options = mpirun_options[:-1]  # remove trailing ":"
    with open(hostfile, "w") as hostfile_handle:
        hostfile_handle.write(mpirun_options)
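For the four-component layout in the hostfile_srun shown earlier, this should write something like -np 288 ./oifs -e ECE3 : -np 432 ./oceanx : -np 20 ./xios.x : -np 1 ./rnfma (give or take surrounding whitespace) into the hostfile, which the mpirun launch command then expands via $(cat ...).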
I've already made a few test runs and it seems to work. I'll do some more tests. Then it will end up in the feature/blogin-rockylinux9 branch, where I'm trying to get FOCI-OpenIFS running on glogin.
/J
Perfect, thanks for figuring that out. Let us know when you are ready to merge, and we can see whether the write_one_hostfile function can be generalized further.
I made the change to slurm.py: https://github.com/esm-tools/esm_tools/commit/058fcf9892048e6833efc66a92aa9bc8d74d1f70#diff-0c204676837e94ca027f7a61a71d27914ea3a6b8071d5d3dc4c7791dfa5eb15b
When Sebastian is back we might do some cleaning etc. and then merge this fix branch into geomar_dev. Then that can be merged into release.
Cheers! /J
Good afternoon all
glogin (GWDG Emmy) has undergone some hardware and software upgrades recently. Since the upgrade, I find jobs launched with srun are considerably slower than jobs launched with mpirun. The support team recommends mpirun, so I'd like to use mpirun.
But I can't work out if ESM-Tools can do it. There is an mpirun.py file with a function to write a hostfile for mpirun, but as far as I can see this function is never used. If we use SLURM, then it seems that ESM-Tools will always build a hostfile_srun and then launch with srun.
My idea would be to have something like this in slurm.py: where line 65 currently handles only the srun case, it should be able to pick either launcher, and then the two functions would be slightly different. One benefit of mpirun would be that heterogeneous parallelisation becomes very easy, since each executable can get its own -np block, although I'm not sure and would have to double-check exactly how it should be done on glogin.
Before I venture down this path though, I just want to check: is it already possible to use mpirun and I'm just too dense to figure out how? If not, is someone else already working on a similar solution?
Cheers Joakim