Open tsemmler05 opened 2 months ago
Hi @tsemmler05,
If I am understanding Paul's message correctly, what he has done is to disable all the taskset logic that we use to enable heterogeneous parallelization. Not having heterogeneous parallelization available means you cannot use different numbers of OpenMP threads for the different binaries, which is in principle not desired: you want some OpenMP threads for OpenIFS, a different number for xios, and a different one for FESOM (possibly 1 for old versions of FESOM-2).
If you are okay with doing essentially what he did, you can simply set all omp_num_threads entries in all the components of the runscript to 1. This should remove any trace of taskset in your script. Then you can check whether that simulation gives the expected speeds, comparable with Levante.
If it does, then we can try to see what the problem with the taskset approach is, or even try SLURM's hetjobs feature, which is meant to do precisely this kind of heterogeneous parallelization. You can enable that by setting computer.taskset: False. Until not so long ago SLURM had problems with hetjobs, which is why @JanStreffing and I opted for the taskset approach as the default, but in the future hetjobs should slowly become our preference, if it works.
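A minimal sketch of how the two options could look in the runscript yaml (the component names and nesting here are assumptions for illustration, based only on the keys mentioned above, not copied from a real runscript):
# Option 1: run every component with a single OpenMP thread, so no taskset logic is needed
oifs:
    omp_num_threads: 1
fesom:
    omp_num_threads: 1
xios:
    omp_num_threads: 1
# Option 2: keep per-component thread counts but switch from taskset to SLURM hetjobs
computer:
    taskset: False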
Can you share with us a path to those files edited by Paul, and also to one of the really slow runs/simulations?
@JanStreffing, can you please have a look at the message from Paul Dando, to see whether you understand it the same way I do?
I can confirm that Paul's method only works for OpenMP threads = 1. Regarding sharing the super-slow simulation, I have to regenerate it because of the ECMWF policy of removing data from /scratch after one month.
Actually, without carrying out the super slow simulation, I generated two directories to compare:
/scratch/duts/runtime/awicm3-v3.1/restlev960/run_21010101-21010103/scripts
/scratch/duts/runtime/awicm3-v3.1/restsuperslow/run_21010101-21010103/scripts
The first one is using Paul Dando's method, the second one the original method with taskset. You can compare the run scripts in the given directories as well as the prog_*.sh scripts in the corresponding work directories.
The question is whether the taskset problem can be fixed, in case it proves difficult to use hetjob on ECMWF Atos.
This is the reply from Paul Dando (trying to fix the taskset problem):
Hi Tido,
I'm surprised you can't have different OMP_NUM_THREADS for different executables or that you need to use the hetjobs option. What you can do with srun is to pass different numbers of tasks and threads to each executable. See, for example, the heterogeneous and MPMD example in HPC2020: Submitting a parallel job:
srun -N1 -n 64 -c 2 executable1 : -N2 -n 64 -c 4 executable2
This was one of the things I was going to try with your small test (where "executable[12]" would be your scripts for running fesom, oifs, etc., with OMP_NUM_THREADS set to SLURM_CPUS_PER_TASK in those scripts, which I think should take the values from the -c srun option). I wasn't entirely sure it would work, but I can try this with a small noddy test example I have.
If you prefer, then I think you can reinstate the tasksets. The reason I removed these is because I noticed an error in the output from setting them up, which meant you were running with, e.g., taskset=0-0:
256: /lus/h2resw01/scratch/duts/runtime/awicm3-frontiers-xios/incoredtim/run_20000101-20000103/work/./prog_rnfmap.sh: line 3: ((: init = : syntax error: operand expected (error token is "= ")
257: /lus/h2resw01/scratch/duts/runtime/awicm3-frontiers-xios/incoredtim/run_20000101-20000103/work/./prog_xios.sh: line 3: ((: init = : syntax error: operand expected (error token is "= ")
256: rnfmap taskset -c 0-127
257: xios taskset -c 0-31
85: /lus/h2resw01/scratch/duts/runtime/awicm3-frontiers-xios/incoredtim/run_20000101-20000103/work/./prog_fesom.sh: line 3: ((: init = : syntax error: operand expected (error token is "= ")
101: fesom taskset -c 0-0
101: /lus/h2resw01/scratch/duts/runtime/awicm3-frontiers-xios/incoredtim/run_20000101-20000103/work/./prog_fesom.sh: line 3: ((: init = : syntax error: operand expected (error token is "= ")
80: fesom taskset -c 0-0
So I think both fesom and oifs were running with taskset=0-0. I had a feeling this meant all tasks were running on just one core, and this is why it was so slow. Looking at the script that sets this up, I think perhaps PMI_RANK isn't set when these scripts run. But, to be honest, I don't really understand what this is trying to do or what the expected taskset command should be.
Best regards - and have a good evening
Paul
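A minimal sketch of the kind of per-component wrapper script Paul describes here (the file name, executable name and call are assumptions for illustration, not the actual ESM-Tools scripts):
#!/bin/bash
# hypothetical wrapper, e.g. run_oifs.sh, started by one of the srun MPMD slots above;
# it picks up its thread count from the -c value srun was given for that slot
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}
exec ./oifs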
Hi @tsemmler05,
srun -N1 -n 64 -c 2 executable1 : -N2 -n 64 -c 4 executable2
This was one of the things I was going to try with your small test (where "executable[12]" would be your scripts for running fesom, oifs, etc., with OMP_NUM_THREADS set to SLURM_CPUS_PER_TASK in those scripts, which I think should take the values from the -c srun option). I wasn't entirely sure it would work.
If this works, I can very easily implement it in ESM-Tools.
The reason I removed these is because I noticed an error in the output from setting them up, which meant you were running with, e.g., taskset=0-0:
256: /lus/h2resw01/scratch/duts/runtime/awicm3-frontiers-xios/incoredtim/run_20000101-20000103/work/./prog_rnfmap.sh: line 3: ((: init = : syntax error: operand expected (error token is "= ")
This, however, might be a clue about why it is not working with the taskset approach. I will compare what you have in prog_rnfmap.sh with what we are generating on another HPC, and see if the problem is there.
So two things are going on in parallel: Paul trying to come up with a better solution that could be easily implemented in ESM-Tools, and me checking whether there is something weird in our prog_ scripts on ecmwf-atos in comparison with, for example, Levante.
O.k., regarding the prog_*.sh scripts you can check in /scratch/duts/runtime/awicm3-v3.1/restsuperslow/run_21010101-21010103/work/. Regarding the srun command, I guess I could try to change it manually and see whether I get good performance; if I understand correctly, this would be without hetjob and without taskset.
Looking at the script that sets this up, I think perhaps PMI_RANK isn't set when these scripts run. But, to be honest, I don't really understand what this is trying to do or what the expected taskset command should be.
I was playing around, echoing $PMIX_RANK from a toy script launched via srun, and it works on ecmwf-atos. This means that the following lines in the prog_*.sh should not be the problem:
if [ -z ${PMI_RANK+x} ]; then PMI_RANK=$PMIX_RANK; fi
(( init = $PMI_RANK ))
This is what I did to check this out:
salloc --ntasks 64
Create a file deleteme.sh
#!/bin/bash
echo $PMIX_RANK
Then I changed the permissions of that file to make it executable and ran it with srun:
$ srun -n 5 deleteme.sh
0
1
2
4
3
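For completeness, the syntax error in Paul's log is exactly what those two lines produce when neither rank variable ends up set in the environment the prog script sees; a quick illustrative reproduction (not ESM-Tools code):
unset PMI_RANK PMIX_RANK
if [ -z ${PMI_RANK+x} ]; then PMI_RANK=$PMIX_RANK; fi
(( init = $PMI_RANK ))   # -> ((: init = : syntax error: operand expected (error token is "= ")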
Hi! I tried out the following run command:
time srun -N45 -n 5760 -c 1 fesom : -N10 -n 640 -c 2 oifs : -N1 -n 1 -c 128 rnfma : -N1 -n 4 -c 32 xios.x 2>&1 &
But I'm getting the following seg fault:
[ac5-1023:1933764:0:1934213] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x145bc20fd520)
This is the same error that I got before when trying
time srun -l --kill-on-bad-exit=1 --cpu_bind=cores --distribution=cyclic:cyclic --multi-prog hostfile_srun 2>&1 &
(compare directories /scratch/duts/runtime/awicm3-frontiers-xios/omp4_1_1/run_20000101-20000101/scripts and /scratch/duts/runtime/awicm3-v3.1/restlev640omp2/run_21010101-21010103/scripts)
So maybe there is a general problem at ECMWF when trying to run OpenIFS with a number of OMP threads unequal to 1?
So, does that mean that I could use the prog_*.sh files as they are generated by esm_tools and that I should only change the srun command (because without changing that I am getting the super slow simulation)?
Paul gave me this suggestion; I'm not sure about his question regarding MPI_COMM_WORLD:
Hi again Tido,
Sorry - our emails crossed !
So the "heterogeneous job" option doesn't work either. Do your executables share the MPI_COMM_WORLD or do each run with a separate MPI_COMM_WORLD ? Maybe you can try to go back to using taskset but with SLURM_PROCID instead of PMI_RANK ?
I did also try to use mpirun instead of srun as I think PMI_RANK should be set by mpirun. However, I've not yet been able to get this to work.
I'll also try to investigate if there are other options you can try.
Best regards
Paul
But how would I use SLURM_PROCID instead of PMI_RANK? In the prog_*.sh scripts PMI_RANK is used as a variable, and I don't know where in the esm scripts it is coming from.
So, does that mean that I could use the prog_*.sh files as they are generated by esm_tools and that I should only change the srun command (because without changing that I am getting the super slow simulation)?
To me it means that in principle it should work, but maybe I'm not getting Paul's point.
Do your executables share the MPI_COMM_WORLD
I'm pretty sure the answer is yes.
I did also try to use mpirun instead of srun as I think PMI_RANK should be set by mpirun.
It should work with srun because the line is not only about PMI_RANK: if PMIX_RANK exists, it is used as the value for PMI_RANK.
But how would I use SLURM_PROCID instead of PMI_RANK? In the prog_*.sh scripts PMI_RANK is used as a variable, and I don't know where in the esm scripts it is coming from.
For now you can do this manually; if it works, then I can implement it in ESM-Tools.
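As a hedged sketch of what that manual change could look like (assuming the two lines quoted earlier are the only place the rank is read; SLURM_PROCID is set by srun for every task):
if [ -z ${PMI_RANK+x} ]; then PMI_RANK=${PMIX_RANK:-$SLURM_PROCID}; fi
(( init = $PMI_RANK ))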
One last question @tsemmler05: do you have an example with taskset but with the srun command containing --cpu_bind=cores --distribution=cyclic:cyclic? If not, that might be the problem. I remember that on Levante cyclic binding was rather important.
Just to confirm, all executables share one MPI_COMM_WORLD.
The srun command on levante does not use any cpu_bind nor distribution:
time srun -l --hint=nomultithread --multi-prog hostfile_srun 2>&1 &
See: /work/ab0246/a270092/runtime/awicm3-v3.2/main_ice_thermo/
I can make such an example - at this stage I only used taskset without cyclic distribution and with cpu_bind=none. I'll let you know the outcome.
This results again in the super slow simulation, which makes sense given the bug in task assignment that Paul Dando had spotted:
Excerpt from the log file:
1019: /lus/h2resw01/scratch/duts/runtime/awicm3-v3.1/restlevtaskcyclic/run_21010101-21010103/work/./prog_fesom.sh: line 3: ((: init = : syntax error: operand expected (error token is "= ")
524: fesom taskset -c 0-0
We seem to have made some headway. We found that the default process management interface for srun on ECMWF Atos (pmix_v3) does not correctly set PMIX_RANK. Why, I do not know.
With srun --mpi=list, it is possible to get a list of installed PMIs. We saw that pmi2 was installed and tried that out. Here, PMI_RANK is set, which should enable us to use the taskset solution. Tests are ongoing.
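For reference, the kind of quick check involved (commands illustrative; the list of plugins and the ranks printed will depend on the installation):
srun --mpi=list
srun --mpi=pmi2 -n 4 bash -c 'echo "task $SLURM_PROCID: PMI_RANK=$PMI_RANK"'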
Interesting. Let me know how it goes and if I need to make any changes in the ESM-Tools source code. I'm still blocking Friday (from 11:30 onwards) in case I need to work on ESM-Tools for this topic.
This change is easy to implement in esm_tools: one just has to add
launcher_flags: "--mpi=pmi2 -l --kill-on-bad-exit=1 --cpu_bind=cores --distribution=cyclic:cyclic"
to ecmwf-atos.yaml.
However, running the model with taskset takes twice as long as running it without taskset (when the number of OMP threads is set to 1; at this stage, setting it to something greater than 1 results in a seg fault). Therefore, I am considering abandoning the taskset method for ecmwf-atos, especially since I can't use OMP threads greater than 1 in OpenIFS anyway - which was the whole motivation for trying to get taskset running.
In that case, if Paul Dando doesn't come up with any solution to these issues, I might need some help to take out the parts for ECMWF atos in prog_*.sh that do the processor redistribution, and also to take out the corresponding parts from the generated run script. @mandresm: thanks for blocking Friday from 11:30 for getting this to work.
Okay, then I'll contact you via Webex at 11:30 (German time) on Friday to catch up on the status.
A bug in the esm-tools-derived distribution of processors on ECMWF Atos has been detected by Paul Dando from ECMWF, after I reported extremely slow execution times of AWI-CM 3.1 on the ECMWF Atos machine.
I am running esm-tools version 6.37.2.
Paul Dando asked me to let the machine do the distribution of the processors by doing the following (quoting Paul Dando's support message), and this has led to an execution time comparable to DKRZ Levante:
Hi Tido,
I hope you had a good break.
My suspicions were also that there was a restart file in the work directory that was being read when I re-ran and which caused the floating point exception. From what I can see from the output, it ran successfully the first time but then failed when I retried.
The main changes I needed to make were to the prog_fesom.sh, prog_oifs.sh, etc. scripts in the work directory. I think there's a bug in these which means the taskset command is trying to run all threads on a single core. In fact, I don't think you need the taskset at all, so I changed these prog_*.sh scripts so that they just call, e.g., script_fesom.sh, directly. For example, for prog_fesom.sh, I have:
#!/bin/sh
./script_fesom.sh
and similarly for the other prog_*.sh scripts for oifs, xios and rnfmap. Alternatively, I suppose you could also just change the hostfile_srun file to call the ./script_fesom.sh, etc., scripts directly.
I also removed the part that creates the hostlist file in incoredtim_compute_20000101-20000103.run as you shouldn't need this either. Finally, I changed the srun command to:
time srun -l --kill-on-bad-exit=1 --cpu_bind=cores --distribution=cyclic:cyclic --multi-prog hostfile_srun 2>&1 &
I was then going to play a little more with the cpu-bind and distribution options to see if I could find a better combination.
With this setup, I think you should also be able to set OMP_NUM_THREADS for the OpenIFS executable (although I didn't try this).
Please let me know if this setup also works for you.
Best regards
Paul