esm-tools / esm_tools

Simple Infrastructure for Earth System Simulations
https://esm-tools.github.io/
GNU General Public License v2.0

sbatch header wrong on ECMWF-atos when taskset=false #1214

Open tsemmler05 opened 1 week ago

tsemmler05 commented 1 week ago

When submitting the esm run script /scratch/duts/runtime/awicm3-frontiers-xios/taskfalse1/run_20000101-20000101/scripts/awicm3-frontiers-xios-ecmwf-atos-TCO95L91-CORE2_initial.yaml, I am getting a wrong sbatch header:

#!/usr/bin/bash
#SBATCH --nodes=3
#SBATCH
#SBATCH
#SBATCH --nodes=2
#SBATCH
#SBATCH
#SBATCH --nodes=1
#SBATCH
#SBATCH
#SBATCH --nodes=1
#SBATCH

In other words, a lot is missing. You can see the resulting .run script here: /scratch/duts/runtime/awicm3-frontiers-xios/taskfalse1/run_20000101-20000101/scripts/taskfalse1_compute_20000101-20000101.run

I am using esm_tools version 6.37.2

mandresm commented 1 week ago

Hi @tsemmler05, this is most likely a problem with the configuration of ecmwf.yaml as it is probably missing the heterogeneous flags:

https://github.com/esm-tools/esm_tools/blob/90c23886c53bfa0117893e3830b21a43f9c99ee9/configs/machines/levante.yaml#L46C14-L46C20

Once you add that, you should then get something like:

#!/usr/bin/bash
#SBATCH --nodes=3
#SBATCH hetjob
#SBATCH --nodes=2
#SBATCH hetjob
#SBATCH --nodes=1
#SBATCH hetjob
#SBATCH --nodes=1

Anyhow, we will meet later today to talk about this and help you configure the machine :)
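For reference, the expected header layout above can be sketched in a few lines of Python. This is only an illustration of the structure (one `#SBATCH --nodes=N` line per component, separated by `#SBATCH hetjob`), not the code esm_tools uses; the function name `build_het_header` and its parameters are hypothetical.

```python
def build_het_header(node_counts, shebang="#!/usr/bin/bash"):
    """Assemble a heterogeneous-job sbatch header (illustrative only).

    node_counts: list of node counts, one per job component.
    This is a hypothetical helper, not part of esm_tools.
    """
    lines = [shebang]
    for index, nodes in enumerate(node_counts):
        # Components after the first are introduced by a hetjob separator.
        if index > 0:
            lines.append("#SBATCH hetjob")
        lines.append(f"#SBATCH --nodes={nodes}")
    return "\n".join(lines)


header = build_het_header([3, 2, 1, 1])
print(header)
```

Called with `[3, 2, 1, 1]`, this reproduces the eight-line header shown above, with no empty `#SBATCH` lines between components.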

tsemmler05 commented 1 week ago

Well, it looks like this now:

#!/usr/bin/bash
#SBATCH --nodes=1
#SBATCH
#SBATCH hetjob
#SBATCH --nodes=1
#SBATCH
#SBATCH hetjob
#SBATCH --nodes=1
#SBATCH
#SBATCH hetjob
#SBATCH --nodes=1
#SBATCH

We will probably have to get rid of the extra empty #SBATCH lines?

tsemmler05 commented 1 week ago

No matter if I remove the extra #SBATCH lines or not, I am getting this error when submitting the job:

sbatch: error: unrecognized arguments: hetjob_compute_20000101-20000131.run

JanStreffing commented 1 week ago

Some empty #SBATCH lines are still there. Can you try to delete them manually and submit?

Maybe the other empty #SBATCH line comes from the partition_flag here? https://github.com/esm-tools/esm_tools/blob/90c23886c53bfa0117893e3830b21a43f9c99ee9/src/esm_runscripts/slurm.py#L186-L222
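As a stopgap for the manual deletion suggested above, a small filter could drop the empty `#SBATCH` lines from a generated .run script before submitting. This is a minimal sketch, assuming an "empty" line is one that contains only `#SBATCH` and whitespace; the function name `strip_empty_sbatch` is hypothetical and this is not part of esm_tools.

```python
def strip_empty_sbatch(script_text):
    """Remove lines that consist solely of '#SBATCH' (no flags).

    Hypothetical helper for cleaning a generated run script by hand;
    all other lines, including '#SBATCH hetjob', are kept unchanged.
    """
    kept = [
        line
        for line in script_text.splitlines()
        if line.strip() != "#SBATCH"
    ]
    return "\n".join(kept)


# Header as reported in this issue after adding the hetjob flag.
reported = "\n".join([
    "#!/usr/bin/bash",
    "#SBATCH --nodes=1",
    "#SBATCH",
    "#SBATCH hetjob",
    "#SBATCH --nodes=1",
    "#SBATCH",
])
cleaned = strip_empty_sbatch(reported)
print(cleaned)
```

This only cleans up the symptom; the empty lines would still need to be traced back to whichever configuration entry resolves to an empty string.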