Open tsemmler05 opened 1 week ago
Hi @tsemmler05, this is most likely a problem with the configuration of ecmwf.yaml
as it is probably missing the heterogeneous flags:
Once you add that, you should then see something like:
#!/usr/bin/bash
#SBATCH --nodes=3
#SBATCH hetjob
#SBATCH --nodes=2
#SBATCH hetjob
#SBATCH --nodes=1
#SBATCH hetjob
#SBATCH --nodes=1
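For illustration, the header layout above can be produced from a list of per-component node counts. This is only a minimal sketch and not how esm_tools builds its header; `het_header` is a hypothetical helper. It assumes each component needs just a `--nodes` line and that components are separated by a bare `#SBATCH hetjob` line.

```python
def het_header(node_counts, shebang="#!/usr/bin/bash"):
    """Build a heterogeneous-job sbatch header, one component per entry.

    Hypothetical helper for illustration; not part of esm_tools.
    """
    lines = [shebang]
    for index, nodes in enumerate(node_counts):
        if index > 0:
            # A bare "#SBATCH hetjob" line separates consecutive components.
            lines.append("#SBATCH hetjob")
        lines.append(f"#SBATCH --nodes={nodes}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Four components with 3, 2, 1 and 1 nodes, as in the header above.
    print(het_header([3, 2, 1, 1]), end="")
```

Note that the separator appears only between components, never as a dangling or empty `#SBATCH` line.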
Anyhow, we will meet later today to talk about this and help you configure the machine :)
Well, it looks like this now:
Probably we will have to get rid of the extra #SBATCH?
No matter if I remove the extra #SBATCH lines or not, I am getting this error when submitting the job:
sbatch: error: unrecognized arguments: hetjob_compute_20000101-20000131.run
Some empty #SBATCH lines are still there. Can you try to delete them manually and submit?
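If deleting the lines by hand gets tedious, something like the following could do it. This is a rough sketch under the assumption that an "empty" directive is a line containing only `#SBATCH` plus optional whitespace; `strip_empty_sbatch` is a hypothetical helper, not an esm_tools function.

```python
import re

# Matches a line that is only "#SBATCH", optionally surrounded by whitespace.
EMPTY_SBATCH = re.compile(r"^\s*#SBATCH\s*$")


def strip_empty_sbatch(script_text):
    """Return the script text without empty '#SBATCH' lines.

    Hypothetical helper for illustration; lines such as '#SBATCH hetjob'
    or '#SBATCH --nodes=3' carry an argument and are kept.
    """
    kept = [
        line
        for line in script_text.splitlines()
        if not EMPTY_SBATCH.match(line)
    ]
    return "\n".join(kept) + "\n"


if __name__ == "__main__":
    sample = (
        "#!/usr/bin/bash\n"
        "#SBATCH --nodes=3\n"
        "#SBATCH\n"
        "#SBATCH hetjob\n"
        "#SBATCH --nodes=2\n"
    )
    print(strip_empty_sbatch(sample), end="")
```

To use it, read the generated .run script into a string, pass it through the function, and write the result to a new file before submitting with sbatch.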
Maybe the other is the partition_flag from here? https://github.com/esm-tools/esm_tools/blob/90c23886c53bfa0117893e3830b21a43f9c99ee9/src/esm_runscripts/slurm.py#L186-L222
When submitting the esm run script /scratch/duts/runtime/awicm3-frontiers-xios/taskfalse1/run_20000101-20000101/scripts/awicm3-frontiers-xios-ecmwf-atos-TCO95L91-CORE2_initial.yaml, I am getting a wrong sbatch header:
#!/usr/bin/bash
#SBATCH --nodes=3
#SBATCH
#SBATCH
#SBATCH --nodes=2
#SBATCH
#SBATCH
#SBATCH --nodes=1
#SBATCH
#SBATCH
#SBATCH --nodes=1
#SBATCH
In other words, a lot is missing. You can see the resulting .run script here: /scratch/duts/runtime/awicm3-frontiers-xios/taskfalse1/run_20000101-20000101/scripts/taskfalse1_compute_20000101-20000101.run
I am using esm_tools version 6.37.2