esm-tools / esm_tools

Simple Infrastructure for Earth System Simulations
https://esm-tools.github.io/
GNU General Public License v2.0

SLURM batch configs ignored by esm_tools #470

Open chrisdane opened 3 years ago

chrisdane commented 3 years ago

Describe the bug

The esm_tools release branch ignores the following runscript entries:

general:
        mail_type: "FAIL"
        mail_user: "cdanek@awi.de"
        partition: "compute,compute2"

Instead, the resulting .sad file contains the Mistral default values:

#SBATCH --partition=compute
#SBATCH --mail-type=NONE
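For comparison, a .sad header that honors these runscript entries would look something like this (assuming one #SBATCH line per flag):

#SBATCH --partition=compute,compute2
#SBATCH --mail-type=FAIL
#SBATCH --mail-user=cdanek@awi.de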

See runscript and .sad file:

/work/ab0246/a270073/out/fesom-2.0/test/run_20010101-20010131/scripts/fesom2-mistral-initial-monthly.run.yaml
/work/ab0246/a270073/out/fesom-2.0/test/run_20010101-20010131/scripts/test_compute_20010101-20010131.sad

To Reproduce

Steps to reproduce the behavior:

esm_runscripts /work/ab0246/a270073/out/fesom-2.0/test/run_20010101-20010131/scripts/fesom2-mistral-initial-monthly.run.yaml -e test -c

Expected behavior

Valid user parameters should always take precedence over the machine defaults.

System (please complete the following information):

esm_analysis
├─ version: 0.3.0
├─ path: /mnt/lustre01/pf/a/a270073/.local/lib/python3.6/site-packages
├─ branch:
└─ tags:

esm_archiving
├─ version: 5.0.0
├─ path: /mnt/lustre01/pf/a/a270073/.local/lib/python3.6/site-packages
├─ branch:
└─ tags:

esm_calendar
├─ version: 5.0.0
├─ path: /mnt/lustre01/pf/a/a270073/.local/lib/python3.6/site-packages
├─ branch:
└─ tags:

esm_database
├─ version: 5.0.0
├─ path: /mnt/lustre01/pf/a/a270073/.local/lib/python3.6/site-packages
├─ branch:
└─ tags:

esm_environment
├─ version: 5.1.3
├─ path: /mnt/lustre01/pf/a/a270073/.local/lib/python3.6/site-packages
├─ branch:
└─ tags:

esm_master
├─ version: 5.1.6
├─ path: /mnt/lustre01/pf/a/a270073/esm/src/esm_master
├─ branch: develop
└─ tags: v5.1.6-10-g96c2b66

esm_motd
├─ version: 5.0.2
├─ path: /mnt/lustre01/pf/a/a270073/.local/lib/python3.6/site-packages
├─ branch:
└─ tags:

esm_parser
├─ version: 5.1.12
├─ path: /mnt/lustre01/pf/a/a270073/.local/lib/python3.6/site-packages
├─ branch:
└─ tags:

esm_plugin_manager
├─ version: 5.0.1
├─ path: /mnt/lustre01/pf/a/a270073/.local/lib/python3.6/site-packages
├─ branch:
└─ tags:

esm_profile
├─ version: 5.0.0
├─ path: /mnt/lustre01/pf/a/a270073/.local/lib/python3.6/site-packages
├─ branch:
└─ tags:

esm_rcfile
├─ version: 5.1.0
├─ path: /mnt/lustre01/pf/a/a270073/.local/lib/python3.6/site-packages
├─ branch:
└─ tags:

esm_runscripts
├─ version: 5.1.31
├─ path: /mnt/lustre01/pf/a/a270073/esm/src/esm_runscripts
├─ branch: release
└─ tags: v5.1.30-4-ge956fff

esm_tools
├─ version: 5.1.23
├─ path: /mnt/lustre01/pf/a/a270073/esm/esm_tools
├─ branch: release
└─ tags: v5.1.23-1-g1974cd7

esm_version_checker
├─ version: 5.1.5
├─ path: /mnt/lustre01/pf/a/a270073/.local/lib/python3.6/site-packages
├─ branch:
└─ tags:

mandresm commented 3 years ago

Hi @chrisdane, all these variables belong in the computer section, not in general. If you add a computer section and place the variables there, that should do the job.
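Applied to the entries from the bug report, that would be:

computer:
        mail_type: "FAIL"
        mail_user: "cdanek@awi.de"
        partition: "compute,compute2"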

denizural commented 3 years ago

Hi @chrisdane, regarding the e-mail part, that is something I already fixed a while back, but as @mandresm says, these variables should go under the computer section instead. We can close the issue once you have verified that this works.

pgierz commented 3 years ago

@denizural, if it is fixed, are you sure it was also merged?

denizural commented 3 years ago

Yes @paul, that was a long time ago, when Nadine found it: she saw that Slurm was not sending an e-mail. See https://github.com/esm-tools/esm_tools/issues/232

chrisdane commented 3 years ago

Thanks for your help!

  1. In my view it does not make sense to split SBATCH arguments across different yaml sections. Why does account belong in general while mail_type (which is actually called mail-type in slurm) goes in the computer section? As a user, I find that counter-intuitive.

  2. The runscript entry

    computer:
        mail_type: "FAIL"
        mail_user: "mail@mail.de"

    results in

    #SBATCH --mail-type=FAIL --mail-user=mail@mail.de

    i.e. on one line rather than on two separate lines in the .sad file.

  3. The runscript entry

    computer:
        partition: "compute,compute2"

    yields the error

    Traceback (most recent call last):
      File "/pf/a/a270073/.local/bin/esm_runscripts", line 11, in <module>
        load_entry_point('esm-runscripts', 'console_scripts', 'esm_runscripts')()
      File "/mnt/lustre01/pf/a/a270073/esm/src/esm_runscripts/esm_runscripts/cli.py", line 225, in main
        Setup()
      File "/mnt/lustre01/pf/a/a270073/esm/src/esm_runscripts/esm_runscripts/sim_objects.py", line 55, in __call__
        self.compute(*args, **kwargs)
      File "/mnt/lustre01/pf/a/a270073/esm/src/esm_runscripts/esm_runscripts/sim_objects.py", line 89, in compute
        self.config = compute.run_job(self.config)
      File "/mnt/lustre01/pf/a/a270073/esm/src/esm_runscripts/esm_runscripts/compute.py", line 36, in run_job
        config = evaluate(config, "compute", "compute_recipe")
      File "/mnt/lustre01/pf/a/a270073/esm/src/esm_runscripts/esm_runscripts/helpers.py", line 101, in evaluate
        framework_recipe, framework_plugins, config
      File "/pf/a/a270073/.local/lib/python3.6/site-packages/esm_plugin_manager/esm_plugin_manager.py", line 138, in work_through_recipe
        config = getattr(submodule, workitem)(config)
      File "/mnt/lustre01/pf/a/a270073/esm/src/esm_runscripts/esm_runscripts/batch_system.py", line 287, in write_simple_runscript
        header = batch_system.get_batch_header(config)
      File "/mnt/lustre01/pf/a/a270073/esm/src/esm_runscripts/esm_runscripts/batch_system.py", line 66, in get_batch_header
        tasks, nodes = batch_system.calculate_requirements(config)
      File "/mnt/lustre01/pf/a/a270073/esm/src/esm_runscripts/esm_runscripts/batch_system.py", line 135, in calculate_requirements
        cores_per_node = config['computer']['cores_per_node']
    KeyError: 'cores_per_node'

Context: https://www.dkrz.de/up/systems/mistral/running-jobs/job-script-migration#section-2

chrisdane commented 3 years ago

Regarding 3): using

computer:
    partition: "compute,compute2"
    cores_per_node: 72 # common multiple of 24 (compute) and 36 (compute2) cores (= tasks) per node

works and yields

#SBATCH --partition=compute,compute2
#SBATCH --ntasks=288

in the .sad file, i.e. the same --ntasks as with the default settings:

#SBATCH --partition=compute
#SBATCH --ntasks=288

This also works for the awicm-1.0-recom setup (resulting in --ntasks=864).

pgierz commented 3 years ago

For 3, that partition combination needs to be added to the available partitions in several places in the configuration yamls... https://github.com/esm-tools/esm_tools/blob/564d6874b5236ef599721ed37dd9649cbccc0dc6/configs/machines/mistral.yaml

Be aware that mixing compute and compute2 may lead to non-reproducible results!
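Something along these lines would presumably be needed there (a hypothetical sketch only; the key names are illustrative and not the actual mistral.yaml schema):

# hypothetical sketch -- the real keys in configs/machines/mistral.yaml may differ
partitions:
    compute:
        cores_per_node: 24
    compute2:
        cores_per_node: 36
    "compute,compute2":
        cores_per_node: 72    # least common multiple of 24 and 36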

mandresm commented 3 years ago

  1. That can easily be included. I'll have a look at it tomorrow.

  2. Is there a technical problem with having it on the same line in the .sad file?

chrisdane commented 3 years ago

  2. No, just very unusual.

denizural commented 3 years ago

I had a detailed discussion of these issues with @chrisdane and here are our agreements:

mandresm commented 3 years ago

I am already working on your third point, modifying slurm.yaml.

mandresm commented 3 years ago

Points 1 and 3 are resolved and merged. I think point 2 is more a matter of personal preference; I prefer to have all the e-mail info on one single line, but I also don't have a strong opinion about it. I don't think this needs to be changed now, as there are more pressing issues going on. So @chrisdane, feel free to keep this open if you want to further discuss point 2 and we will add it to the wish list, or just close the issue if you are fine with it staying as it is.

chrisdane commented 3 years ago

Thanks a lot.

The changes you made only affect the mail settings and cores_per_node in the case of compute,compute2. What happens if I add/edit any of the huge number of SLURM sbatch arguments (https://slurm.schedmd.com/sbatch.html)? Where do I have to enter these? And why do I have to call it mail_type although the actual sbatch argument is called mail-type? This is still not intuitive at all.

I'm afraid the real solution to this problem is neither a general nor a computer block, but a slurm block (or similar) whose entries are passed 1:1 to the actual sbatch call. Otherwise you will have to write a special case for every single one of these arguments? Or am I misunderstanding something...

mandresm commented 3 years ago

I could change them to mail-type and mail-user if that's a problem, and hope that people are not already using the underscore versions and get annoyed when those stop working...

As for your general suggestion, I think that belongs in the cleanup phase, but we are not there yet. We are working on version 6.0 and I don't see room for big changes right now. However, computer.additional_flags allows you to set up any slurm flags you want:

computer:
    additional_flags: "--<slurm_flag1>=<your_choice1> --<slurm_flag2>=<your_choice2>"

It might not be beautiful, but at least it's flexible, and IMO we can live with that right now.
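For example, the mail settings from this issue could be passed through that way (assuming the string is appended verbatim to the sbatch call, as the template above suggests):

computer:
    additional_flags: "--mail-type=FAIL --mail-user=cdanek@awi.de"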

pgierz commented 3 years ago

My 2 cents: I agree with Chris that we should replicate the slurm flag names as closely as we can. Since yaml allows - in variable names (Python does not, but we can convert in the background), that should not be a problem.

Not for the "to do right now" list, but perhaps we just make an entirely new chapter for the batch system, e.g.:

slurm:
    mail-type: fail
    mail-user: whoami@whatamidoing.com
    partition: supercomputer1

That could then map directly to slurm flags. We could do similar things for PBS and MOAB (does any machine we use have MOAB?).
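A hypothetical PBS analogue might look like this (the key names are illustrative only; a mapping layer would have to translate them into #PBS directives):

pbs:
    queue: supercomputer1                  # would map to "#PBS -q supercomputer1"
    mail-events: ae                        # would map to "#PBS -m ae" (mail on abort/end)
    mail-address: whoami@whatamidoing.com  # would map to "#PBS -M whoami@whatamidoing.com"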

chrisdane commented 3 years ago

Thanks for the additional_flags hint, Miguel, that is indeed helpful.

I would keep this open as an enhancement =)