@chrisdane opened this issue 3 years ago
Hi @chrisdane, all these variables belong to the `computer` section, not `general`. If you add a `computer` section and place the variables there, that should do the job.
Hi @chrisdane, regarding the email part, that was something I already fixed before, but as @mandresm says, these variables should go under the `computer` section instead. We can close the issue once you have verified them.
@denizural, if it is fixed, are you sure it was also merged?
Yes, @paul, that was a long time ago, when Nadine found it: she saw that Slurm was not sending an e-mail. Here: https://github.com/esm-tools/esm_tools/issues/232
Thanks for your help
In my view it does not make sense to split SBATCH arguments over different yaml sections. Why does `account` belong to `general`, but `mail_type` (which is actually called `mail-type` in SLURM) to the `computer` section? As a user I find that counter-intuitive.
The runscript entry

```yaml
computer:
    mail_type: "FAIL"
    mail_user: "mail@mail.de"
```

results in

```
#SBATCH --mail-type=FAIL --mail-user=mail@mail.de
```

i.e. in one line and not two lines in the `.sad` file.
The runscript entry

```yaml
computer:
    partition: "compute,compute2"
```

yields the error
```
Traceback (most recent call last):
  File "/pf/a/a270073/.local/bin/esm_runscripts", line 11, in <module>
    load_entry_point('esm-runscripts', 'console_scripts', 'esm_runscripts')()
  File "/mnt/lustre01/pf/a/a270073/esm/src/esm_runscripts/esm_runscripts/cli.py", line 225, in main
    Setup()
  File "/mnt/lustre01/pf/a/a270073/esm/src/esm_runscripts/esm_runscripts/sim_objects.py", line 55, in __call__
    self.compute(*args, **kwargs)
  File "/mnt/lustre01/pf/a/a270073/esm/src/esm_runscripts/esm_runscripts/sim_objects.py", line 89, in compute
    self.config = compute.run_job(self.config)
  File "/mnt/lustre01/pf/a/a270073/esm/src/esm_runscripts/esm_runscripts/compute.py", line 36, in run_job
    config = evaluate(config, "compute", "compute_recipe")
  File "/mnt/lustre01/pf/a/a270073/esm/src/esm_runscripts/esm_runscripts/helpers.py", line 101, in evaluate
    framework_recipe, framework_plugins, config
  File "/pf/a/a270073/.local/lib/python3.6/site-packages/esm_plugin_manager/esm_plugin_manager.py", line 138, in work_through_recipe
    config = getattr(submodule, workitem)(config)
  File "/mnt/lustre01/pf/a/a270073/esm/src/esm_runscripts/esm_runscripts/batch_system.py", line 287, in write_simple_runscript
    header = batch_system.get_batch_header(config)
  File "/mnt/lustre01/pf/a/a270073/esm/src/esm_runscripts/esm_runscripts/batch_system.py", line 66, in get_batch_header
    tasks, nodes = batch_system.calculate_requirements(config)
  File "/mnt/lustre01/pf/a/a270073/esm/src/esm_runscripts/esm_runscripts/batch_system.py", line 135, in calculate_requirements
    cores_per_node = config['computer']['cores_per_node']
KeyError: 'cores_per_node'
```
context: https://www.dkrz.de/up/systems/mistral/running-jobs/job-script-migration#section-2
Regarding 3): using

```yaml
computer:
    partition: "compute,compute2"
    cores_per_node: 72 # multiple of 24 (compute) and 36 (compute2) tasks (=cores) per node
```

works and yields

```
#SBATCH --partition=compute,compute2
#SBATCH --ntasks=288
```

in the `.sad` file, i.e. the same batch args as the default:

```
#SBATCH --partition=compute
#SBATCH --ntasks=288
```

This also works for the `awicm-1.0-recom` setup (resulting in `--ntasks=864`).
For 3), that needs to be added to the available partitions in several places in the configuration yamls... https://github.com/esm-tools/esm_tools/blob/564d6874b5236ef599721ed37dd9649cbccc0dc6/configs/machines/mistral.yaml
Be aware that mixing compute and compute2 may lead to non-reproducible results!
That can be easily included. I'll have a look at that tomorrow.
Is there a technical problem with having it on the same line in the `.sad` file?
I had a detailed discussion of these issues with @chrisdane and here are our agreements:

Some variables going to `computer` and some going to `general` is confusing, like in the case of `partition`. Users can't memorize what should go where. This needs to be addressed. A set of `common_general_computer` variables can be provided, and if they appear in the `general` section of the runscript they will be written to the `computer` section in the final config. Another option is fixing it in the code and making a hotfix (including the sample runscripts provided).

I am already working on your third point, modifying the `slurm.yaml`.
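To make that remapping idea a bit more concrete, here is a minimal sketch; the list name `COMMON_GENERAL_COMPUTER`, the chosen keys, and the flat dict layout are assumptions for illustration, not the actual esm_parser code:

```python
# Hypothetical whitelist of variables that users tend to put in "general"
# but that belong in "computer" (names chosen for illustration only).
COMMON_GENERAL_COMPUTER = ["partition", "mail_type", "mail_user"]

def move_general_to_computer(config):
    """Write whitelisted keys found in config['general'] into config['computer']."""
    general = config.get("general", {})
    computer = config.setdefault("computer", {})
    for key in COMMON_GENERAL_COMPUTER:
        if key in general:
            value = general.pop(key)
            # An explicit entry in "computer" keeps precedence over "general".
            computer.setdefault(key, value)
    return config
```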
Points 1 and 3 are resolved and merged. I think point 2 is more a matter of personal preference; I prefer to have all the info about the email on one single line, but I also don't have a strong opinion about it... I don't think this needs to be changed now, as there are more pressing issues going on. So @chrisdane, feel free to keep this open if you want to further discuss point 2 and we can add it to the wish list, or just close the issue if you are fine with it staying as it is.
Thanks a lot.
The changes you made only affect the mail settings and the cores per node in the case of compute,compute2. What will happen if I add/edit any of the huge number of SLURM sbatch arguments (https://slurm.schedmd.com/sbatch.html)? Where do I have to enter these? Why do I have to call it `mail_type` although the actual sbatch argument is called `mail-type`? Still not intuitive at all.
I'm afraid the real solution to this problem is neither a `general` nor a `computer` block, but a `slurm` block or similar, through which sbatch args would be passed 1:1 to the actual sbatch call. Otherwise you will have to write a case for every one of these arguments? Or am I misunderstanding something ...
I could change it to `mail-type` and `mail-user` if that's a problem, and hope that people are not already using the underscore version and get annoyed when it stops working...
For your general suggestion, I think that belongs in the cleanup phase, but we are not yet there. We are working on version 6.0 and I don't see a place for big changes right now. However, `computer.additional_flags` allows you to set up all the flags you want for slurm:

```yaml
computer:
    additional_flags: "--<slurm_flag1>=<your_choice1> --<slurm_flag2>=<your_choice2>"
```

It might not be beautiful, but at least it's flexible, and IMO we can live with that right now.
My 2 cents: I agree with Chris that we should replicate the slurm flag names as best we can. Since YAML allows `-` in variable names (Python does not, but we can convert in the background), that should not be a problem.
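Just to illustrate that background conversion (the helper name here is made up, not an existing esm_parser function):

```python
def pythonize_keys(section):
    """Turn YAML keys like 'mail-type' into Python-friendly 'mail_type'."""
    return {key.replace("-", "_"): value for key, value in section.items()}

# pythonize_keys({"mail-type": "FAIL"}) -> {"mail_type": "FAIL"}
```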
Not for the "To do right now" list, but perhaps we just make an entirely new chapter for the batch system? e.g.:

```yaml
slurm:
    mail-type: fail
    mail-user: whoami@whatamidoing.com
    partition: supercomputer1
```

That could then map directly to slurm flags. We can do similar things for PBS and MOAB (does any machine we use have MOAB?)
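As a rough sketch of how such a chapter could be flushed into the batch header, every key would simply be passed through 1:1; the function name and behavior below are assumptions for illustration, not existing esm_runscripts code:

```python
def slurm_section_to_sbatch(slurm_section):
    """Translate a 'slurm:' runscript section into #SBATCH header lines,
    passing every flag through 1:1 instead of special-casing each argument."""
    header = []
    for flag, value in slurm_section.items():
        if value is True:
            # Boolean flags, e.g. 'requeue: true' -> '#SBATCH --requeue'
            header.append(f"#SBATCH --{flag}")
        else:
            header.append(f"#SBATCH --{flag}={value}")
    return header

# Example with the section above:
# slurm_section_to_sbatch({"mail-type": "fail",
#                          "mail-user": "whoami@whatamidoing.com",
#                          "partition": "supercomputer1"})
# -> ['#SBATCH --mail-type=fail',
#     '#SBATCH --mail-user=whoami@whatamidoing.com',
#     '#SBATCH --partition=supercomputer1']
```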
Thanks for the `additional_flags`, Miguel, that is indeed helpful.
I would keep this open as an enhancement =)
**Describe the bug**
The esm_tools on the release branch ignore the runscript entries.

The resulting `.sad` file has the mistral default values:

See runscript and `.sad` file:

**To Reproduce**
Steps to reproduce the behavior:

**Expected behavior**
User parameters should always win if valid.

**System (please complete the following information):**
`esm_versions check`:
```
esm_analysis ├─ version: 0.3.0 ├─ path: /mnt/lustre01/pf/a/a270073/.local/lib/python3.6/site-packages ├─ branch: └─ tags:
esm_archiving ├─ version: 5.0.0 ├─ path: /mnt/lustre01/pf/a/a270073/.local/lib/python3.6/site-packages ├─ branch: └─ tags:
esm_calendar ├─ version: 5.0.0 ├─ path: /mnt/lustre01/pf/a/a270073/.local/lib/python3.6/site-packages ├─ branch: └─ tags:
esm_database ├─ version: 5.0.0 ├─ path: /mnt/lustre01/pf/a/a270073/.local/lib/python3.6/site-packages ├─ branch: └─ tags:
esm_environment ├─ version: 5.1.3 ├─ path: /mnt/lustre01/pf/a/a270073/.local/lib/python3.6/site-packages ├─ branch: └─ tags:
esm_master ├─ version: 5.1.6 ├─ path: /mnt/lustre01/pf/a/a270073/esm/src/esm_master ├─ branch: develop └─ tags: v5.1.6-10-g96c2b66
esm_motd ├─ version: 5.0.2 ├─ path: /mnt/lustre01/pf/a/a270073/.local/lib/python3.6/site-packages ├─ branch: └─ tags:
esm_parser ├─ version: 5.1.12 ├─ path: /mnt/lustre01/pf/a/a270073/.local/lib/python3.6/site-packages ├─ branch: └─ tags:
esm_plugin_manager ├─ version: 5.0.1 ├─ path: /mnt/lustre01/pf/a/a270073/.local/lib/python3.6/site-packages ├─ branch: └─ tags:
esm_profile ├─ version: 5.0.0 ├─ path: /mnt/lustre01/pf/a/a270073/.local/lib/python3.6/site-packages ├─ branch: └─ tags:
esm_rcfile ├─ version: 5.1.0 ├─ path: /mnt/lustre01/pf/a/a270073/.local/lib/python3.6/site-packages ├─ branch: └─ tags:
esm_runscripts ├─ version: 5.1.31 ├─ path: /mnt/lustre01/pf/a/a270073/esm/src/esm_runscripts ├─ branch: release └─ tags: v5.1.30-4-ge956fff
esm_tools ├─ version: 5.1.23 ├─ path: /mnt/lustre01/pf/a/a270073/esm/esm_tools ├─ branch: release └─ tags: v5.1.23-1-g1974cd7
esm_version_checker ├─ version: 5.1.5 ├─ path: /mnt/lustre01/pf/a/a270073/.local/lib/python3.6/site-packages ├─ branch: └─ tags:
```