GEOS-ESM / swell

Workflow system for coupled data assimilation applications
https://geos-esm.github.io/swell/
Apache License 2.0

SE: SLURM Nodes/Tasks wired in SWELL #308

Closed: rtodling closed this issue 5 months ago

rtodling commented 7 months ago

This is to start a discussion about generalizing the SLURM settings in SWELL. These are the related issues:

1) All tasks are associated with a single set of SLURM options.

2) All tasks are wired to use a single node, which is not enough to run 3dvar_atmos with the full observing system even for the small c90 case.

3) The wiring is such that preparing a test case at other resolutions can only be done by hand.

The above opens the can of worms of needing to generalize how these options are set. The SLURM defaults need to be (i) task dependent, (ii) model resolution dependent, (iii) analysis resolution dependent, etc.

Dooruk commented 7 months ago

Thank you for bringing this up @rtodling, I fully agree with your sentiments. The reason we assign 1 node by default is to run the CI workflow, where we have low-resolution model setups. For my local cycling and variational tests I change the source files and request more nodes, but I understand that's not a sustainable solution.

I have mentioned this in some other meetings, but for the sake of the discussion here, there are some upcoming changes on Discover that will add another level of complexity:

I built and tested JEDI on Milan nodes, but there is still work that needs to be done to build Swell dependencies and Swell itself. In my opinion, and to avoid some of the complexities, we should rely only on Milan nodes.

As you mentioned, there are different combinations of model, task, and analysis resolutions. I envision supporting low (CI tests), medium (lower node requirement), and high (operational/reanalysis) settings. We would still have defaults for most low-resolution cases but be explicit for the others. These would reside within the platform settings. The other critical allocation consideration is the ensemble tasks.

I am open to other suggestions and would like this to be an ongoing discussion. I would also like to see how this is done in the current DAS system: are multiple configurations supported, and are there scripts where node assignments are explicitly defined?

ashiklom commented 6 months ago

My recommendation:

jardizzo commented 6 months ago

Hi all, sorry to be so late commenting. Alexey's plan is the most logical one. However, Dan shied away from that approach; I believe he deferred batch job parameters to Jinja templates, and the template file was usually deposited in the experiment directory, though this may have evolved since my involvement.

The CYLC suite files must be constructed to specify all SLURM directives in order to accommodate all possible batch constraints and options. Furthermore, there are additional considerations for SWELL as a workflow component for CI and for individual users that do not have priority queue access, etc. Those will certainly require customized settings. CYLC also accepts global defaults configured in each user's home directory under `.cylc`, but that becomes a form of data hiding and is probably not the best approach either.

I'll try to review some more before making additional comments. In the end, we want a generalized approach that will carry over to different workflow engines in case we move away from CYLC. As it stands now, SWELL is bundling CYLC suites, which means we are married to CYLC until divorced!!

Dooruk commented 6 months ago

Alexey's first suggestion is what I had in mind in terms of moving the SLURM config away from `swell_create_experiment.py` (the code block he shared in #325) and into a YAML file. I think the logical place would be under `deployment/platforms/nccs_discover_*`, since Milan and Cascade Lake/Skylake nodes have different core counts per node. I tested a local branch with nccs_discover and nccs_discover_SLES15 (to handle the different spack-stack modules and OS infrastructure); SLES15 will be the Discover default anyhow.

To repeat what I mentioned above, I suggest having, under platforms, three default YAML configurations (low, medium, high) for the two model types (for now ocean and atmosphere, as the layout structures differ). Perhaps we could also support a custom option for users to pick if they prefer their own configuration.
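
For concreteness, a purely illustrative sketch (written as a Python dict) of what such per-platform defaults could look like; the tier names, model components, and node counts are placeholders for discussion, not agreed values:

# Hypothetical per-platform SLURM defaults, keyed by model component and tier.
# All numbers below are placeholders, not proposed settings.
nccs_discover_slurm_defaults = {
    "geos_atmosphere": {
        "low":    {"nodes": 1},   # CI-sized cases
        "medium": {"nodes": 3},
        "high":   {"nodes": 12},  # operational/reanalysis-sized cases
    },
    "geos_ocean": {
        "low":    {"nodes": 1},
        "medium": {"nodes": 2},
        "high":   {"nodes": 6},
    },
}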

@ashiklom, to respond to your other issue comment: `[[directives]]` is what we are using in the flow.cylc files, please see:

https://github.com/GEOS-ESM/swell/blob/cebcb49b4008485bff0456ec346692bc9cea6367/src/swell/suites/3dfgat_atmos/flow.cylc#L154-L188

Is there a consideration to move away from CYLC and/or extend support for another workflow engine?

ashiklom commented 6 months ago

I propose the following hierarchy (items at the top of this list override items below them):

  1. SLURM flags entered directly through questionary (requires a new set of questionary questions about SLURM)
  2. SLURM flags passed via environment variables (new feature)
  3. SLURM flags passed via a local .yaml (or something) config file, e.g., in the current directory (new feature)
  4. SLURM flags passed via the global user config file ~/.swell/swell-slurm.yaml (already exists)
  5. Hard-coded fallback defaults (already exists)

I think this is a common pattern for this kind of thing, and the implementation is straightforward. If we store all of these as shallow Python dicts (no more than one level deep), we can just merge them in reverse order and the overrides will be applied as expected:

{**global_defaults, **user_config, **local_config, **env_var_config, **questionary_config}

If the dicts are deep, this is a bit more complicated, but not that hard.
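
For the deep case, a recursive merge along the same lines would work. A minimal sketch (not SWELL's actual code; the config names are reused from the shallow example above purely for illustration):

def deep_merge(base, override):
    # Recursively merge `override` into `base`; keys in `override` win.
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)  # recurse into nested sections
        else:
            merged[key] = value  # scalars (or type mismatches): override wins
    return merged

# Illustrative inputs, lowest to highest priority:
global_defaults    = {"slurm": {"account": "g0613", "nodes": 1}}
user_config        = {"slurm": {"account": "s1234"}}
local_config       = {}
env_var_config     = {}
questionary_config = {"slurm": {"nodes": 4}}

config = global_defaults
for source in (user_config, local_config, env_var_config, questionary_config):
    config = deep_merge(config, source)
# config == {"slurm": {"account": "s1234", "nodes": 4}}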

ashiklom commented 6 months ago

@Dooruk My proposal would be to refactor this. Instead of doing everything via manual jinja templating in the .cylc files, I propose we leave the entire [[directives]] block as one large template that gets populated at runtime by SWELL. Then, we can migrate the SLURM logic into the SWELL Python code.

In the above example, something like:

 [[RunJediVariationalExecutable-{{model_component}}]] 
     script = "swell task RunJediVariationalExecutable $config -d $datetime -m {{model_component}}" 
     platform = {{platform}} 
     execution time limit = {{scheduling["RunJediVariationalExecutable"]["execution_time_limit"]}} 
     [[[directives]]]
     {% for key, value in scheduling["RunJediVariationalExecutable"]["slurm_directives"].items() %}
        --{{ key }}={{ value }}
     {% endfor %}

...such that:

scheduling["RunJediVariationalExecutable"]["slurm_directives"] = {
  "account": "s1234",
  "job-name": "myjob",
  #etc...
}

...which would ultimately expand to:

  [[[directives]]]
    --account=s1234
    --job-name=myjob
    # etc...
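
As a rough sketch of the runtime population step (assuming Jinja2 is used for the rendering; the names here are illustrative, not the actual SWELL code):

from jinja2 import Template

# Template fragment for the [[[directives]]] block, expanded when the suite is prepared.
directives_template = Template(
    "[[[directives]]]\n"
    "{% for key, value in directives.items() %}"
    "    --{{ key }}={{ value }}\n"
    "{% endfor %}"
)

slurm_directives = {"account": "s1234", "job-name": "myjob"}  # placeholder values
print(directives_template.render(directives=slurm_directives))

Running this prints the same expanded block shown above.
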
jardizzo commented 6 months ago

I think SWELL already uses a recursive dictionary merge, so implementing a hierarchical SLURM directives paradigm should be easy. It might be worth asking Dan why he did not want to do this; it was certainly discussed. I proposed years ago that YAML could actually be used as a workflow language and then translated into the workflow engine of choice, but it was overhead we didn't need at the time. EWOK (the first attempt at solving the SWELL problem) was designed using ECFLOW. That might be the most likely competitor among future workflow engines.

Dooruk commented 6 months ago

> I propose the following hierarchy (items at the top of this list override items below them):
>
> 1. SLURM flags entered directly through questionary (requires a **new** set of questionary questions about SLURM)
> 2. SLURM flags passed via environment variables (**new** feature)
> 3. SLURM flags passed via a _local_ `.yaml` (or something) config file, e.g., in the current directory (**new** feature)
> 4. SLURM flags passed via the _global user config file_ `~/.swell/swell-slurm.yaml` (already exists)
> 5. Hard-coded fallback defaults (already exists)

I'm always a bit iffy about using environment variables, but overall the hierarchy sounds good.

> @Dooruk My proposal would be to refactor this.

This sounds great. Happy to guide/help through the implementation of these features!

ashiklom commented 6 months ago

Now that I've had a chance to wrap my head around the SWELL codebase and workflow a bit more, I have a better sense of how this could work.

I propose the following here:

  1. Task-specific SLURM flags from questionary or overrides YAML, which override...
  2. Global SLURM flags from questionary or overrides YAML, which override...
  3. SLURM flags set in global user config file (~/.swell/swell-slurm.yaml), which override...
  4. Hard-coded defaults in the SWELL source code.

I can take a first pass at this.

For (1) in the list above, we could define this in two different ways:

  1. By SWELL task ID (e.g., BuildGEOS, RunJediHofxExecutable-geos_atmosphere). This maximizes flexibility, but is more brittle (if SWELL task names and roles evolve over time) and can be more of a pain to type.
slurm_directives_globals:
  account: g0613
  qos: allnccs
  nodes: 1
slurm_directives_tasks:
  BuildGEOS:
    nodes: 2
    ntasks_per_node: 4
  RunJediHofxExecutable-geos_atmosphere:
    nodes: 8
    ntasks_per_node: 8
  <...>

...or

  2. By model component (or some other SWELL-specific abstraction) as defined in the experiment.yaml (e.g., geos_atmosphere, geos_ocean). This is a bit more opaque but may be a bit more future-proof as SWELL task IDs evolve.
slurm_directives_globals:
  account: g0613
  qos: allnccs
  nodes: 1
slurm_directives_models:
  geos_atmosphere:
    nodes: 8
    ntasks_per_node: 4
  geos_ocean:
    nodes: 4
    ntasks_per_node: 4

(or perhaps a more granular version of this?)

@rtodling What do you think? Could you give some examples of SLURM customizations from your existing SWELL (or other GEOS) workflows? That will help me understand how much flexibility to bake into the system and where to bake it in.

rtodling commented 6 months ago

Ah! I like this. This is nice and should work. I also like that the account number can be accommodated in this. That is another thing that has been on my mind: swell should provide a flexible way for users to specify different account numbers for different experiments. The way it is laid out above, the account number specification can be as refined as individual tasks within a suite; in my mind it does not have to be that refined, but it is important to allow different suites to use different account numbers. I have work in the ADAS, which uses a given account number, and I have work in different proposals, each with its specific account number. We would want to allow flexibility for that. Other users are no different from me in the number of account numbers they have associated with different efforts.

But what's presented and discussed here is very nice.

ashiklom commented 6 months ago

Thanks Ricardo!

> allow for different suites to use different account numbers

This already works. For example:

slurm_directives_globals:
  account: g0613  # <-- Default value for tasks if not specified
  qos: allnccs
  nodes: 1
slurm_directives_tasks:
  BuildGEOS:
    # This will use account: g0613 because it's inherited from the globals
    nodes: 2
    ntasks_per_node: 4
  RunGEOSExecutable:
    account: s1414  # This will OVERRIDE the default
    nodes: 8
    ntasks_per_node: 8
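
For illustration, a minimal Python sketch of that inheritance/override behavior (not the actual code; see the commit linked below for the real implementation):

slurm_directives_globals = {"account": "g0613", "qos": "allnccs", "nodes": 1}
slurm_directives_tasks = {
    "BuildGEOS": {"nodes": 2, "ntasks_per_node": 4},
    "RunGEOSExecutable": {"account": "s1414", "nodes": 8, "ntasks_per_node": 8},
}

def resolve_directives(task):
    # Start from the globals; any task-specific keys override them.
    return {**slurm_directives_globals, **slurm_directives_tasks.get(task, {})}

assert resolve_directives("BuildGEOS")["account"] == "g0613"          # inherited
assert resolve_directives("RunGEOSExecutable")["account"] == "s1414"  # overridden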

You can take a look at the implementation here: https://github.com/ashiklom/swell/commit/c5d45dfc9593e2f7648de579b83df02a2f7b4087

The outstanding question is: how do we handle the interaction between tasks (e.g., RunGEOSExecutable) and models (e.g., geos_atmosphere, geos_ocean)? Right now, even in my initial implementation, every RunGEOSExecutable will use the same SLURM config; there's no way to pass different SLURM configs to RunGEOSExecutable-geos_ocean and RunGEOSExecutable-geos_atmosphere. Maybe this isn't actually a problem, but my guess is that geos_atmosphere and geos_ocean may want/need different SLURM configurations, in which case this will need to be solved.

Let's discuss at the SWELL tag-up this afternoon.