GEOS-ESM / swell

Workflow system for coupled data assimilation applications
https://geos-esm.github.io/swell/
Apache License 2.0

(SE) SLES15 optimization for Github Actions #453

Open Dooruk opened 3 weeks ago

Dooruk commented 3 weeks ago

Tier 1 finished successfully but took too long: roughly 1 hour for almost every task (you can compare with previous Tier 1 runs). I had to make a few Cylc-related changes in the gmao_ci account, to the ~/bin/cylc and ~/cylc/global-workflow.yaml files. These are one-time changes. The ~/bin/cylc file now chooses the correct Cylc installation depending on the OS, and ~/bin is added to $PATH.
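For anyone who needs to reproduce the OS-dependent dispatch, here is a minimal Python sketch of the idea. It is not the actual contents of ~/bin/cylc: the install paths, the `CYLC_BY_OS` table, and the function names are all hypothetical, and it assumes the OS can be identified from `VERSION_ID` in /etc/os-release.

```python
import os
import re

# Hypothetical install locations; the real paths on gmao_ci are not shown in this issue.
CYLC_BY_OS = {
    "12": "/path/to/sles12/cylc",
    "15": "/path/to/sles15/cylc",
}


def sles_major_version(os_release_text):
    """Return the major VERSION_ID found in /etc/os-release content, or None."""
    match = re.search(r'^VERSION_ID="?(\d+)', os_release_text, flags=re.MULTILINE)
    return match.group(1) if match else None


def pick_cylc(os_release_text):
    """Map the detected OS major version to a Cylc executable path."""
    major = sles_major_version(os_release_text)
    if major not in CYLC_BY_OS:
        raise RuntimeError(f"No Cylc installation configured for OS version {major!r}")
    return CYLC_BY_OS[major]


def run_cylc(argv):
    """Replace the current process with the OS-appropriate Cylc, forwarding arguments."""
    with open("/etc/os-release") as handle:
        cylc = pick_cylc(handle.read())
    os.execv(cylc, [cylc] + list(argv))
```

A wrapper placed in ~/bin would call `run_cylc(sys.argv[1:])`, so every `cylc ...` invocation resolves to the right installation as long as ~/bin comes first in $PATH.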

See output here:

https://github.com/GEOS-ESM/swell/actions/runs/11633945304

Here are the steps I took to modify test_swell.yml so that the Test CI Applications Action can run:

1) Update CI-Workflows: modify the file GEOS-ESM/CI-workflows/.github/workflows/test_swell.yml in the feature/test_swell_application branch.

2) In Swell, go to Actions -> Test CI Applications and run it against any Swell branch (say, when testing different SLURM configs). Test CI only runs one particular CI-Workflows branch, which is set at the line linked below:

https://github.com/GEOS-ESM/swell/blob/7812c4114de3c587a026cc8c7453ad3e6b7e4528/.github/workflows/test_ci_application_discover.yml#L12

To run this, you need to be listed in @jardizzo's nams_check.py file.

The slowdown is caused partly by these two lines for variational tasks, since there are now 126 cores available on Milan nodes (I personally request 100 ntasks-per-node):

https://github.com/GEOS-ESM/swell/blob/7812c4114de3c587a026cc8c7453ad3e6b7e4528/src/swell/utilities/slurm.py#L49-L50

However, I'm not sure why the hofx suite would be slow, so there must be a combination of the Discover filesystem being slow and the Swell SLES15 SLURM settings.

Default ntasks-per-node is defined here for different platforms:

https://github.com/GEOS-ESM/swell/blob/develop/src/swell/deployment/platforms/nccs_discover_sles15/slurm.yaml
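To make the node/ntasks trade-off concrete, here is a small arithmetic sketch. It does not reproduce the linked slurm.py lines or the contents of slurm.yaml. The total task count of 600 and the per-node value of 24 are made-up numbers for illustration; only the 126 cores per Milan node and the 100 ntasks-per-node figure come from this issue.

```python
CORES_PER_MILAN_NODE = 126  # figure quoted in this issue


def nodes_needed(total_tasks, ntasks_per_node, cores_per_node=CORES_PER_MILAN_NODE):
    """Number of nodes required to place total_tasks at a given packing density."""
    if not 0 < ntasks_per_node <= cores_per_node:
        raise ValueError("ntasks_per_node must be between 1 and cores_per_node")
    # Integer ceiling division: round up so that no task is left without a slot.
    return (total_tasks + ntasks_per_node - 1) // ntasks_per_node


# Hypothetical job size, used only to compare packing densities.
TOTAL_TASKS = 600
for per_node in (24, 100, 126):
    print(f"ntasks-per-node={per_node:3d} -> nodes={nodes_needed(TOTAL_TASKS, per_node)}")
```

With these illustrative numbers, a low per-node value spreads the same job over many more nodes than packing 100 tasks onto each Milan node, which is the kind of mismatch worth checking in the platform defaults.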

@rtodling and I can help with the proper node/ntasks combination, but could the filesystem issue be resolved with a $TSE_TMPDIR implementation?
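As a rough sketch of what a $TSE_TMPDIR implementation could look like, the snippet below picks a scratch location for task working directories. It assumes $TSE_TMPDIR is set in the job environment and points to a writable directory, and falls back to the system temp directory otherwise. The function names are hypothetical and nothing here reflects how Swell currently chooses its working directories.

```python
import os
import tempfile


def scratch_root(env=None):
    """Return $TSE_TMPDIR if it is a usable directory, else the system temp dir."""
    env = os.environ if env is None else env
    candidate = env.get("TSE_TMPDIR")
    if candidate and os.path.isdir(candidate) and os.access(candidate, os.W_OK):
        return candidate
    return tempfile.gettempdir()


def make_task_workdir(task_name, env=None):
    """Create a unique working directory for one task under the scratch root."""
    return tempfile.mkdtemp(prefix=f"{task_name}_", dir=scratch_root(env))
```

Anything a task writes there would still need to be copied back to the experiment directory before the job ends, so this only helps for I/O-heavy intermediate files.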

Dooruk commented 2 weeks ago

This is a bit urgent as our Tier 1 or 2 tests won't work since we updated runners to SLES15.