This is a bit urgent, as our Tier 1 and Tier 2 tests won't work since we updated the runners to SLES15.
The time between submitting a job on a head node and its start is much longer than with the previous SLES12 runner setup. To give an example, in this action run for `3dfgat_atmos`, the `20211212T0000Z/GetObservations-geos_atmosphere` task takes 6 minutes from submission to start running. Is this because the runners can't handle the number of requested tasks? Compute node tasks don't seem to have this issue.
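
If it helps with debugging, the submit-to-start gap can be pulled straight from SLURM accounting. Below is a minimal sketch (not swell code; the job ID is a placeholder) that computes the queue wait for a given job via `sacct`:

```python
# Minimal sketch for measuring the submit-to-start gap of a SLURM job.
# Assumes sacct is on PATH (as on the Discover head nodes); the job ID used
# in the example is a placeholder.
import subprocess
from datetime import datetime

def submit_to_start_seconds(job_id: str) -> float:
    """Return the queue wait (Start - Submit) for a SLURM job, in seconds."""
    line = subprocess.run(
        ["sacct", "-j", job_id, "-X", "--noheader", "--parsable2",
         "--format=Submit,Start"],
        capture_output=True, text=True, check=True,
    ).stdout.strip().splitlines()[0]
    submit, start = (datetime.strptime(t, "%Y-%m-%dT%H:%M:%S")
                     for t in line.split("|"))
    return (start - submit).total_seconds()

# Example with a hypothetical job ID:
# print(submit_to_start_seconds("12345678"))
```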
`3dfgat_atmos` wasn't working until I made this small fix to use more `ntasks-per-node` on Milan: https://github.com/GEOS-ESM/swell/compare/develop...feature/tasks_per_node
`3dvar` still fails. I have no idea why, but it always fails on the same task, `GenerateBClimatology`, for the same reason: it claims files are missing, even though they do exist. This task requires compute nodes. The exact same setup works when I submit it locally, and the `3dvar` suite runs on a 5-degree setup, so the compute requirements are minimal. I can live with it for now while we update everything else, including the build.
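
In case the "missing" files turn out to be a filesystem-visibility problem rather than a real swell bug, a poll-and-retry check along these lines could help tell the two apart (generic sketch, not swell code; the path, timeout, and poll interval are made up):

```python
# Generic sketch: poll for a file that a task claims is missing, to see whether
# it simply appears late (filesystem lag) or is genuinely absent.
import os
import time

def wait_for_file(path: str, timeout_s: float = 120.0, poll_s: float = 5.0) -> bool:
    """Poll until `path` exists and is non-empty, or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if os.path.exists(path) and os.path.getsize(path) > 0:
            return True
        time.sleep(poll_s)
    return False

# Example with a hypothetical background-error file:
# found = wait_for_file("/path/to/expected/bump_file.nc", timeout_s=300)
```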
~~@jardizzo, could you update `swell-tier1_application_discover.yml` in `develop` with `.github/workflows/test_swell.yml` from the `feature/test_swell_application` branch? That's what I've been testing with.~~ This is updated now.
Tier 1 finished successfully but took too long, ~1 hour for almost every task (you can compare with previous Tier 1 runs). I had to make a few Cylc-related changes in the `gmao_ci` account, to the `~/bin/cylc` and `~/cylc/global-workflow.yaml` files; these are one-time changes. One of them was the `~/bin/cylc` file, which chooses the correct Cylc installation depending on the OS (`~/bin` is added to `$PATH`). See the output here:
https://github.com/GEOS-ESM/swell/actions/runs/11633945304
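
For reference, the `~/bin/cylc` dispatch logic is roughly of this shape (a sketch only; the install paths and version mapping are placeholders, not the actual contents of the file in the `gmao_ci` account):

```python
#!/usr/bin/env python3
# Sketch of an OS-aware cylc dispatcher of the kind described above: pick the
# Cylc installation matching the host OS and exec it with the original args.
# The install locations are placeholders, not the real gmao_ci paths.
import os
import sys

CYLC_BY_OS_VERSION = {
    "12": "/path/to/sles12/cylc/bin/cylc",  # hypothetical SLES12 install
    "15": "/path/to/sles15/cylc/bin/cylc",  # hypothetical SLES15 install
}

def os_version_id() -> str:
    """Read the major VERSION_ID from /etc/os-release (e.g. '12' or '15')."""
    with open("/etc/os-release") as f:
        for line in f:
            if line.startswith("VERSION_ID="):
                return line.split("=", 1)[1].strip().strip('"').split(".")[0]
    return ""

cylc = CYLC_BY_OS_VERSION.get(os_version_id())
if cylc is None:
    sys.exit("No Cylc installation configured for this OS version")
os.execv(cylc, [cylc] + sys.argv[1:])
```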
Here are the steps I took to modify `test_swell.yml` to be able to run the Test CI Applications action:

1) Update CI-Workflows: modify the following file in the `feature/test_swell_application` branch: `GEOS-ESM/CI-workflows/.github/workflows/test_swell.yml`
2) In Swell -> Actions -> Test CI Applications, run any Swell branch (say we are testing different SLURM configs). Test CI only runs the particular CI-Workflows branch linked below:
https://github.com/GEOS-ESM/swell/blob/7812c4114de3c587a026cc8c7453ad3e6b7e4528/.github/workflows/test_ci_application_discover.yml#L12
To run this, you need to be in @jardizzo's `nams_check.py` file.

The slowdown is partly caused by these two lines for variational tasks, since Milan nodes now have 126 cores available (I personally request 100 `ntasks-per-node`):
https://github.com/GEOS-ESM/swell/blob/7812c4114de3c587a026cc8c7453ad3e6b7e4528/src/swell/utilities/slurm.py#L49-L50
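
To make the effect concrete: for a fixed total task count, the node request grows as `ntasks-per-node` shrinks, and larger node requests generally wait longer in the queue. A back-of-the-envelope illustration (plain arithmetic, not swell code; the task count is an example, not the value hard-coded at the lines above):

```python
# Illustration of how ntasks-per-node drives the size of the node request
# (and hence the likely queue wait). The ntasks value is just an example.
import math

def nodes_needed(ntasks: int, ntasks_per_node: int) -> int:
    return math.ceil(ntasks / ntasks_per_node)

ntasks = 600  # example variational task count
for per_node in (40, 100, 126):
    print(f"ntasks-per-node={per_node:3d} -> {nodes_needed(ntasks, per_node)} nodes")
# ntasks-per-node= 40 -> 15 nodes
# ntasks-per-node=100 -> 6 nodes
# ntasks-per-node=126 -> 5 nodes
```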
However, I'm not sure why the hofx suite would also be slow, so it must be a combination of the Discover filesystem being slow and the Swell SLES15 SLURM settings.
The default `ntasks-per-node` is defined here for the different platforms:
https://github.com/GEOS-ESM/swell/blob/develop/src/swell/deployment/platforms/nccs_discover_sles15/slurm.yaml
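
If it helps while tuning, the pattern I have in mind is just "platform defaults merged with a per-experiment override", e.g. (generic sketch; the keys below are illustrative, not the actual `slurm.yaml` schema):

```python
# Generic sketch of merging platform SLURM defaults with a per-task override.
# The key names and values are illustrative; the real slurm.yaml may differ.
import yaml  # PyYAML

platform_defaults = yaml.safe_load("""
ntasks-per-node: 126
constraint: mil
""")

user_overrides = {"ntasks-per-node": 100}  # e.g. what I request on Milan

directives = {**platform_defaults, **user_overrides}
print(directives)  # {'ntasks-per-node': 100, 'constraint': 'mil'}
```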
@rtodling and I can help with the proper node/ntasks combination, but could the filesystem part be addressed with a `$TSE_TMPDIR` implementation?
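
On the filesystem side, what I have in mind is the usual stage-to-scratch pattern, roughly like the sketch below (an assumption-heavy sketch: it presumes `$TSE_TMPDIR` points at fast scratch space and falls back to `$TMPDIR` otherwise; paths are hypothetical):

```python
# Sketch of staging task inputs to $TSE_TMPDIR-style scratch space before
# running, instead of reading them repeatedly from the shared filesystem.
# The environment-variable handling and example paths are assumptions.
import os
import shutil
import tempfile

def scratch_dir() -> str:
    """Prefer $TSE_TMPDIR, then $TMPDIR, then a temporary directory."""
    return os.environ.get("TSE_TMPDIR") or os.environ.get("TMPDIR") or tempfile.mkdtemp()

def stage_in(src_files, subdir="swell_stage"):
    """Copy input files to scratch and return the staged paths."""
    workdir = os.path.join(scratch_dir(), subdir)
    os.makedirs(workdir, exist_ok=True)
    return [shutil.copy2(f, workdir) for f in src_files]

# Example with a hypothetical observation file:
# staged = stage_in(["/path/to/obs/aircraft.20211212T0000Z.nc4"])
```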