E3SM-Project / polaris

Testing and analysis for OMEGA, MPAS-Ocean, MALI and MPAS-Seaice
BSD 3-Clause "New" or "Revised" License
6 stars 13 forks source link

Check available parallel resources only once #76

Closed altheaden closed 1 year ago

altheaden commented 1 year ago

This is a port from changes made in Compass: https://github.com/MPAS-Dev/compass/pull/538

This merge changes run.serial and parallel so we only poll for slrum resources once and use that information for all test cases and steps.

This merge also removes the error message when you run on a login node. Instead, a login node is allocated a small number of cores (4 by default) and isn't allowed to run MPI jobs (it only supports one task).

To accommodate this, the Step.constrain_resources() method takes a dictionary of available resources instead of just the number of available cores.

Checklist

xylar commented 1 year ago

@xylar will add API docs.

xylar commented 1 year ago

Testing

I ran the cosine_bell test case with this branch on Chrysalis, using main as a baseline and all tests passed. I ran the cosine_bell/Icos60/viz step from a login node and it was successful. I ran cosine_bell/Icos60/forward on a login node and it failed as expected:

ValueError: You are trying to run an MPI job on a login node.
Please switch to a compute node.
xylar commented 1 year ago

@altheaden, this just needs a rebase and some small conflicts fixed (between this and #75), and then I think it will be ready to merge.

altheaden commented 1 year ago

@xylar I successfully rebased this branch to main, and fixed the noqa comment. Please double check how I handled the available_resources dictionary, I believe this is the way we want it to be set up but I want to make sure. Thanks!

xylar commented 1 year ago

@altheaden, there's a bad string (one of those VS Code suggestions from yesterday) here: https://github.com/E3SM-Project/polaris/blob/main/polaris/run/serial.py#L113-L114

Can you fix that in this PR even though it's unrelated?

altheaden commented 1 year ago

@xylar Fixed!

xylar commented 1 year ago

Hmm, now we're missing a space:

00:08PASS ocean/baroclinic_channel/10km/threads_test
00:04PASS ocean/baroclinic_channel/10km/decomp_test
00:04PASS ocean/baroclinic_channel/10km/restart_test
00:02PASS ocean/single_column/960km/cvmix
xylar commented 1 year ago

@altheaden, we're almost there...

altheaden commented 1 year ago

@xylar Sorry to keep making you test!!! Should be good now.... fingers crossed.

xylar commented 1 year ago

Looks great this time, thanks!