NorESMhub / NorESM

Norwegian Earth System Model and Documentation
https://noresm-docs.readthedocs.io/en/latest/

Betzy problem when running on 2 nodes with noresm2.5 #594

Open mvertens opened 1 day ago

mvertens commented 1 day ago

I have now encountered the same issue when running I compsets and F compsets on 2 nodes. Errors like the following appear in the cesm.log file:

134: [b3296.betzy.sigma2.no:1104993] pml_ucx.c:911 Error: mca_pml_ucx_send_nbr failed: -25, Connection reset by remote peer
134: [b3296:1104993] An error occurred in MPI_Send
134: [b3296:1104993] reported by process [23299297509376,134]
134: [b3296:1104993] on communicator MPI COMMUNICATOR 21 CREATE FROM 20
134: [b3296:1104993] MPI_ERR_OTHER: known error not in list
134: [b3296:1104993] MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
134: [b3296:1104993] and potentially your MPI job)

The solution seems to be to increase the number of nodes to 4 - and then everything works. I am writing to sigma2 to raise this issue as well.
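
For anyone wanting to check which ranks and hosts report this failure, here is a minimal Python sketch that scans log lines for the `pml_ucx.c` error shown above. The function names and the regular expression are illustrative assumptions based only on the log format quoted in this comment; they are not part of NorESM or CIME.

```python
import re
from collections import Counter

# Assumed line format, taken from the log excerpt in this issue:
#   <rank>: [<host>:<pid>] pml_ucx.c:<line> Error: <message>
UCX_ERROR = re.compile(
    r"^\s*(?P<rank>\d+):\s*\[(?P<host>[^\]:]+):(?P<pid>\d+)\]\s+"
    r"pml_ucx\.c:\d+\s+Error:\s+(?P<msg>.*)$"
)


def find_ucx_errors(lines):
    """Return a list of (rank, host, message) for every pml_ucx error line."""
    hits = []
    for line in lines:
        match = UCX_ERROR.match(line)
        if match:
            hits.append(
                (int(match.group("rank")), match.group("host"), match.group("msg").strip())
            )
    return hits


def errors_per_host(hits):
    """Count how many pml_ucx errors each host reported."""
    return Counter(host for _rank, host, _msg in hits)


def scan_log(path):
    """Scan a log file (hypothetical path supplied by the caller) for pml_ucx errors."""
    with open(path, "r", errors="replace") as handle:
        return find_ucx_errors(handle)
```

Calling `errors_per_host(scan_log(path))` on the cesm.log file of a failed run would show whether the errors cluster on one of the two nodes, which may be useful information for the sigma2 ticket.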

TomasTorsvik commented 13 hours ago

@mvertens - Hi, Betzy has a minimum node number of 4 for jobs on the "normal" queue. "devel" jobs can run on 1-4 nodes for a short time (up to 60 min). It seems this requirement is there to encourage moving smaller jobs to Fram, so that these do not fill up the queue on Betzy.

See job types description here: https://documentation.sigma2.no/jobs/job_types/betzy_job_types.html

gold2718 commented 12 hours ago

The NorESM configuration for Betzy currently only sends jobs with 4 or more nodes to the normal queue. The devel queue is marked with a minimum of 1 and a maximum of 4 nodes and the preproc queue has no restrictions. See <ccs_config>/machines/betzy/config_batch.xml.

@mvertens, which queue was used for your job? Even if it was devel (which I think should have worked), I think we should restrict preproc to 1 node, as that is the Betzy limit.
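
The node limits described in this comment can be sketched in Python as follows. This is only an illustration of the limits as stated in this thread (normal: 4 or more nodes, devel: 1 to 4 nodes, preproc: currently unrestricted). It is not the logic CIME uses to pick a queue, it ignores the devel walltime limit, and the `Queue` class and `eligible_queues` function are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Queue:
    """A batch queue with optional node-count limits (None means no limit)."""
    name: str
    min_nodes: Optional[int] = None
    max_nodes: Optional[int] = None

    def accepts(self, nodes: int) -> bool:
        if self.min_nodes is not None and nodes < self.min_nodes:
            return False
        if self.max_nodes is not None and nodes > self.max_nodes:
            return False
        return True


# Limits as described in this thread for the current NorESM Betzy configuration.
BETZY_QUEUES = [
    Queue("normal", min_nodes=4),
    Queue("devel", min_nodes=1, max_nodes=4),
    Queue("preproc"),  # no restrictions at present
]

# Variant with the restriction proposed above: preproc limited to 1 node.
BETZY_QUEUES_PROPOSED = [
    Queue("normal", min_nodes=4),
    Queue("devel", min_nodes=1, max_nodes=4),
    Queue("preproc", max_nodes=1),
]


def eligible_queues(nodes, queues=BETZY_QUEUES):
    """Return the names of all queues whose node limits allow this job size."""
    return [queue.name for queue in queues if queue.accepts(nodes)]
```

Under these assumptions, a 2-node job is eligible for devel and preproc with the current limits, and only for devel once preproc is restricted to 1 node.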

JensBDebernard commented 11 hours ago

Thanks Mariana. I have experienced the same error with the MakingWave code during the last week. wave-ocean-ice with data atmos was working fine on 2 nodes (devel queue) from the start of November. But suddenly, around November 13-15th, something changed, so that none of these compsets (or perturbations thereof) are running with the same pe-layout. I will try increasing the number of nodes, although it is more costly in the debug phase.

mvertens commented 11 hours ago

@TomasTorsvik @gold2718 @JensBDebernard - I have double-checked and the queue is devel. This also worked for me up until around the 15th and then suddenly stopped working. I have raised an issue with sigma2.

gold2718 commented 11 hours ago

Should we take the hint and set up a test suite of smaller tests on Fram? It would just mean firing off and then checking two test runs instead of one.

mvertens commented 11 hours ago

I got a response from sigma2 that they have escalated this ticket to their second line support, and they'll follow up shortly.

TomasTorsvik commented 11 hours ago

Our quota on Fram is quite limited, only 150K CPU hours on nn2345k. We could ask for an increased quota, but it would probably be "non-prioritized" for the current allocation period.