COSIMA / access-om2

Deprecated ACCESS-OM2 global ocean - sea ice coupled model code and configurations.

access-om2-01 hangs unpredictably - would also affect access-om2-025 #77

Closed: aekiss closed this issue 5 years ago

aekiss commented 6 years ago

access-om2-01 hangs unpredictably for about a third of the submissions, burning 28 kSU for nothing. I expect this problem to also affect access-om2-025, since it also relies on MXM.

There's been extensive email discussion with @benmenadue, @nicjhan, @marshallward under subject "HELP-8799 MPI library problem? MOM not timestepping" (https://track.nci.org.au/servicedesk/customer/portal/5/HELP-8799) and Slack https://arccss.slack.com/archives/C08KM5VRA/p1516305480000282 but I want to organise key findings here.

Symptoms: no timestepping (e.g. no output to work/ice/OUTPUT, which should get daily files). The run eventually times out.

The problem arose on 2018-01-17 when Ben set the MXM logfile output to /dev/null for performance reasons. Working hypothesis: MXM fsyncs its log target every 50 milliseconds by default, and this fails because MXM_LOG_FILE points to a non-fsync-able file (/dev/null), causing it to abort processing of the asynchronous callbacks.

Workaround: use -x MXM_LOG_FILE=$PBS_JOBFS/mxm.log in the mpirun flags in config.yaml, and submit with /projects/v45/apps/payu/aek, which was modified to do shell substitution in the mpirun commands. This works most of the time, but we still get unpredictable hangs.

This seems to be an OASIS problem. See Ben's email of 2018-01-25 re. /short/v45/aek156/access-om2/control/01deg_jra55_ryf/work-3333791:

There are 250 ranks stuck on line 123 of mod_oasis_method.F90:

call oasis_mpi_barrier(mpi_comm_global)
yet at least 5160 ranks have made it past that point and are stuck at line 231 of the same file:

call MPI_COMM_SPLIT(MPI_COMM_WORLD,icolor,ikey,mpi_comm_local,mpi_err)
That should not have been able to happen, unless mpi_comm_global /= MPI_COMM_WORLD.
benmenadue commented 6 years ago

Using -x appears unreliable when that environment variable is already set in the parent environment (this is likely a bug in OpenMPI -- I'll follow up on this when I'm back at work next week).

Instead, the workaround is to unset that environment variable between loading the openmpi module and calling mpirun (and to make sure to load the correct openmpi module yourself, instead of relying on the NCI-specific mpirun wrapper to do a module swap to the correct version and thereby re-set that environment variable).

This will cause it to hit stdout with fsync again, but AFAIK you don't redirect that to a file on a per-process basis (you just redirect at the mpirun level), so it shouldn't be a problem for Lustre.

We're also working with Mellanox on why MXM is doing this in the first place.

nichannah commented 6 years ago

One small thought: it appears that the current round of problems consists of crashes, not hangs. There is an error message here:

longjmp causes uninitialized stack frame : /short/v45/aek156/access-om2/bin/fms_ACCESS-OM_2d76b70c.x terminated

However, if it is hanging after this then that may be a separate problem we can fix to prevent wasting a lot of SUs when the model crashes. For example, a pattern that's used in fault-tolerant systems is to have a separate monitor process that checks that things are progressing.
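For concreteness, a minimal, untested sketch of that pattern, assuming the watched rank sends an empty "done" message when it finishes each segment (the subroutine name, tag and communicator argument here are made up, not existing ACCESS-OM2 code):

```fortran
! Illustrative monitor-side loop (not existing code): wait for a
! "segment finished" message and abort the whole job if it doesn't
! arrive within the allowed time.
subroutine watch_segment(comm, source, time_limit)
    use mpi
    implicit none
    integer, intent(in) :: comm          ! communicator shared with the watched rank
    integer, intent(in) :: source        ! rank of the watched process in comm
    double precision, intent(in) :: time_limit  ! seconds allowed for this segment

    integer, parameter :: done_tag = 99  ! arbitrary tag for the "done" message
    integer :: request, ierr, dummy
    integer :: status(MPI_STATUS_SIZE)
    logical :: done
    double precision :: t_start

    ! Post a non-blocking receive for the "segment finished" message.
    call MPI_Irecv(dummy, 1, MPI_INTEGER, source, done_tag, comm, request, ierr)

    t_start = MPI_Wtime()
    done = .false.
    do while (.not. done)
        call MPI_Test(request, done, status, ierr)
        if (.not. done .and. MPI_Wtime() - t_start > time_limit) then
            ! No progress within the time limit: assume a hang and kill the
            ! job rather than burning the rest of the walltime.
            call MPI_Abort(MPI_COMM_WORLD, 1, ierr)
        end if
        ! (a short sleep here would avoid busy-waiting; omitted for brevity)
    end do
end subroutine watch_segment
```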

aidanheerdegen commented 6 years ago

Well spotted @nicjhan. We were just discussing at lunch what pathology we might use to detect these issues, which would allow the mpirun to be killed and re-run. With enough wall time padding this might not even require a PBS resubmission.

aidanheerdegen commented 6 years ago

@marshallward was talking about implementing a monitor process in payu for other use cases, so he may well have some code sloshing about already.

marshallward commented 6 years ago

Just for catching SIGTERMs, not necessarily auto-resubmission, but the issue is marshallward/payu#94

russfiedler commented 6 years ago

Something like this? https://dl.acm.org/citation.cfm?id=3126938

russfiedler commented 6 years ago

Hmm... Some things in the paper indicate that it may not be suitable for us, since we have load-balancing issues at certain times. I'm not sure if the software is freely available anyway. Maybe MATM could spawn a slave process that it communicates with occasionally. You could build up a profile of expected waiting times (or supply them) for each stage and issue an abort if they are exceeded by a large enough amount at any stage. This may be easier via payu, but I know nothing about payu...

aidanheerdegen commented 6 years ago

It's a good find @russfiedler, but it seemed to me that this is something that is built into the scheduler (Slurm and Torque in this case).

russfiedler commented 6 years ago

@aidanheerdegen Yes, I saw that, but wasn't sure how tied into the scheduler it was. I thought it may have just been the two platforms they tested on. Anyway, I reckon we could get away with just monitoring the progress of MATM (which requires no changes to MOM or CICE, using MPI-3), or maybe make a communicator consisting of the root PEs of each component. The latter requires changes to ocean_solo.F90 and who knows where in CICE...
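In case it helps the discussion, a minimal sketch of the second option (the subroutine name and the component_rank argument are made up; this isn't code from MOM, CICE or MATM):

```fortran
! Illustrative only: build a communicator containing just the root PE of
! each component, given each PE's rank within its own component.
subroutine make_root_pe_comm(component_rank, root_comm)
    use mpi
    implicit none
    integer, intent(in)  :: component_rank  ! rank of this PE within its component
    integer, intent(out) :: root_comm       ! communicator of component roots
                                            ! (MPI_COMM_NULL on non-root PEs)
    integer :: colour, world_rank, ierr

    call MPI_Comm_rank(MPI_COMM_WORLD, world_rank, ierr)

    if (component_rank == 0) then
        colour = 0              ! component roots all share one colour
    else
        colour = MPI_UNDEFINED  ! everyone else opts out of the split
    end if

    call MPI_Comm_split(MPI_COMM_WORLD, colour, world_rank, root_comm, ierr)
end subroutine make_root_pe_comm
```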

russfiedler commented 6 years ago

Had a little bit of free time and whipped up a crude monitor that calls mpi_abort if a segment of code takes too long. It just requires a few calls to be added to MATM to spawn the process and tell it how long it can spend there. The test case works fine; change the number of iterations in the main loop to force a failure or a clean exit. It can easily be hacked to just provide stats on min/max/mean waiting times. https://github.com/russfiedler/dodgy_monitor
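To complement the monitor-side sketch above, here is an illustrative set of MATM-side calls of the kind described (this is not the actual dodgy_monitor interface; the executable name, tags and message contents are placeholders): spawn the monitor once, then bracket each timed segment with a time-limit message and a completion message.

```fortran
! Illustrative MATM-side calls (not the dodgy_monitor API).
subroutine start_monitor(monitor_comm)
    use mpi
    implicit none
    integer, intent(out) :: monitor_comm
    integer :: ierr
    ! Spawn a single monitor process; monitor_comm is the intercommunicator
    ! connecting MATM to it.
    call MPI_Comm_spawn('./monitor.exe', MPI_ARGV_NULL, 1, MPI_INFO_NULL, 0, &
                        MPI_COMM_SELF, monitor_comm, MPI_ERRCODES_IGNORE, ierr)
end subroutine start_monitor

subroutine timed_segment_begin(monitor_comm, time_limit)
    use mpi
    implicit none
    integer, intent(in) :: monitor_comm
    double precision, intent(in) :: time_limit
    integer :: ierr
    ! Tell the monitor how long the next segment is allowed to take.
    call MPI_Send(time_limit, 1, MPI_DOUBLE_PRECISION, 0, 1, monitor_comm, ierr)
end subroutine timed_segment_begin

subroutine timed_segment_end(monitor_comm)
    use mpi
    implicit none
    integer, intent(in) :: monitor_comm
    integer :: ierr, dummy
    dummy = 0
    ! Tell the monitor the segment finished in time.
    call MPI_Send(dummy, 1, MPI_INTEGER, 0, 2, monitor_comm, ierr)
end subroutine timed_segment_end
```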

aidanheerdegen commented 6 years ago

Love the name of the repo @russfiedler

aekiss commented 5 years ago

The model no longer has this hanging problem. Do we want to pursue @russfiedler's dodgy_monitor idea, or should we just close this issue for now (and reopen if it happens again)?

russfiedler commented 5 years ago

Close it for the moment. As you say, we can revisit it later if need be, or implement it properly. I'm sure there has to be a better way than spawning a new child process off MATM. Maybe put some code here: https://github.com/COSIMA/libaccessom2/blob/master/atm/src/atm.F90#L139-L143

access-hive-bot commented 1 year ago

This issue has been mentioned on ACCESS Hive Community Forum. There might be relevant details there:

https://forum.access-hive.org.au/t/mom6-crashes-after-initialization/409/9