Closed (aekiss closed this issue 5 years ago)
Using -x appears unreliable when that environment variable is already set in the parent environment (this is likely a bug in OpenMPI; I'll follow up on this when I'm back at work next week). Instead, the workaround is to unset that environment variable between loading the openmpi module and calling mpirun, making sure to load the correct openmpi module explicitly rather than relying on the NCI-specific mpirun wrapper to do a module swap to the correct version (which would re-set that environment variable). This will cause MXM to fsync to stdout again, but AFAIK you don't redirect stdout to a file on a per-process basis (it's only redirected at the mpirun level), so it shouldn't be a problem for Lustre.
We're also working with Mellanox on why MXM is doing this in the first place.
One small thought: it appears that the current round of problems is crashes, not hangs. There is an error message here:
longjmp causes uninitialized stack frame : /short/v45/aek156/access-om2/bin/fms_ACCESS-OM_2d76b70c.x terminated
However, if it is hanging after this then that may be a separate problem we can fix to prevent wasting a lot of SUs when the model crashes. For example, a pattern that's used in fault-tolerant systems is to have a separate monitor process that checks that things are progressing.
Well spotted @nicjhan. We were just discussing at lunch what pathology we might use to detect these issues which would allow the mpirun to be killed and re-run. With enough wall time padding this might not even require a PBS resubmission.
@marshallward was talking about implementing a monitor process in payu for other use cases, so he may well have some code sloshing about already
Just for catching sigterms, not necessarily auto-resubmission, but the issue is marshallward/payu#94
Something like this? https://dl.acm.org/citation.cfm?id=3126938
Hmm.. Some things in the paper indicate that it may not be suitable for us since we have load balancing issues at certain times. Not sure if the software is freely available anyway. Maybe MATM could spawn a slave process that it communicates with occasionally. You could build up a profile of expected waiting times (or supply them) for each stage and issue an abort if things get exceeded by a large enough amount at any stage. This may be easier via payu but I know nothing about payu...
It's a good find @russfiedler, but it seemed to me that this is something that is built into the scheduler (Slurm and Torque in this case).
@aidanheerdegen Yes, I saw that but wasn't sure how tied into the scheduler it was. I thought it may have just been the two platforms they tested on. Anyway, I reckon we can get away with just monitoring the progress of MATM (using MPI3 this requires no changes to MOM or CICE), or maybe make a communicator consisting of the root PEs of each component. That would require changes to ocean_solo.F90 and who knows where in CICE...
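For illustration, here is a minimal sketch of how the root-PE communicator idea could look using MPI_Comm_split; the root-detection condition and all names here are placeholders, not code from any of the components.

```fortran
! Hypothetical sketch: gather the root PE of each coupled component into a
! small communicator so they can exchange cheap progress/heartbeat messages.
program root_pe_comm_sketch
    use mpi
    implicit none
    integer :: ierr, world_rank, colour, roots_comm
    logical :: i_am_component_root

    call MPI_Init(ierr)
    call MPI_Comm_rank(MPI_COMM_WORLD, world_rank, ierr)

    ! In the real models this flag would come from each component's own
    ! decomposition (e.g. rank 0 of the MOM, CICE and MATM intra-communicators);
    ! here it is just a placeholder condition.
    i_am_component_root = (world_rank == 0)

    ! Root PEs pick colour 0 and join roots_comm; everyone else opts out.
    if (i_am_component_root) then
        colour = 0
    else
        colour = MPI_UNDEFINED
    end if
    call MPI_Comm_split(MPI_COMM_WORLD, colour, world_rank, roots_comm, ierr)

    ! ... root PEs could now monitor each other's progress on roots_comm
    ! and call MPI_Abort if one of them stalls ...

    if (roots_comm /= MPI_COMM_NULL) call MPI_Comm_free(roots_comm, ierr)
    call MPI_Finalize(ierr)
end program root_pe_comm_sketch
```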
Had a little bit of free time and whipped up a crude monitor that calls mpi_abort if a segment of code takes too long. It just requires a few calls to be added to MATM to spawn the process and tell it how long it can spend there. The test case works fine. You need to change the number of iterations in the main loop to force a failure or a clean exit. It can easily be hacked to just provide stats on min/max/mean waiting times. https://github.com/russfiedler/dodgy_monitor
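For context, here is a rough sketch of the general watchdog pattern; it is illustrative only, not the actual dodgy_monitor code, and the subroutine name and interface are made up. A dedicated monitor task posts a non-blocking receive for a "segment done" message and calls MPI_Abort if it doesn't arrive within the allowed time.

```fortran
! Illustrative sketch only (not the actual dodgy_monitor code): a dedicated
! monitor task waits for a "segment done" message and aborts the whole job
! if the segment overruns its allowance, instead of letting it hang.
subroutine monitor_segment(comm, source_rank, max_seconds)
    use mpi
    implicit none
    integer, intent(in) :: comm, source_rank
    double precision, intent(in) :: max_seconds

    integer :: ierr, request, done_msg
    integer :: status(MPI_STATUS_SIZE)
    logical :: completed
    double precision :: t_start

    ! Post a non-blocking receive for the completion message from the
    ! monitored task (e.g. the MATM root).
    call MPI_Irecv(done_msg, 1, MPI_INTEGER, source_rank, 0, comm, request, ierr)

    t_start = MPI_Wtime()
    do
        call MPI_Test(request, completed, status, ierr)
        if (completed) exit                       ! segment finished in time
        if (MPI_Wtime() - t_start > max_seconds) then
            ! Overrun: kill the job now rather than burn SUs until walltime.
            call MPI_Abort(comm, 1, ierr)
        end if
        ! (A real monitor would sleep between polls rather than busy-wait.)
    end do
end subroutine monitor_segment
```

On the monitored side the matching call at the end of each segment would just be an MPI_Send of a single integer to the monitor task.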
Love the name of the repo @russfiedler
The model no longer has this hanging problem. Do we want to pursue @russfiedler's dodgy_monitor idea, or should we just close this issue for now (and reopen if it happens again)?
Close it for the moment. As you say, we can revisit it later if need be, or implement it properly. I'm sure there has to be a better way than spawning a new child process off matm. Maybe put some code here: https://github.com/COSIMA/libaccessom2/blob/master/atm/src/atm.F90#L139-L143
This issue has been mentioned on ACCESS Hive Community Forum. There might be relevant details there:
https://forum.access-hive.org.au/t/mom6-crashes-after-initialization/409/9
access-om2-01 hangs unpredictably for about a third of the submissions, burning 28kSU for nothing. I expect this problem to also affect access-om2-025, since it also relies on MXM.
There's been extensive email discussion with @benmenadue, @nicjhan, @marshallward under subject "HELP-8799 MPI library problem? MOM not timestepping" (https://track.nci.org.au/servicedesk/customer/portal/5/HELP-8799) and Slack https://arccss.slack.com/archives/C08KM5VRA/p1516305480000282 but I want to organise key findings here.
Symptoms: no timestepping (e.g. no output to work/ice/OUTPUT, which should get daily files). Eventually the run times out.
The problem arose on 2018-01-17 when Ben set the MXM logfile output to /dev/null for performance reasons. Working hypothesis: MXM fsyncs its log target every 50 milliseconds by default, and this fails because MXM_LOG_FILE points to a non-fsync-able file (/dev/null), causing MXM to abort processing the asynchronous callbacks.
Workaround: use -x MXM_LOG_FILE=$PBS_JOBFS/mxm.log in the mpirun command in config.yaml, and submit with /projects/v45/apps/payu/aek, which was modified to do shell substitution in the mpirun commands. This works most of the time but we still get unpredictable hangs.
This seems to be an OASIS problem. See Ben's email of 2018-01-25 re. /short/v45/aek156/access-om2/control/01deg_jra55_ryf/work-3333791: