Closed bertinia closed 5 years ago
CISL made a change to the underlying ncarenv module links so this issue should be fixed.
A new issue related to an MPT launch error has surfaced. From /glade/work/aliceb/sandboxes/runs/b.e21.BHIST.f09_g17.CMIP6-historical.011/atm_averages:
[aliceb@cheyenne4:logs]>cat atm_averages.log.20190708-105129
MPT: Launcher network accept (MPI_LAUNCH_TIMEOUT) timed out
MPT: Launcher on r3i2n24 failed to receive connection(s) from: r3i2n24.ib0.cheyenne.ucar.edu r3i2n25.ib0.cheyenne.ucar.edu r3i2n26.ib0.cheyenne.ucar.edu r3i2n34.ib0.cheyenne.ucar.edu
MPT ERROR: could not launch executable
(HPE MPT 2.19 02/23/19 05:31:12)
The MPT launch error occurred because some partial directories/files were left in place. atm_avg_generator.py should have been able to detect this and exit gracefully. Once I deleted the directories and files in question and resubmitted, it ran fine.
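A pre-launch check of the kind described above could look like the following sketch. The directory layout, the `.complete` marker-file convention, and the function names are illustrative assumptions, not the actual atm_avg_generator.py logic:

```python
import sys
from pathlib import Path

def find_partial_output(avg_dir, done_marker=".complete"):
    """Return subdirectories left over from an interrupted run.

    A subdirectory without the completion marker is treated as
    partial output. Both the marker name and the layout are
    hypothetical; adapt them to the generator's real convention.
    """
    avg_path = Path(avg_dir)
    if not avg_path.exists():
        return []
    return [d for d in sorted(avg_path.iterdir())
            if d.is_dir() and not (d / done_marker).exists()]

def exit_if_partial(avg_dir):
    """Exit gracefully before the MPI launch instead of letting
    MPT fail later with an opaque launch timeout."""
    leftovers = find_partial_output(avg_dir)
    if leftovers:
        names = ", ".join(str(d) for d in leftovers)
        sys.exit("ERROR: partial output found (%s); "
                 "remove these directories and resubmit." % names)
```

Running `exit_if_partial("atm_averages")` at startup would have produced a clear error message instead of the MPT launcher timeout.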
However, we have a new issue with the diagnostics related to some instabilities in the system. Here's what CISL says about this latest issue on cheyenne:
Okay, so I suspect that the job issue is tied to node instability we are currently experiencing. Apparently there is an issue with the numa libraries in the OS image from HPE, and this issue is causing batch nodes to crash and reboot at a much higher rate than normal. We are trying to get HPE's immediate focus on this problem, as it will probably hit any jobs that use MPT, and it may hit jobs that use other MPIs as well. This is a new failure mode we are seeing... I could move MPI4Py to a different MPI (probably Open MPI, since we are using GCC to build), but the issue will likely still hit the CESM runs, assuming they are using MPT. Let me know if you want to try an Open MPI version to test out its stability.
The reason that jobs run multiple times with the same job ID is that if a node reboot is detected, PBS will often try to rerun the job, as it assumes that a hardware fault is the root issue. But if it is a software issue, the same job can end up causing multiple nodes to reboot in an unfortunate cycle. If I look at job 7030831 with qstat -f, I can see that the run_count is currently at 27 and may keep rising until the software issue is resolved or the job is manually killed.
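The rerun count CISL mentions can be checked programmatically. The sketch below parses the `run_count` field from PBS Pro's `qstat -f` output (the field name follows the qstat output CISL describes; the alert threshold is an arbitrary example):

```python
import re
import subprocess

def parse_run_count(qstat_output):
    """Extract the run_count attribute from 'qstat -f' text.

    PBS Pro prints job attributes one per line, e.g.
    '    run_count = 27'. Returns None if the field is absent.
    """
    m = re.search(r"^\s*run_count\s*=\s*(\d+)",
                  qstat_output, re.MULTILINE)
    return int(m.group(1)) if m else None

def check_rerun_cycle(job_id, threshold=5):
    """Query qstat for a job and flag a suspected reboot/rerun cycle.

    The threshold is a hypothetical choice; a normally behaving job
    should have a run_count of 1.
    """
    out = subprocess.run(["qstat", "-f", str(job_id)],
                         capture_output=True, text=True,
                         check=True).stdout
    count = parse_run_count(out)
    if count is not None and count > threshold:
        print("Job %s has run %d times; likely stuck in a "
              "node-reboot cycle. Consider qdel." % (job_id, count))
    return count
```

For a job like 7030831 above, `check_rerun_cycle("7030831")` would have reported the run_count of 27 long before 27 reboots had occurred.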
fixed in PR #210
The system modules and ncar_pylib (v 20190627) updates on cheyenne cause the following PyNio error: