Closed by lgarrison 3 years ago
It may be significant that rhea111 has the "got shutdown" log message before "Engine shutting down", while rhea110 (which crashed) has "Engine shutting down" but not "got shutdown". The "Engine shutting down" message appears to be triggered by the SIGTERM handler. Perhaps this signal handler is indirectly shutting down the `EngineBlock`, which calls `kvs.close()`. I believe the `kvs` object is shared, so that could explain the race, but I'm not totally sure where the SIGTERM is issued or if terminating all the cylinders is enough to terminate the `EngineBlock`.
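To make the "shared object closed from two paths" idea concrete, here is a minimal sketch of one way to guard against it. This is not disBatch or KVS code: `GuardedClient` is a hypothetical stand-in, and I'm assuming the two shutdown paths behave like two threads calling `close()` on the same object, which may not match what the engine actually does.

```python
import socket
import threading


class GuardedClient:
    """Hypothetical wrapper around a socket; not the real KVSClient."""

    def __init__(self, sock):
        self._sock = sock
        self._lock = threading.Lock()
        self.closed = False

    def close(self):
        # Only the first caller flips the flag and closes the socket.
        # Later callers see closed == True and return without touching it.
        with self._lock:
            if self.closed:
                return False
            self.closed = True
        self._sock.close()
        return True


# Two shutdown paths racing to close the same shared client.
a, b = socket.socketpair()
client = GuardedClient(a)
results = []
threads = [
    threading.Thread(target=lambda: results.append(client.close()))
    for _ in range(2)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
b.close()

# Exactly one caller performed the close; the other was a no-op.
print(sorted(results))
```

A guard like this only covers a double close. It would not help if one path closes the object while another is already in the middle of a receive, and taking a non-reentrant lock directly inside a Python signal handler can deadlock, so the handler would probably need to hand the shutdown off rather than call `close()` itself.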
I got the following Slurm job output at the end of a disBatch job that otherwise appeared successful:
Looking at the disBatch status log, all the tasks appear to have completed without error. The rhea110 engine crash appears to have occurred while the engine was shutting down. I've attached the rhea110 engine log, the rhea111 engine log (for comparison with an engine that did not crash), and the driver log.
AbacusSummit_base_c000_ph006_rhea110_engine.log AbacusSummit_base_c000_ph006_rhea111_engine.log AbacusSummit_base_c000_ph006_driver.log
Skimming the KVS source, it looks like this could maybe occur if the `KVSClient` tried to do a receive after it was closed. So maybe there's a race condition of some sort? This particular job is actually one I've run a half dozen times before without error.
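For reference, the general failure mode (a receive on a socket that has already been closed) is easy to show with plain Python sockets. This is only a sketch of the pattern using the standard library, with a hypothetical `recv_after_close` helper. It does not use `KVSClient`, and I have not confirmed that this is the exact path the engine hit.

```python
import socket


def recv_after_close():
    """Close one end of a socket pair, then try to receive on it."""
    left, right = socket.socketpair()
    right.sendall(b"ping")  # data is waiting, but that doesn't matter
    left.close()            # stands in for the shutdown path closing the client
    try:
        left.recv(4)        # stands in for a late receive on the same object
    except OSError as exc:
        return exc          # a closed socket object raises OSError on recv
    finally:
        right.close()
    return None


err = recv_after_close()
print(type(err).__name__, err)
```

If the shutdown path and a pending receive can interleave like this, the outcome would depend on timing, which would be consistent with the job having run cleanly several times before.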