flatironinstitute / disBatch

Tool to distribute a list of computational tasks over a pool of compute resources. The pool can grow or shrink.
Apache License 2.0

KVS receive after shutdown? #18

Closed lgarrison closed 3 years ago

lgarrison commented 4 years ago

I got the following Slurm job output at the end of a disBatch job that otherwise appeared successful:

Traceback (most recent call last):
  File "/autofs/nccs-svm1_proj/ast145/lgarrison/abacus/external/disBatch/disBatch.py", line 919, in <module>
srun: error: rhea110: task 0: Exited with exit code 1
srun: Terminating job step 36962.12
    kvs.view('.shutdown')
  File "/autofs/nccs-svm1_proj/ast145/lgarrison/abacus/external/disBatch/kvsstcp/kvsclient.py", line 188, in view
    return self._get_view(b'view', key, encoding)
  File "/autofs/nccs-svm1_proj/ast145/lgarrison/abacus/external/disBatch/kvsstcp/kvsclient.py", line 128, in _get_view
    v = self._recvValue(encoding is True and coding == b'PYPK')
  File "/autofs/nccs-svm1_proj/ast145/lgarrison/abacus/external/disBatch/kvsstcp/kvsclient.py", line 59, in _recvValue
    l = int(recvall(self.socket, AsciiLenChars))
  File "/autofs/nccs-svm1_proj/ast145/lgarrison/abacus/external/disBatch/kvsstcp/kvscommon.py", line 47, in recvall
    r = s.recv(n, socket.MSG_WAITALL)
AttributeError: 'NoneType' object has no attribute 'recv'
Some engine processes failed -- please check the logs

According to the disBatch status log, all the tasks completed without error. The rhea110 engine crash appears to have occurred while the engine was shutting down. I've attached the rhea110 engine log, the rhea111 engine log (for comparison, from an engine that did not crash), and the driver log.

AbacusSummit_base_c000_ph006_rhea110_engine.log AbacusSummit_base_c000_ph006_rhea111_engine.log AbacusSummit_base_c000_ph006_driver.log

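Skimming the KVS source, this looks like what would happen if the KVSClient tried to do a receive after it was closed: the traceback shows that the socket handed to recvall was None. So maybe there's a race condition between shutdown and the final kvs.view('.shutdown') call? This particular job is one I've run half a dozen times before without error, which would fit an intermittent race.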
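To make the hypothesis above concrete, here is a minimal sketch of the failure mode. It assumes a client whose close() drops its socket reference by setting it to None; `TinyClient` and `reproduce_close_then_view` are hypothetical names for illustration and are not the real kvsstcp code. The closing thread is joined before the receive, so the "close wins" ordering is forced rather than left to chance.

```python
import socket
import threading


class TinyClient:
    """Hypothetical stand-in for a KVS-style client (not the real KVSClient)."""

    def __init__(self, sock):
        self.socket = sock

    def close(self):
        # Assumption: closing drops the socket reference entirely.
        if self.socket is not None:
            self.socket.close()
            self.socket = None

    def view(self, nbytes):
        # No guard here: if close() already ran, self.socket is None.
        return self.socket.recv(nbytes)


def reproduce_close_then_view():
    """Force the ordering 'close happens first, then view' and report the error."""
    ours, peer = socket.socketpair()
    client = TinyClient(ours)

    # Another thread closes the shared client, as a shutdown path might.
    closer = threading.Thread(target=client.close)
    closer.start()
    closer.join()  # make the losing side of the race deterministic

    try:
        client.view(4)
    except AttributeError as exc:
        return str(exc)
    finally:
        peer.close()
    return None


if __name__ == "__main__":
    print(reproduce_close_then_view())
```

If the real client behaves like this sketch, the error would only appear on the runs where the close happens to land before the last receive, which would fit a job that usually succeeds.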

lgarrison commented 4 years ago

It may be significant that the rhea111 log has the "got shutdown" message before "Engine shutting down", while the rhea110 log (from the engine that crashed) has "Engine shutting down" but no "got shutdown". The "Engine shutting down" message appears to be emitted by the SIGTERM signal handler. Perhaps this handler is indirectly shutting down the EngineBlock, which calls kvs.close(). I believe the kvs object is shared, which could explain the race, but I'm not sure where the SIGTERM is issued, or whether terminating all the cylinders is enough to terminate the EngineBlock.
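The ordering described above can be sketched as follows, assuming that a SIGTERM handler ends up closing a client object that the main code path still uses. Everything here is hypothetical (`SharedClient`, `run_once`, and the handler are illustrative, not disBatch code); the event strings only borrow the wording of the two log messages. It needs Python 3.8+ for `signal.raise_signal` and must run in the main thread.

```python
import signal
import socket
import time


class SharedClient:
    """Hypothetical shared client; close() is assumed to set the socket to None."""

    def __init__(self, sock):
        self.socket = sock

    def close(self):
        if self.socket is not None:
            self.socket.close()
            self.socket = None

    def view(self, nbytes):
        return self.socket.recv(nbytes)


def run_once():
    """Deliver SIGTERM before the final view() and record what happens."""
    ours, peer = socket.socketpair()
    client = SharedClient(ours)
    events = []

    def on_sigterm(signum, frame):
        # Assumed behaviour: the handler logs, then tears down the shared client.
        events.append("Engine shutting down")
        client.close()

    previous = signal.signal(signal.SIGTERM, on_sigterm)
    try:
        signal.raise_signal(signal.SIGTERM)
        # Wait until the Python-level handler has actually run.
        while not events:
            time.sleep(0.01)
        try:
            client.view(1)
            events.append("got shutdown")
        except AttributeError as exc:
            events.append(f"crash: {exc}")
    finally:
        # Restore whatever handler was installed before the sketch ran.
        signal.signal(signal.SIGTERM, previous)
        peer.close()
    return events


if __name__ == "__main__":
    print(run_once())
```

In this sketch the handler always wins, so "got shutdown" is never recorded, matching the shape of the rhea110 log. Whether the real SIGTERM path reaches kvs.close() is exactly the open question.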