Open benclifford opened 6 months ago
The direct cause is that the Interchange receives a heartbeat from a Manager after running expire_bad_managers. This situation is not common.
I tried to reproduce this bug using the following method: set heartbeat_period = 30s and heartbeat_threshold = 31s, and introduce a 3s delay when the manager sends a heartbeat. This reproduces the exception.
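To make the timing in that reproduction explicit, here is a minimal sketch of the arithmetic. It is a simplified model, not Parsl code: it assumes the injected delay stretches the gap between the interchange's last recorded contact and the next heartbeat arrival to period + delay, and that the expiry check runs as soon as the threshold has passed. The function name `arrives_after_expiry` is hypothetical.

```python
# Simplified timing model (assumption, not Parsl's actual scheduling):
# the interchange last heard from the manager at t = 0, and the next
# heartbeat arrives at t = period + delay.

HEARTBEAT_PERIOD = 30.0      # seconds, value used in the reproduction
HEARTBEAT_THRESHOLD = 31.0   # seconds, value used in the reproduction
SEND_DELAY = 3.0             # seconds, artificial delay added in the manager


def arrives_after_expiry(period: float, threshold: float, delay: float) -> bool:
    """Return True if the delayed heartbeat arrives after the point at
    which the interchange is allowed to expire the manager."""
    arrival_gap = period + delay
    return arrival_gap > threshold


if __name__ == "__main__":
    # 30 + 3 = 33 seconds, which exceeds the 31 second threshold.
    print(arrives_after_expiry(HEARTBEAT_PERIOD, HEARTBEAT_THRESHOLD, SEND_DELAY))
```

With these values the heartbeat lands 2 seconds after the manager becomes eligible for expiry, which is the window in which the exception can occur.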
I think we can modify line 473 slightly. For example:
manager_info = self._ready_managers.get(manager_id)
if manager_info is not None:
    manager_info['last_heartbeat'] = time.time()
    self.task_outgoing.send_multipart([manager_id, b'', PKL_HEARTBEAT_CODE])
    logger.debug("Manager {!r} sent heartbeat via tasks connection".format(manager_id))
else:
    logger.warning("Manager {!r} has been expired.".format(manager_id))
The manager will eventually exit because it loses contact with the interchange.
If you think this is a good solution, I would like to open a PR to fix it.
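For anyone who wants to see the race in isolation, here is a self-contained toy model. It is only a sketch under assumptions: `ToyInterchange` and its methods are hypothetical stand-ins rather than Parsl's Interchange, timestamps are passed in explicitly instead of using time.time(), and it assumes the failure is a KeyError from a direct dictionary lookup (the traceback is not reproduced in this thread).

```python
class ToyInterchange:
    """Hypothetical stand-in for the interchange's manager bookkeeping."""

    def __init__(self, heartbeat_threshold: float) -> None:
        self.heartbeat_threshold = heartbeat_threshold
        self._ready_managers: dict = {}

    def register(self, manager_id: bytes, now: float) -> None:
        self._ready_managers[manager_id] = {"last_heartbeat": now}

    def expire_bad_managers(self, now: float) -> list:
        """Remove managers whose last heartbeat is older than the threshold."""
        bad = [
            m for m, info in self._ready_managers.items()
            if now - info["last_heartbeat"] > self.heartbeat_threshold
        ]
        for m in bad:
            self._ready_managers.pop(m)
        return bad

    def handle_heartbeat_unguarded(self, manager_id: bytes, now: float) -> None:
        # Direct lookup: raises KeyError if the manager was already expired.
        self._ready_managers[manager_id]["last_heartbeat"] = now

    def handle_heartbeat_guarded(self, manager_id: bytes, now: float) -> bool:
        # Lookup with .get(), as in the proposed change: a late heartbeat
        # from an expired manager is ignored instead of raising.
        info = self._ready_managers.get(manager_id)
        if info is None:
            return False
        info["last_heartbeat"] = now
        return True


# Replay the race: register at t=0, expire at t=32, heartbeat arrives at t=33.
ic = ToyInterchange(heartbeat_threshold=31.0)
ic.register(b"manager-1", now=0.0)
expired = ic.expire_bad_managers(now=32.0)

try:
    ic.handle_heartbeat_unguarded(b"manager-1", now=33.0)
    unguarded_failed = False
except KeyError:
    unguarded_failed = True

guarded_ok = ic.handle_heartbeat_guarded(b"manager-1", now=33.0)
```

In this model the unguarded handler fails once the manager has been removed, while the guarded one returns False and leaves the caller free to log a warning. No heartbeat reply is sent in that branch, which matches the behaviour described above where the manager eventually exits on its own.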
@yadudoc @khk-globus @rjmello might be interested in this proposed fix - superficially it looks ok, but I won't have time for the next week to think about this properly.
Describe the bug: I've seen the following occur on a (hopefully unrelated) branch of 2024.03.18 dc521d0c4bb9dde02d64efb79952bb0a4d2f3566 under high load:
A manager registers:
Because of high system load:
but then a heartbeat from that manager does arrive, which the interchange cannot handle:
at which point the main thread of the interchange is killed.
Expected behavior: The htex heartbeat handling code needs to cope with this race condition.
Environment: my dev environment, branch from above named parsl version