HTEX manager<->interchange race condition heartbeat timeout vs heartbeat arrival

benclifford commented 6 months ago

Describe the bug I've seen the following occur on a (hopefully unrelated) branch of 2024.03.18 dc521d0c4bb9dde02d64efb79952bb0a4d2f3566 under high load:

A manager registers:

2024-03-19 11:55:14.709 interchange:451 HTEX-Interchange(56) MainThread process_task_outgoing_incoming [INFO] Registration info for manager b'a82cccafaeac': {'type': 'registration', 'parsl_v': '1.3.0-dev', 'python_v': '3.12.2', 'worker_count': 1, 'uid': 'a82cccafaeac', 'block_id': '12', 'prefetch_capacity': 0, 'max_capacity': 1, 'os': 'Linux', 'hostname': 'parsl-dev-3-12-4981', 'dir': '/home/benc/parsl/src/parsl', 'cpu_count': 4, 'total_memory': 16467464192}

Because of high system load:

2024-03-19 11:57:20.380 interchange:613 HTEX-Interchange(56) MainThread expire_bad_managers [WARNING] Too many heartbeats missed for manager b'a82cccafaeac' - removing manager

but then a heartbeat from that manager does arrive, which the interchange cannot handle:

2024-03-19 11:57:20.500 interchange:31 HTEX-Interchange(56) MainThread wrapped [
ERROR] Exceptional ending for starter on thread MainThread
Traceback (most recent call last):
  File "/home/benc/parsl/src/parsl/parsl/process_loggers.py", line 27, in wrappe
d
    r = func(*args, **kwargs)
        ^^^^^^^^^^^^^^^^^^^^^
  File "/home/benc/parsl/src/parsl/parsl/executors/high_throughput/interchange.p
y", line 680, in starter
    ic.start()
  File "/home/benc/parsl/src/parsl/parsl/process_loggers.py", line 27, in wrappe
d
    r = func(*args, **kwargs)
        ^^^^^^^^^^^^^^^^^^^^^
  File "/home/benc/parsl/src/parsl/parsl/executors/high_throughput/interchange.p
y", line 387, in start
    self.process_task_outgoing_incoming(interesting_managers, hub_channel, kill_
event)
  File "/home/benc/parsl/src/parsl/parsl/executors/high_throughput/interchange.p
y", line 473, in process_task_outgoing_incoming
    self._ready_managers[manager_id]['last_heartbeat'] = time.time()
    ~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^
KeyError: b'a82cccafaeac'

at which point the main thread of the interchange is killed.

Expected behavior The htex heartbeat handling code needs to cope with this race condition.

Environment my dev environment, branch from above named parsl version

EricLee543 commented 6 months ago

The direct cause is that the Interchange receives a heartbeat from Manager after running expire_bad_managers. This situation is not common. I tried to reproduce this bug - using the following method. Set heartbeat_period = 30s, heartbeat_threshold = 31s, and introduce a 3s delay when the manager sends a heartbeat. This can reproduce the exception.

I think we can modify the line 473 a little bit. For example,

manager_info = self._ready_managers.get(manager_id)
if manager_info is not None:
    manager_info['last_heartbeat'] = time.time()
    self.task_outgoing.send_multipart([manager_id, b'', PKL_HEARTBEAT_CODE])
    logger.debug("Manager {!r} sent heartbeat via tasks connection".format(manager_id))
else:
    logger.warning("Manager {!r} has been expired.".format(manager_id))

The manger will finally exit due to missing the contact with interchange.

If you think it is a good solution, i would like to raise a PR to fix this.

benclifford commented 6 months ago

@yadudoc @khk-globus @rjmello might be interested in this proposed fix - superficially it looks ok, but I won't have time for the next week to think about this properly.

Parsl / parsl

HTEX manager<->interchange race condition heartbeat timeout vs heartbeat arrival #3262