Parsl / parsl

Parsl - a Python parallel scripting library
http://parsl-project.org
Apache License 2.0

HTEX submit-side should notice interchange has gone away rather than hanging #3374

Open benclifford opened 1 month ago

benclifford commented 1 month ago

Describe the bug

In some situations, the interchange will go away. Common causes are the OOM killer, a user pressing Ctrl-C, and developers hacking on the interchange and breaking it.

In this situation, the submit side sits waiting for the interchange to reappear on the ZMQ channels, but it neither detects that the interchange has gone nor attempts any repair.

This kind of failure means that the high throughput executor cannot continue to work, and the submit side should act accordingly rather than hanging.
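As a rough illustration of the behaviour asked for here, the sketch below waits for a result with a timeout and checks between waits whether the child process is still alive. It is not Parsl code: `wait_for_result`, `InterchangeGoneError` and `demo` are hypothetical names, a sleeping child process stands in for the interchange, and a `queue.Queue` stands in for the ZMQ results channel.

```python
import queue
import subprocess
import sys
import threading


class InterchangeGoneError(RuntimeError):
    """Hypothetical exception: the stand-in interchange process has exited."""


def wait_for_result(proc, results, poll_interval=0.1):
    """Wait for one result, but fail instead of hanging if `proc` has exited."""
    while True:
        try:
            # Bounded wait, so the liveness check below runs regularly.
            return results.get(timeout=poll_interval)
        except queue.Empty:
            pass
        exit_code = proc.poll()  # None while the child is still running
        if exit_code is not None:
            raise InterchangeGoneError(
                f"interchange exited with code {exit_code}"
            )


def demo():
    # Stand-in for the interchange: a child process that only sleeps.
    proc = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )
    results = queue.Queue()  # stand-in for the ZMQ results channel
    # Simulate the OOM killer or Ctrl-C by killing the child shortly after start.
    threading.Timer(0.3, proc.kill).start()
    try:
        wait_for_result(proc, results)
    except InterchangeGoneError as e:
        return str(e)
    finally:
        if proc.poll() is None:
            proc.kill()
        proc.wait()
    return "no failure detected"


if __name__ == "__main__":
    print(demo())
```

In a real fix, the same idea would presumably mean using a timeout on the ZMQ poll and failing outstanding tasks when the interchange is found to be dead, but how that fits into the executor is left open here.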

To Reproduce

Kill the interchange midway through a test run.