Describe the bug
I noticed this hang in GitHub Actions, in the regularly hanging --config local tests. I think it is a combination of several existing issues, interacting in a way that is interesting enough to document here. This sequence of events happens in this test run: https://github.com/Parsl/parsl/actions/runs/7460656838/artifacts/1156696809
Here is the event sequence I have observed in those logs.
Various tests run, in deliberately randomised order.
Eventually this test runs: parsl/tests/test_error_handling/test_serialization_fail.py and in this case stores its runinfo in .pytest/parsltest-20240109.115704-local-az5643ct/runinfo/012
This test launches a block with one htex worker, with manager ID b19dc7e14760 configured to connect to an interchange on arbitrarily chosen task port 54826 and result port 54631
On the submit side, the test rapidly completes (because it is testing a particular submit time error), well before the launched block has properly started and connected.
The test shuts down the DFK, because the test has passed successfully. Because of issue #2627 the launched block is not terminated, and the manager continues to initialize. The prevailing incorrect attitude to issue #2627 has been that blocks abandoned in this way are mostly-harmless.
Further tests run, with more htexes, and more interchanges listening on arbitrary ports.
Eventually another test runs where the arbitrary task port is also 54826 (but this time with result port 54749). This test is in .pytest/parsltest-20240109.115704-local-az5643ct/runinfo/016.
Because this test's interchange is listening on port 54826, the manager from the 012 test connects and registers with the interchange and sends keepalives. However, the 016 interchange's result port is not 54631, so the worker does not connect to the 016 interchange for results. Neither the interchange nor the manager notices that this connection has not happened.
This half-connected behaviour might be regarded as a bug, and may be another motivation to use a single TCP connection here? I talked about this with @khk-globus a bit: by separating data onto multiple TCP connections that need to perform a delicate dance together, the coordination load is pushed onto the implementors of htex, rather than being handled by the implementors of TCP and ZMQ - see #3022
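As an illustration of how silent this is, here is a minimal pyzmq sketch (not Parsl code; the ports are just the ones from this run): one connect has a live listener, the other does not, and neither side gets any error or other signal about the missing half.

```python
import zmq

ctx = zmq.Context()

# Stand-in for the 016 interchange: listening on the shared task port only.
task_router = ctx.socket(zmq.ROUTER)
task_router.bind("tcp://127.0.0.1:54826")

# Stand-in for the abandoned 012 manager: connects to both of its configured ports.
tasks = ctx.socket(zmq.DEALER)
tasks.connect("tcp://127.0.0.1:54826")    # a listener exists, so registration works

results = ctx.socket(zmq.DEALER)
results.connect("tcp://127.0.0.1:54631")  # nothing listens here; no error is raised

tasks.send(b"register")                   # arrives at the interchange
print(task_router.recv_multipart())       # [identity, b'register']
results.send(b"result")                   # silently queued; nobody will ever read it
```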
This connection of an enemy manager is documented in issue #2199
The enemy 012 manager receives the task to run, runs the task, and attempts to send the result back to port 54631. At the ZMQ layer, ZMQ is still trying to connect to that port, on the assumption that connecting will eventually be possible and that retrying is the right thing to be doing.
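That queue-and-retry behaviour is easy to see in isolation. A small pyzmq sketch (again not Parsl code), assuming nothing is initially listening on the port:

```python
import time
import zmq

ctx = zmq.Context()

sender = ctx.socket(zmq.DEALER)
sender.connect("tcp://127.0.0.1:54631")   # no listener yet; connect() still succeeds
sender.send(b"task result")               # queued inside ZMQ, no error reported

time.sleep(2)                             # ZMQ quietly keeps retrying the connect

receiver = ctx.socket(zmq.ROUTER)
receiver.bind("tcp://127.0.0.1:54631")    # a listener finally appears...
print(receiver.recv_multipart())          # ...and the queued message is delivered now
```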
The 016 interchange does not receive a result. This is generally ok behaviour: a task may take arbitrarily long to complete.
The test hangs forever waiting for a result (or at least until GitHub Actions times out).
To Reproduce
Run a Parsl task with only a manually launched worker, using the correct task port but an incorrect result port.
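A rough sketch of that setup, not a verified reproducer: pin the worker-facing ports via HighThroughputExecutor's worker_ports option, keep the provider from launching any blocks, and hand-launch a worker pool against the right task port but a wrong result port. The worker-pool flags shown in the comment reflect the two-port CLI at the time of this issue and may differ in other versions.

```python
import parsl
from parsl.config import Config
from parsl.executors import HighThroughputExecutor
from parsl.providers import LocalProvider

config = Config(
    executors=[
        HighThroughputExecutor(
            label="htex_manual",
            worker_ports=(54826, 54631),  # fixed task and result ports
            provider=LocalProvider(init_blocks=0, min_blocks=0, max_blocks=0),
        )
    ]
)
parsl.load(config)


@parsl.python_app
def noop():
    return 0


fut = noop()

# In another shell, start a worker pool by hand with the correct task port but
# a deliberately wrong result port, along the lines of:
#   process_worker_pool.py -a 127.0.0.1 --task_port=54826 --result_port=54999
# The manager registers and receives the task, but its result goes to a port
# nobody is listening on, so the next line blocks indefinitely.
print(fut.result())
```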
Expected behavior
Test completion. No incorrect worker connections. Workers should be shut down properly.
Environment
GitHub Actions, on a PR that is hopefully unrelated.