Open benclifford opened 6 months ago
Since the ManagerSelector feature was added to HTEX in #3547, if a ManagerSelector fails to sort the manager list and return it to the interchange, the interchange crashes but leaves the rest of the program hanging. I am documenting this behavior for future reference, in case more complex ManagerSelectors are introduced that increase the chance of breakage.
Many parts of Parsl, and especially HTEX, produce end-user hangs because bad behaviour is "ignored" rather than "propagated towards the user". See #3404 and #3427. This comes from the original prototype-oriented implementation, in which it was desirable to keep doing as much as possible rather than failing early on prototype-quality coding errors. The priority then was getting as much work done as possible to report on the happy path, rather than providing a good user experience.
Six years after prototyping, and with a much more production-oriented user base, reporting errors to the user and failing cleanly rather than hanging are much more important. I would like to see Parsl, and especially HTEX, move to a more serious "fail and let a higher level supervisor do something with the failure" model, inspired by Erlang/OTP's "let it crash" philosophy.
For example, if any thread in the interchange dies unexpectedly, the only sensible thing to do is for the interchange to exit (unless more subtle reasoning is applied to individual errors, turning them into expected errors). And when the interchange exits, the only reasonable thing to do is for the high throughput executor to notice that and to pass the error up by failing all tasks. (Contrast this with the current behaviour in #3404, where the command thread exiting turns into a permanent hang.)
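As a rough illustration of that model, here is a minimal sketch using only the Python standard library. It is not Parsl code: `Supervisor`, `WorkerDied`, `fail_all` and `demo` are hypothetical names invented for this example, and a real interchange would exit its process rather than just set an event. The sketch shows the two halves of the idea: a thread that dies is recorded instead of ignored, and the higher level reacts by failing every pending task future.

```python
import threading
from concurrent.futures import Future


class WorkerDied(Exception):
    """Set on pending futures when a supervised thread dies unexpectedly."""


class Supervisor:
    """Hypothetical helper: runs threads and records the first failure."""

    def __init__(self):
        self.failed = threading.Event()
        self.failure = None  # (thread name, exception) of the first failure
        self._lock = threading.Lock()

    def spawn(self, name, target):
        def run():
            try:
                target()
            except BaseException as exc:
                # Do not swallow the error: record it and signal the supervisor.
                with self._lock:
                    if self.failure is None:
                        self.failure = (name, exc)
                self.failed.set()

        thread = threading.Thread(target=run, name=name, daemon=True)
        thread.start()
        return thread


def fail_all(pending, supervisor):
    """Propagate the recorded failure to every task that has not finished."""
    name, exc = supervisor.failure
    error = WorkerDied(f"thread {name!r} died: {exc!r}")
    for fut in pending:
        if not fut.done():
            fut.set_exception(error)


def demo():
    supervisor = Supervisor()
    pending = [Future() for _ in range(3)]  # stand-ins for submitted tasks

    def broken_command_loop():
        # Stand-in for any unexpected error inside an interchange thread.
        raise RuntimeError("bad manager list")

    supervisor.spawn("command", broken_command_loop)

    # The executor side: notice the failure and fail tasks instead of hanging.
    if supervisor.failed.wait(timeout=5):
        fail_all(pending, supervisor)
    return pending
```

With this shape, a caller waiting on any of the futures returned by `demo()` gets a `WorkerDied` exception from `result()` instead of blocking forever, which is the user-visible difference between failing cleanly and hanging.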
This issue is to give the above concept an issue number for cross-referencing in other PRs and issues, rather than being a target for a single fix.