Parsl / parsl

Parsl - a Python parallel scripting library
http://parsl-project.org
Apache License 2.0

possible race condition in strategy-driven scale_in resulting in killed tasks #2769

Open benclifford opened 1 year ago

benclifford commented 1 year ago

Describe the bug
I haven't tried to recreate this, but I saw an error in CI that suggests it can happen sometimes: a block was killed by scale_in while it was running a task.

The scale-in strategy asks the interchange for a list of connected managers and their idle times; it then decides which blocks to scale in based on those idle times, and calls hold_block on each one individually.

These hold_block calls can take some time, as each involves an IPC call to hold_worker on every manager registered to the block.

During that time, other parts of htex may place a task onto one of the managers in the scale-in list; that task will then be killed mid-execution when the containing block is killed.
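To make that window concrete, here is a minimal, hypothetical sketch of the flow; the class and method names below (Manager, Executor, connected_managers, hold_block, submit) are stand-ins rather than Parsl's actual internals, and the sleep is just a stand-in for the per-manager IPC latency:

```python
import time
from dataclasses import dataclass, field


@dataclass
class Manager:
    manager_id: str
    block_id: str
    idle_since: float
    tasks: list = field(default_factory=list)
    held: bool = False


class Executor:
    def __init__(self, managers):
        self.managers = managers

    def connected_managers(self):
        # The strategy's view of the world is a snapshot taken at this moment.
        return list(self.managers)

    def hold_block(self, block_id):
        # One IPC round trip per manager registered to the block; this is
        # where the scale-in loop spends noticeable wall-clock time.
        for m in self.managers:
            if m.block_id == block_id:
                time.sleep(0.1)  # stand-in for the IPC latency
                m.held = True

    def submit(self, task, manager):
        # The dispatcher does not consult the strategy's pending decision,
        # so it may still place work on a manager that is about to be held.
        if not manager.held:
            manager.tasks.append(task)


def scale_in(executor, idle_threshold):
    snapshot = executor.connected_managers()
    victims = {m.block_id for m in snapshot
               if time.time() - m.idle_since > idle_threshold}
    for block_id in victims:
        # RACE WINDOW: between taking `snapshot` and this hold taking effect,
        # the dispatcher may submit tasks to the block's managers; when the
        # block is later cancelled, those tasks are killed mid-execution.
        executor.hold_block(block_id)
    return victims
```

The point is only that the idle-time snapshot and the hold are separated by per-manager IPC, and nothing stops the dispatcher from filling that gap.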

To Reproduce
I haven't reproduced this; I think it's an extremely rare race condition that I've only seen once.

Expected behavior
Strategy-driven scale-in should not kill blocks that are executing tasks.

Environment
CI for the initial observation, around parsl master b26c04ffddd4c9b75fa1c08a7b8550185d888553

benclifford commented 1 year ago

Cross-referencing the race condition in #2627.

benclifford commented 1 year ago

here's a related runinfo directory where I just saw something like this, in runinfo/022.

In this log, I think scale-in is not being called, but tasks are ending up on a manager 362e761e0b8b which does not appear to belong to any of the 5 launched blocks (or at least has no log directory there), does not appear to belong to any earlier run either, and dies from too many missed heartbeats 2 seconds after registering.

It has block id 0, as does another manager in the log. This suggests it is a manager from an earlier run that was not scaled in in time and connected to the wrong interchange. runinfo-3.9.zip

In that case, this would be related to issue #2199 where a manager can connect to the wrong interchange.

benclifford commented 6 months ago

I'm working on some worker draining code in PR #3063, and one approach to this issue might be to set workers to drain in one iteration, and then later kill only those draining candidates which have fully drained.
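Roughly the shape I have in mind, as a hypothetical sketch rather than the code in PR #3063 (drain_block, cancel_blocks and connected_managers here are made-up names, reusing the Manager fields from the sketch above):

```python
import time


def strategy_iteration(executor, idle_threshold, draining):
    """One strategy pass; `draining` is the set of block ids marked earlier."""
    managers = executor.connected_managers()

    # Phase 1: mark idle blocks as draining. A draining manager stops
    # accepting new tasks but finishes whatever it already has.
    for m in managers:
        if m.block_id not in draining and time.time() - m.idle_since > idle_threshold:
            executor.drain_block(m.block_id)   # hypothetical call
            draining.add(m.block_id)

    # Phase 2: of the blocks marked in earlier iterations, cancel only those
    # whose managers report no outstanding tasks. A block that picked up work
    # before the drain took effect is left alone until it really is empty.
    fully_drained = {b for b in draining
                     if all(not m.tasks for m in managers if m.block_id == b)}
    executor.cancel_blocks(fully_drained)      # hypothetical call
    draining -= fully_drained
```

Since a block only becomes a cancellation candidate at least one iteration after it stops accepting work, a task placed during the hold/drain window runs to completion instead of being killed.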