Open benclifford opened 1 year ago
crossref the race condition in #2627
here's a related runinfo directory where I just saw something like this, in runinfo/022.
In this log, I think scale in is not being called, but tasks are ending up on a manager 362e761e0b8b which does not appear to belong to any of the 5 launched blocks (or at least has no log directory there), which does not appear to belong to any earlier run either, and which dies from too many missing heartbeats 2 seconds after registering.
It has block id 0, and so does another manager in the log. This suggests that it is a manager from an earlier run that was not scaled in in time, and which connected to the wrong interchange.
runinfo-3.9.zip
In that case, this would be related to issue #2199 where a manager can connect to the wrong interchange.
I'm working on some worker draining code in PR #3063, and one approach to this issue might be to set workers to drain in one iteration, and then in a later iteration kill only those draining candidates which have fully drained.
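To make the drain-then-kill idea concrete, here is a minimal sketch of the two-phase approach. It is a toy model only: `ToyBlock`, `mark_draining` and `kill_fully_drained` are hypothetical names invented for illustration, not Parsl's API and not the code in PR #3063.

```python
from dataclasses import dataclass


@dataclass
class ToyBlock:
    """Hypothetical stand-in for a block; not a Parsl data structure."""
    block_id: str
    running_tasks: int = 0
    draining: bool = False


def accepts_tasks(block):
    """A draining block must not be given new work."""
    return not block.draining


def mark_draining(blocks, candidate_ids):
    """Phase 1 (one strategy iteration): mark candidates, kill nothing."""
    for b in blocks:
        if b.block_id in candidate_ids:
            b.draining = True


def kill_fully_drained(blocks):
    """Phase 2 (a later iteration): kill only draining blocks with no tasks.

    Returns (ids of killed blocks, list of surviving blocks).
    """
    killed = [b.block_id for b in blocks
              if b.draining and b.running_tasks == 0]
    survivors = [b for b in blocks if b.block_id not in killed]
    return killed, survivors


# Block "0" is still running a task when the strategy picks both blocks.
blocks = [ToyBlock("0", running_tasks=1), ToyBlock("1")]
mark_draining(blocks, {"0", "1"})

first_killed, blocks = kill_fully_drained(blocks)

# Later, block "0" finishes its task and can be killed safely.
blocks[0].running_tasks = 0
second_killed, blocks = kill_fully_drained(blocks)
```

In this model the busy block survives the first kill pass and is only removed once its task count reaches zero, at the cost of scale-in taking at least two iterations.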
Describe the bug
I haven't tried to recreate this, but I saw an error in CI that suggests this might happen sometimes: in CI a block was killed by scale_in when it was running a task.
The scale in strategy asks the interchange for a list of connected managers and idle times; it then makes a decision about which blocks to scale in based on those idle times, and individually calls hold_block on each one. These hold block calls can take some time, as they involve an IPC to call hold_worker on each manager registered for each block. During that time, other parts of htex may place a task onto one of the managers which is in the list to be scaled in; that task will then be killed after it has started execution when the containing block is killed.
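The interleaving described above can be reproduced deterministically in a toy model. This is a sketch under stated assumptions: `ToyManager`, `snapshot_idle_blocks`, `dispatch` and `scale_in_naive` are hypothetical names and do not correspond to real htex or interchange code; the `interleave` callback stands in for whatever other htex activity happens while a slow hold call is in flight.

```python
from dataclasses import dataclass


@dataclass
class ToyManager:
    """Hypothetical stand-in for a manager; not Parsl's real structure."""
    block_id: str
    running_tasks: int = 0
    held: bool = False


def snapshot_idle_blocks(managers):
    """Block ids whose managers were all idle when the snapshot was taken."""
    busy = {m.block_id for m in managers if m.running_tasks > 0}
    return sorted({m.block_id for m in managers} - busy)


def dispatch(managers):
    """Place one task on the first manager that is not held."""
    for m in managers:
        if not m.held:
            m.running_tasks += 1
            return m
    return None


def scale_in_naive(managers, candidates, interleave):
    """Hold and kill each candidate block, trusting the stale snapshot.

    Returns the number of running tasks lost when the blocks are killed.
    """
    killed_tasks = 0
    for block in candidates:
        interleave()  # other activity during the slow hold call
        for m in managers:
            if m.block_id == block:
                m.held = True
                killed_tasks += m.running_tasks
    return killed_tasks


managers = [ToyManager("0"), ToyManager("1")]
candidates = snapshot_idle_blocks(managers)  # both blocks look idle here
lost = scale_in_naive(managers, candidates, lambda: dispatch(managers))
```

In this model every candidate block receives a task between the snapshot and its hold, so `lost` counts one killed task per block. Re-checking each manager's task count after the hold has taken effect, and skipping the kill if it is non-zero, would be one way to close the window in this toy version.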
To Reproduce
I haven't reproduced this - I think it's an extremely rare race condition that I've only seen once.
Expected behavior
Strategy-driven scale-in should not kill blocks which are executing tasks.
Environment
CI for initial observation, around parsl master b26c04ffddd4c9b75fa1c08a7b8550185d888553