matthewrmshin opened this issue 5 years ago (status: Open)
(Directed at @sadielbartholomew's comment ... which @kinow seems to understand, but I don't!)
Manuel! Hilarious. The joke here is about this issue's number @hjoliver :-) @matthewrmshin is the winner. Next target 3333
Ah - OK, I humbly bestow Idiot of the Day Award on myself 😬
Wow, 3000 issues. No wonder I feel tired. 😩
I just wanted to comment that we at the 557WW have encountered what was described above, where tasks hang out in Ready status, sometimes for quite a long time, when running suites with many hundreds of tasks. As described in https://github.com/cylc/cylc-flow/issues/2699, the time loss is very hard to quantify because it happens silently. We are running 7.8.6 (for now we do not have py3.7), so we humbly request that any fixes you might come up with be backported to 7.8.x, as we will want to grab that update ASAP. Thank you for your hard work on this very useful software.
@GlennCreighton -
where tasks hang out in Ready status, sometimes for quite a long time, when running suites with many hundreds of tasks.
Can you confirm that "quite a long time" is still finite, i.e. the tasks do eventually leave the ready state and submit their jobs? If so, this almost certainly just means that you need more "suite host" resources - either bigger VMs or more of them. Check by watching the system load on your suite host VM(s) when this happens.
Job submissions (and event handlers and xtriggers, if you have many of those) get actioned by Python sub-processes, and by default we allow a maximum of 4 of those at once. The "ready" state mostly means queued to this process pool, so if tasks get added to the queue faster than the pool can execute them, you will see tasks stuck as "ready" for a while. If the system load is not too high you might get away with simply increasing "process pool size" in global config.
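For reference, this is roughly what I mean in a Cylc 7 global.rc (a sketch only: the values are illustrative and the item names/defaults should be checked against the config reference for your installed version):

```
# Cylc 7 global.rc (user or site global config; the path depends on your installation)
# Item names and defaults assumed from the Cylc 7 docs -- verify against your version.
process pool size = 8         # default is 4; each slot is one Python sub-process
process pool timeout = PT10M  # how long a pool command may run before it is killed
```

Each extra slot is a real Python sub-process, so scale this with the host's spare cores rather than with the number of tasks in the suite.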
Also, beware of event handlers or xtriggers that take a long time to execute as they will tie up a process pool slot until they finish.
@hjoliver, thank you for your suggestion. Yes, I can confirm that the jobs do submit after a while. It has taken upwards of 30 minutes at times. I will definitely check that option out. Thank you for teaching me how it works!
you might get away with simply increasing "process pool size" in global config.
We do the same thing at NIWA, and were prompted to do so after switching a suite from built-in polling tasks to the xtrigger alternative.
Good to know, I am going to try this out. Is there a good rule of thumb here for what to set this to, or is it basically trial and error? If the host is otherwise idle, I'm assuming one would want the max process pool size set to one process per core, maybe leaving a couple of cores free to avoid crowding things. Does that sound reasonable?
I think number of cores is a good guess. It's not entirely trial and error: you can monitor (e.g. with top, or something nicer like htop) system processes when problems like this occur, and it should be pretty clear if the system load is high and the process pools of affected suites are constantly maxed out. It would also be good to get an estimate of the number of job submissions, xtriggers, and event handlers being executed at once, bearing in mind each invocation is a whole Python sub-process (however, simultaneous job submissions are batched to reduce that load).
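For example, a rough way to check on the suite host while tasks are piling up in "ready" (assuming a Linux host; the grep pattern is only an approximation of what the relevant processes look like):

```
# system load at a glance
uptime

# live view of your own processes (htop is a nicer alternative if installed)
top -u "$USER"

# list cylc-related processes with elapsed time and CPU; if the number of
# sub-processes hovers at the configured "process pool size" whenever tasks
# are sitting in "ready", the pool is the bottleneck
ps -u "$USER" -o pid,etime,pcpu,args | grep '[c]ylc'
```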
Update: we should have a way of alerting users to a log-jammed process pool, which is probably the cause of this.
One thing that might factor into this problem is the time it takes to submit to the batch scheduler. Our SAs have pre-scripts that run before the job is actually queued (to ensure no hanging jobs on the reserved nodes, etc.), which can take a bit longer than average to return a submitted status, thus holding up the pool. Unfortunately, my tests with an increased pool size did not seem to mitigate this very much in the past. We may need to reconsider cutting down the number of tasks.
Update: 2022
The job submission pipeline in Cylc 8 is quite different to Cylc 7's; however, the subprocess pool remains the same.
I think we have (probably) recently eliminated the last way to get properly "stuck in ready", but just in case, we could allow (say) cylc trigger to reset a ready task back to the beginning of the prep/submit pipeline? Being "delayed in ready", even for a long time, due to a rammed process pool is not really a bug, but could we add better diagnostics to show the user what is happening in that case?
There shouldn't be any way for tasks to get stuck in the preparing state, so short of a fresh bug report (against Cylc 8) I don't think there's anything for us to do here.
but just in case, we could allow (say) cylc trigger to reset a ready task back to the beginning of the prep/submit pipeline?
The simplest workaround is to restart the workflow; this will force all preparing tasks back through the pipeline.
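Concretely, something along these lines at the command line (Cylc 8, with `my_workflow` as a placeholder name):

```
# stop the running scheduler (see `cylc stop --help` for the available stop modes)
cylc stop my_workflow

# restart it from where it left off; tasks that were "preparing" will go back
# through the submission pipeline
cylc play my_workflow
```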
Agreed, a rammed process pool is not a bug, it's a feature: the pool limit you configured is controlling resource usage.
but could we add better diagnostics to show the user what is happening in that case?
How about logging a warning if a command remains queued for longer than a configured timeout? If we do #5017 this could be done in conjunction.
the pool limit you configured is controlling resource usage.
(Or the default limit that you didn't even know about, to be fair :grin: )
How about logging a warning if a command remains queued for longer than a configured timeout? If we do https://github.com/cylc/cylc-flow/issues/5017 this could be done in conjunction.
Seems reasonable.
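To make that concrete, a purely hypothetical sketch of how such a knob might look; the timeout item below does not exist in Cylc and is invented only for discussion, and the `[scheduler]` placement of the existing pool setting is assumed from the Cylc 8 global config:

```
# global.cylc -- HYPOTHETICAL sketch for discussion only
[scheduler]
    process pool size = 4    # existing item (placement assumed): max concurrent pool sub-processes
    # invented item, does not exist in Cylc: warn in the scheduler log if a
    # queued pool command has waited longer than this for a free slot
    process pool queue warning timeout = PT5M
```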
I thought we should have eliminated this problem. We have in place:
But annoyingly, we still have situations where tasks are stuck in the ready state! Some ideas:
Related: #2699, #2964, #2999.