matthewrmshin opened this issue 5 years ago (status: Open)
(Directed at @sadielbartholomew's comment ... which @kinow seems to understand, but I don't!)
Manuel! Hilarious. The joke here is about this issue's number @hjoliver :-) @matthewrmshin is the winner. Next target 3333
Ah - OK, I humbly bestow Idiot of the Day Award on myself 😬
Wow, 3000 issues. No wonder I feel tired. 😩
I just wanted to comment that we at the 557WW have encountered what was described above, where tasks hang out in Ready status, sometimes for quite a long time, when running suites with many hundreds of tasks. As described in https://github.com/cylc/cylc-flow/issues/2699, the time loss is very hard to quantify because it happens silently. We are running 7.8.6 (for now we do not have py3.7), so we humbly request that any fixes you might come up with be backported to 7.8.x, as we will want to grab that update ASAP. Thank you for your hard work on this very useful software.
@GlennCreighton -
where tasks hang out in Ready status, sometimes for quite a long time, when running suites with many hundreds of tasks.
Can you confirm that "quite a long time" is still finite, i.e. the tasks do eventually leave the ready state and submit their jobs? If so, this almost certainly just means that you need more "suite host" resources - either bigger VMs or more of them. Check by watching the system load on your suite host VM(s) when this happens.
Job submissions (and event handlers and xtriggers, if you have many of those) get actioned by Python sub-processes, and by default we allow a maximum of 4 of those at once. The "ready" state mostly means queued to this process pool, so if tasks get added to the queue faster than the pool can execute them, you will see tasks stuck as "ready" for a while. If the system load is not too high you might get away with simply increasing "process pool size" in global config.
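For reference, this is roughly what I mean in a Cylc 7 global.rc (a sketch only: the values are illustrative and the item names/defaults should be checked against the config reference for your installed version):

```
# Cylc 7 global.rc (user or site global config; the path depends on your installation)
# Item names and defaults assumed from the Cylc 7 docs -- verify against your version.
process pool size = 8         # default is 4; each slot is one Python sub-process
process pool timeout = PT10M  # how long a pool command may run before it is killed
```

Each extra slot is a real Python sub-process, so scale this with the host's spare cores rather than with the number of tasks in the suite.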
Also, beware of event handlers or xtriggers that take a long time to execute as they will tie up a process pool slot until they finish.
@hjoliver, thank you for your suggestion. Yes, I can confirm that the jobs do submit after a while. It has taken upwards of 30 minutes at times. I will definitely check that option out. Thank you for teaching me how it works!
you might get away with simply increasing "process pool size" in global config.
We do the same thing at NIWA, and were prompted to do so after switching a suite from built-in polling tasks to the xtrigger alternative.
Good to know, I am going to try this out. Is there a good rule of thumb here for what to set this to, or is it basically trial and error? If the host is otherwise idle, I'm assuming one would want the max process pool size set to one process per core, maybe leaving a couple of cores free to avoid crowding things. Does that sound reasonable?
I think number of cores is a good guess. It's not entirely trial and error: you can monitor (e.g. with top, or something nicer like htop) system processes when problems like this occur, and it should be pretty clear if the system load is high and the process pools of affected suites are constantly maxed out. It would also be good to get an estimate of the number of job submissions, xtriggers, and event handlers being executed at once, bearing in mind each invocation is a whole Python sub-process (however, simultaneous job submissions are batched to reduce that load).
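For example, a rough way to check on the suite host while tasks are piling up in "ready" (assuming a Linux host; the grep pattern is only an approximation of what the relevant processes look like):

```
# system load at a glance
uptime

# live view of your own processes (htop is a nicer alternative if installed)
top -u "$USER"

# list cylc-related processes with elapsed time and CPU; if the number of
# sub-processes hovers at the configured "process pool size" whenever tasks
# are sitting in "ready", the pool is the bottleneck
ps -u "$USER" -o pid,etime,pcpu,args | grep '[c]ylc'
```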
Update: we should have a way of alerting users to a log-jammed process pool, which is probably the cause of this.
One thing that might factor into this problem is the time it takes to submit to the batch scheduler. Our SAs have pre-scripts that run before the job is actually queued (to ensure no hanging jobs on the reserved nodes, etc.), which can take a bit longer than average to return a submitted status, thus holding up the pool. Unfortunately, my tests with an increased pool size did not seem to mitigate this very much in the past. We may need to reconsider cutting down the number of tasks.
Update: 2022
The job submission pipeline in Cylc 8 is quite different to Cylc 7's; however, the subprocess pool remains the same.
I think we have (probably) recently eliminated the last way to get properly "stuck in ready", but just in case, we could allow (say) cylc trigger to reset a ready task back to the beginning of the prep/submit pipeline? Being "delayed in ready", even for a long time, due to a rammed process pool is not really a bug, but could we add better diagnostics to show the user what is happening in that case?
There shouldn't be any way for tasks to get stuck in the preparing state, so short of a fresh bug report (against Cylc 8) I don't think there's anything for us to do here.
but just in case, we could allow (say) cylc trigger to reset a ready task back to the beginning of the prep/submit pipeline?
The simplest workaround is to restart the workflow; this will force all preparing tasks back through the pipeline.
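Concretely, something along these lines at the command line (Cylc 8, with `my_workflow` as a placeholder name):

```
# stop the running scheduler (see `cylc stop --help` for the available stop modes)
cylc stop my_workflow

# restart it from where it left off; tasks that were "preparing" will go back
# through the submission pipeline
cylc play my_workflow
```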
Agreed, a rammed process pool is not a bug, it's a feature: the pool limit you configured is controlling resource usage.
but could we add better diagnostics to show the user what is happening in that case?
How about logging a warning if a command remains queued for longer than a configured timeout? If we do #5017 this could be done in conjunction.
the pool limit you configured is controlling resource usage.
(Or the default limit that you didn't even know about, to be fair :grin: )
How about logging a warning if a command remains queued for longer than a configured timeout? If we do https://github.com/cylc/cylc-flow/issues/5017 this could be done in conjunction.
Seems reasonable.
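To make that concrete, a purely hypothetical sketch of how such a knob might look; the timeout item below does not exist in Cylc and is invented only for discussion, and the `[scheduler]` placement of the existing pool setting is assumed from the Cylc 8 global config:

```
# global.cylc -- HYPOTHETICAL sketch for discussion only
[scheduler]
    process pool size = 4    # existing item (placement assumed): max concurrent pool sub-processes
    # invented item, does not exist in Cylc: warn in the scheduler log if a
    # queued pool command has waited longer than this for a free slot
    process pool queue warning timeout = PT5M
```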
I thought we should have eliminated this problem. We have in place:
But annoyingly, we still have situations where tasks are stuck in the ready state! Some ideas:
Related: #2699, #2964, #2999.