Closed: hjoliver closed this 2 months ago
tasks in the preparing state at reload time will submit multiple times.
Yikes!
It's something to do with the fact that we now wait for preparing tasks to submit before actioning the requested reload (which I think is new-ish)
It is new-ish, but I would have thought that this change would have made reload safer in this regard because there can't be any preparing tasks at the time of the reload?
I haven't grokked the root cause well enough to see if we can prevent the queue attempt in the first place.
Could definitely do with pinning this down, as it's likely an interaction of an internal list or task state with some other part of the system, which could potentially spring a leak under other circumstances too?
It is new-ish, but I would have thought that this change would have made reload safer in this regard because there can't be any preparing tasks at the time of the reload?
That was definitely the intention, but whilst waiting for the preparing tasks to submit we repeatedly add the same job-submit commands to the process pool command queue.
[Note to self for tomorrow: my fix here might not be good enough, if batch job submission is not deterministic in terms of batch membership, during the pre_prep period...]
but whilst waiting for the preparing tasks to submit we repeatedly add the same job-submit commands to the process pool command queue.
Oh heck, whilst!
The cylc.flow.commands.reload_workflow routine contains a mini main-loop with the small subset of functionality required to flush preparing tasks through, contained within the while schd.release_queued_tasks(): loop. I wonder if there's some other bit of logic required that is not contained within this loop?
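For context, a rough sketch of what that mini main-loop looks like. The loop condition and the subprocess pool come from the discussion above; the body shown here is an assumption, not the actual cylc.flow.commands code.

import asyncio

async def flush_preparing_tasks(schd):
    """Sketch: run just enough of the main loop to push preparing tasks
    through job submission before actioning the reload.

    `schd` is the running Scheduler; the exact set of calls inside the
    loop is assumed rather than copied from cylc-flow.
    """
    while schd.release_queued_tasks():
        # let queued job-submit commands in the subprocess pool run
        schd.proc_pool.process()
        # yield to the event loop between passes (interval is a guess)
        await asyncio.sleep(1)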
This seems to fix it:
diff --git a/cylc/flow/scheduler.py b/cylc/flow/scheduler.py
index 92702b0b5..9648b025a 100644
--- a/cylc/flow/scheduler.py
+++ b/cylc/flow/scheduler.py
@@ -1245,7 +1245,7 @@ class Scheduler:
 # don't release queued tasks, finish processing preparing tasks
 pre_prep_tasks = [
     itask for itask in self.pool.get_tasks()
-    if itask.state(TASK_STATUS_PREPARING)
+    if itask.waiting_on_job_prep
 ]
 # Return, if no tasks to submit.
We use the waiting_on_job_prep flag rather than the preparing status in the task_pool for the release-from-queue side of this.
Yes!! I was homing in on that flag myself. I think that's got it.
Nope, damn it. That fixes duplicate job submissions, but it does so by no longer waiting for preparing tasks to clear before doing the reload. The thing is, .waiting_on_job_prep is just a subset of the preparing task state.
I guess the question is, do we need to wait for preparing tasks to clear, or just for those with .waiting_on_job_prep? If the latter, then I just need to fix an integration test that is really testing the former. Investigating...
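To make the distinction concrete, the two checks being compared look roughly like this. TASK_STATUS_PREPARING and waiting_on_job_prep are taken from the diff above; the helper functions themselves are illustrative, not cylc-flow internals.

from cylc.flow.task_state import TASK_STATUS_PREPARING

def preparing_tasks(pool):
    """All tasks currently in the preparing state (the original check)."""
    return [
        itask for itask in pool.get_tasks()
        if itask.state(TASK_STATUS_PREPARING)
    ]

def pre_prep_tasks(pool):
    """Only those still waiting on job prep (the suggested check); per the
    discussion above this is a subset of the preparing tasks."""
    return [
        itask for itask in pool.get_tasks()
        if itask.waiting_on_job_prep
    ]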
I guess the question is, do we need to wait for preparing tasks to clear
Yes, this shifted other bugs.
I guess the question is, do we need to wait for preparing tasks to clear
Yes, this shifted other bugs.
Yes, but I'm asking if those bugs were solely due to preparing (waiting_on_job_prep), or preparing more generally.
In the case of the auto-restart functionality, it's preparing more generally.
The auto-restart functionality will restart the workflow on another host. Because of this, we must wait for all localhost task submissions to complete first because the localhost platform will be different on the new host.
So @oliver-sanders - just to make sure we're on the same page here, and so I can hopefully solve this problem tomorrow:
Your suggestion above to use the itask.waiting_on_job_prep flag, i.e.:
We use the waiting_on_job_prep flag rather than the preparing status in the task_pool for the release-from-queue side of this.
seems to be at odds with what you've just said (we have to wait for ALL preparing tasks to clear, not just those waiting on job prep):
In the case of the auto-restart functionality, it's preparing more generally.
Assuming the latter comment overrides the former, I take it you now think we need to keep the original code in the scheduler module, and come up with a different solution to prevent the duplicate submissions?
Sorry, I wasn't proposing it as a fix (else I would have opened a PR), just pointing out that it seemed to fix the example in the issue.
OK got it.
I resorted to a functional test as I wasn't sure how to do better in the time available today.
I can't figure out the one seemingly repeatable functional test failure on this PR: tests/f/triggering/08-family-finish-any.
Pretty sure it's unrelated. It only fails in the macOS CI run. It passes in the Linux run, and it passes locally on macOS for me.
triggering is NOT consistent with the reference log:
--- reference
+++ this run
-1/foo -triggered off ['1/b']
+1/foo -triggered off ['1/a', '1/b', '1/c']
The test workflow graph is:
R1 = """FAM:finish-any => foo""" # FAM is a, b, c
Task b has script = true, and a, c have script = sleep 10, but in the test run b takes exactly 10 seconds to run just like the other two tasks, so foo triggers off of them all at once, hence the result. I downloaded the test artifact, and b's job logs confirm the 10 sec run time and that the scripting is just true. 🤯
[UPDATE] Damn it, kicking the macOS test batch for a third time worked. Well, that was a complete waste of time. I'll leave the above comment in just in case it indicates that the test is fundamentally flaky though. (Maybe by coincidence the system load was such that the "fast" task took exactly 10 seconds too...)
Close #6344 - tasks in the preparing state at reload time will submit multiple times.

~My tentative fix prevents queuing a command to the subprocess pool if the same command is already queued or running. This works, but I haven't grokked the root cause well enough to see if we can prevent the queue attempt in the first place.~
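For illustration, a minimal, self-contained sketch of the de-duplication idea described in the struck-out note above. It mirrors the behaviour only; it is not the cylc-flow SubProcPool implementation.

def queue_command_once(cmd, queued, running):
    """Append cmd to queued unless an identical command is already queued
    or running (cmd is a command line as a list of strings; queued and
    running stand in for the subprocess pool state)."""
    if cmd in queued or cmd in running:
        return False  # identical command already pending: drop the duplicate
    queued.append(cmd)
    return True

# A repeated attempt to queue the same job-submit command is dropped.
queued, running = [], []
assert queue_command_once(["cylc", "jobs-submit", "..."], queued, running)
assert not queue_command_once(["cylc", "jobs-submit", "..."], queued, running)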
It's something to do with the fact that we now wait for preparing tasks to submit before actioning the requested reload (which I think is new-ish) ... and maybe how that interacts with the evil "pre-prep" task list.

Check List
- I have read CONTRIBUTING.md and added my name as a Code Contributor.
- Applied any dependency changes to setup.cfg (and conda-environment.yml if present).
- If this is a bug fix, PR raised against the relevant ?.?.x branch.