cylc / cylc-flow

Cylc: a workflow engine for cycling systems.
https://cylc.github.io
GNU General Public License v3.0

Fix duplicate job submissions on reload. #6345

Closed hjoliver closed 2 months ago

hjoliver commented 2 months ago

Close #6344 - tasks in the preparing state at reload time will submit multiple times.

~My tentative fix prevents queuing a command to the subprocess pool if the same command is already queued or running. This works, but I haven't grokked the root cause well enough to see if we can prevent the queue attempt in the first place.~

It's something to do with the fact that we now wait for preparing tasks to submit before actioning the requested reload (which I think is new-ish) ... and maybe how that interacts with the evil "pre-prep" task list.
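For illustration, the (since struck-out) tentative fix amounts to de-duplicating commands before they are queued: skip queuing a command to the subprocess pool if an identical command is already queued or running. A minimal standalone sketch of that idea (names are illustrative, not the actual cylc-flow subprocess pool code):

```python
# Toy command pool that refuses duplicate commands (illustrative only).
from collections import deque


class CommandPool:
    """Refuse to queue a command that is already queued or running."""

    def __init__(self):
        self.queued = deque()   # commands waiting to run
        self.running = set()    # commands currently running

    def queue_command(self, cmd: tuple) -> bool:
        """Queue cmd unless an identical command is queued or running."""
        if cmd in self.queued or cmd in self.running:
            return False  # duplicate - drop it
        self.queued.append(cmd)
        return True


pool = CommandPool()
cmd = ('cylc', 'jobs-submit', '1/foo/01')
assert pool.queue_command(cmd) is True    # first attempt is queued
assert pool.queue_command(cmd) is False   # repeat attempt is rejected
```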

Check List

oliver-sanders commented 2 months ago

tasks in the preparing state at reload time will submit multiple times.

Yikes!

It's something to do with the fact that we now wait for preparing tasks to submit before actioning the requested reload (which I think is new-ish)

It is new-ish, but I would have thought that this change would have made reload safer in this regard because there can't be any preparing tasks at the time of the reload?

I haven't grokked the root cause well enough to see if we can prevent the queue attempt in the first place.

Could definitely do with pinning this down, as it's likely an interaction between an internal list or task state and some other part of the system, which could potentially spring a leak under other circumstances too.

hjoliver commented 2 months ago

It is new-ish, but I would have thought that this change would have made reload safer in this regard because there can't be any preparing tasks at the time of the reload?

That was definitely the intention, but whilst waiting for the preparing tasks to submit we repeatedly add the same job-submit commands to the process pool command queue.

[Note to self for tomorrow: my fix here might not be good enough, if batch job submission is not deterministic in terms of batch membership, during the pre_prep period...]
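A toy illustration of the failure mode being described (hypothetical names, not cylc-flow code): each pass of the wait loop rebuilds the pre-prep list from the same still-preparing tasks and queues their job-submit commands again.

```python
# Illustrative sketch: without de-duplication (or a narrower filter), the
# same submit command is appended to the command queue once per pass.


def release_pass(task_pool, command_queue):
    """One pass of a wait loop that (wrongly) re-queues preparing tasks."""
    pre_prep_tasks = [t for t in task_pool if t['state'] == 'preparing']
    for task in pre_prep_tasks:
        command_queue.append(('jobs-submit', task['id']))
    return bool(pre_prep_tasks)


task_pool = [{'id': '1/foo', 'state': 'preparing'}]
command_queue = []
# Two passes of the wait loop before the task finally submits ...
release_pass(task_pool, command_queue)
release_pass(task_pool, command_queue)
print(command_queue)  # [('jobs-submit', '1/foo'), ('jobs-submit', '1/foo')]
```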

oliver-sanders commented 2 months ago

but whilst waiting for the preparing tasks to submit we repeatedly add the same job-submit commands to the process pool command queue.

Oh heck, whilst!

The cylc.flow.commands.reload_workflow routine contains a mini main-loop, the while schd.release_queued_tasks(): loop, which runs the small subset of functionality required to flush preparing tasks through. I wonder if there's some other bit of logic required here, but not included within this loop?
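For reference, a self-contained toy of the loop shape being described, assuming only the release_queued_tasks name mentioned in this thread; everything else is illustrative, not the real cylc.flow.commands code.

```python
# Hypothetical sketch: keep releasing/flushing preparing tasks until none
# remain, then action the reload.
import asyncio


class ToyScheduler:
    def __init__(self):
        # tasks still working through job preparation/submission
        self.preparing = ['1/foo', '1/bar']

    def release_queued_tasks(self) -> bool:
        """Submit one preparing task per call; True while any remain."""
        if self.preparing:
            print('submitting', self.preparing.pop(0))
        return bool(self.preparing)


async def reload_workflow(schd: ToyScheduler):
    # Flush preparing tasks through submission before actioning the reload.
    while schd.release_queued_tasks():
        await asyncio.sleep(0)  # stand-in for one pass of main-loop processing
    print('all preparing tasks cleared - reloading now')


asyncio.run(reload_workflow(ToyScheduler()))
```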

oliver-sanders commented 2 months ago

This seems to fix it:

diff --git a/cylc/flow/scheduler.py b/cylc/flow/scheduler.py
index 92702b0b5..9648b025a 100644
--- a/cylc/flow/scheduler.py
+++ b/cylc/flow/scheduler.py
@@ -1245,7 +1245,7 @@ class Scheduler:
             # don't release queued tasks, finish processing preparing tasks
             pre_prep_tasks = [
                 itask for itask in self.pool.get_tasks()
-                if itask.state(TASK_STATUS_PREPARING)
+                if itask.waiting_on_job_prep
             ]

         # Return, if no tasks to submit.

We use the waiting_on_job_prep flag rather than the preparing status in the task_pool for the release-from-queue side of this.

hjoliver commented 2 months ago

Yes!! I was homing in on that flag myself. I think that's got it.

Nope, damn it. That fixes duplicate job submissions, but it does so by no longer waiting for preparing tasks to clear before doing the reload. The thing is, .waiting_on_job_prep is just a subset of the preparing task state.

I guess the question is, do we need to wait for preparing tasks to clear, or just for those with .waiting_on_job_prep? If the latter, then I just need to fix an integration test that is really testing the former. Investigating...
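To make the subset relationship concrete, a toy example (the ToyTask class is hypothetical; only the field names come from this thread): every task waiting on job prep is preparing, but a preparing task whose job prep has finished (submission in flight) is no longer waiting on it.

```python
from dataclasses import dataclass


@dataclass
class ToyTask:
    id: str
    state: str                     # e.g. 'waiting', 'preparing', 'submitted'
    waiting_on_job_prep: bool


pool = [
    ToyTask('1/a', 'preparing', waiting_on_job_prep=True),
    ToyTask('1/b', 'preparing', waiting_on_job_prep=False),  # prep done, submit in flight
    ToyTask('1/c', 'submitted', waiting_on_job_prep=False),
]

preparing = [t.id for t in pool if t.state == 'preparing']
pre_prep = [t.id for t in pool if t.waiting_on_job_prep]

print(preparing)  # ['1/a', '1/b'] - what the original reload wait checked
print(pre_prep)   # ['1/a']        - what the patched filter selects
```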

oliver-sanders commented 2 months ago

I guess the question is, do we need to wait for preparing tasks to clear

Yes, this shifted other bugs.

hjoliver commented 2 months ago

I guess the question is, do we need to wait for preparing tasks to clear

Yes, this shifted other bugs.

Yes, but I'm asking whether those bugs were solely due to tasks waiting on job prep (waiting_on_job_prep), or to preparing more generally.

oliver-sanders commented 2 months ago

In the case of the auto-restart functionality, it's preparing more generally.

The auto-restart functionality will restart the workflow on another host. Because of this, we must wait for all localhost task submissions to complete first because the localhost platform will be different on the new host.
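A purely illustrative guard showing the broader condition auto-restart needs (not the actual cylc auto-restart code): before the workflow migrates to a new host, no task may be anywhere in the preparing pipeline, not just those still waiting on job prep.

```python
# Hypothetical sketch: only migrate once nothing is still preparing, since
# in-flight "localhost" submissions would be tied to the wrong host after
# the restart.


def safe_to_migrate(task_pool) -> bool:
    """True only when no task is anywhere in the preparing pipeline."""
    return not any(t['state'] == 'preparing' for t in task_pool)


pool = [
    {'id': '1/a', 'state': 'preparing'},   # submission still in flight on localhost
    {'id': '1/b', 'state': 'submitted'},
]
print(safe_to_migrate(pool))  # False - must wait before restarting elsewhere
```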

hjoliver commented 2 months ago

So @oliver-sanders - just to make sure we're on the same page here, and so I can hopefully solve this problem tomorrow:

Your suggestion above to use the itask.waiting_on_job_prep flag, i.e.:

We use the waiting_on_job_prep flag rather than the preparing status in the task_pool for the release-from-queue side of this.

seems to be at odds with what you've just said (we have to wait for ALL preparing tasks to clear, not just those waiting on job prep):

In the case of the auto-restart functionality, it's preparing more generally.

Assuming the latter comment overrides the former, I take it you now think we need to keep the original code in the scheduler module, and come up with a different solution to prevent the duplicate submissions?

oliver-sanders commented 2 months ago

Sorry, I wasn't proposing it as a fix (else I would have opened a PR), just pointing out that it seemed to fix the example in the issue.

hjoliver commented 2 months ago

OK got it.

I resorted to a functional test as I wasn't sure how to do better in the time available today.

hjoliver commented 2 months ago

I can't figure out the one seemingly repeatable functional test failure on this PR: tests/f/triggering/08-family-finish-any.

Pretty sure it's unrelated. It only fails in the macos CI run. It passes in the Linux run, and it passes locally on macos for me.

    triggering is NOT consistent with the reference log:
    --- reference
    +++ this run
    -1/foo -triggered off ['1/b']
    +1/foo -triggered off ['1/a', '1/b', '1/c']

The test workflow graph is:

   R1 = """FAM:finish-any => foo"""  # FAM is a, b, c

Task b has script = true, and a, c have script = sleep 10, but in the test run b takes exactly 10 seconds to run just like the other two tasks, so foo triggers off of them all at once, hence the result. I downloaded the test artifact, and b's job logs confirm the 10 second run time and that its script is just true. 🤯

[UPDATE] damn it, kicking the macos test batch for a third time worked. Well that was a complete waste of time. I'll leave the above comment in just in case it indicates that the test is fundamentally flaky though. (Maybe by coincidence the system load was such that the "fast" task took exactly 10 seconds too...)