
two instances of the same task running in parallel #6329

oliver-sanders closed this issue 1 month ago

oliver-sanders commented 1 month ago

A bug where two instances of the same task have been observed running at the same time.

Whilst investigating https://github.com/cylc/cylc-flow/issues/6315, I discovered an example where there were two parallel active submissions of the same task. This doesn't appear to be the cause of #6315 as it has been observed in the absence of this issue.

State Transitions

$ grep '20190331T0000Z/archive_logs' log/scheduler/02-start-01.log
2024-07-29T08:46:51Z INFO - [20190331T0000Z/archive_logs:waiting] => waiting(queued)
2024-07-29T08:46:51Z INFO - [20190331T0000Z/archive_logs:waiting(queued)] => waiting
2024-07-29T08:46:54Z INFO - [20190331T0000Z/archive_logs:waiting] => preparing
2024-07-29T08:47:06Z INFO - [20190331T0000Z/archive_logs/01:preparing] submitted to xcf:pbs[7678629]
2024-07-29T08:47:06Z INFO - [20190331T0000Z/archive_logs/01:preparing] => submitted
2024-07-29T08:47:07Z INFO - [20190331T0000Z/archive_logs/01:submitted] => running
2024-07-29T08:47:19Z INFO - [20190331T0000Z/archive_logs/02:running] => preparing
2024-07-29T08:47:28Z INFO - [20190331T0000Z/archive_logs/02:preparing] submitted to xcf:pbs[7678729]
2024-07-29T08:47:28Z INFO - [20190331T0000Z/archive_logs/02:preparing] => submitted
2024-07-29T08:47:36Z INFO - [20190331T0000Z/archive_logs/02:submitted] => running
2024-07-29T08:47:54Z WARNING - [20190331T0000Z/archive_logs/02:running] (received-ignored)succeeded for job(01) != job(02)
2024-07-29T08:48:00Z INFO - [20190331T0000Z/archive_logs/02:running] => succeeded
2024-07-29T08:48:00Z INFO - [20190331T0000Z/archive_logs/02:running(runahead)] => running
2024-07-29T09:10:27Z INFO - [20190331T0000Z/archive_logs/02:running] => running(held)

This line is particularly concerning:

INFO - [20190331T0000Z/archive_logs/02:running] => preparing

It is followed by this line, which provides some evidence of the two parallel tasks:

WARNING - [20190331T0000Z/archive_logs/02:running] (received-ignored)succeeded for job(01) != job(02)

A task should never be able to regress from running to preparing, right?!

Looking through the log for this workflow, there are 17 instances of the first message and 11 of the second. These messages relate to various tasks, but they always involve even-numbered jobs (02, 04, 06), never odd-numbered ones.
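Counts like these can be obtained with greps along the following lines (the patterns are indicative shorthand for the two messages, not necessarily the exact ones used):

$ grep -c ':running] => preparing' log/scheduler/02-start-01.log
$ grep -c 'received-ignored' log/scheduler/02-start-01.log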

Job Timings

For further evidence of the parallel jobs we can inspect the timings in the remote logs:

$ grep 'INFO -' log/job/20190331T0000Z/archive_logs/*/job.out
log/job/20190331T0000Z/archive_logs/01/job.out:2024-07-29T08:47:06Z INFO - started
log/job/20190331T0000Z/archive_logs/01/job.out:2024-07-29T08:47:52Z INFO - succeeded
log/job/20190331T0000Z/archive_logs/02/job.out:2024-07-29T08:47:33Z INFO - started
log/job/20190331T0000Z/archive_logs/02/job.out:2024-07-29T08:47:59Z INFO - succeeded

Job 02 started at 08:47:33Z, 19 seconds before job 01 succeeded at 08:47:52Z, so the two jobs really were running at the same time. The task was triggered (along with many others) at 2024-07-29T08:46:51Z via cylc trigger '20190331T0000Z/*'.

Reproducible(ish) example

See https://github.com/cylc/cylc-flow/issues/6329#issuecomment-2312370587

oliver-sanders commented 1 month ago

Have managed to reproduce using this example:

[scheduler]
    allow implicit tasks = True

[task parameters]
    a = 1..3
    b = 1..50

[scheduling]
    [[queues]]
        [[[default]]]
            limit = 40
    [[graph]]
        R1 = """
            <a> => <b>
        """

[runtime]
    [[<a=1>]]
        script = """
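            # fail the first submission only; retriggered jobs (submit number >= 2) succeed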
            if [[ $CYLC_TASK_SUBMIT_NUMBER -lt 2 ]]; then
                false
            fi
        """

When you run this example, one of the first three tasks will fail.

Unstick the workflow by triggering all tasks.

# start the workflow
cylc vip .

# wait for the first task to fail

# trigger all tasks
cylc trigger '<wid>//*'

Then look for evidence of parallel submissions in the workflow log:

$ cylc cat-log tmp.zu3j5ygsks | grep 'succeeded for job'
2024-08-27T12:45:19+01:00 WARNING - [1/_b28/02:preparing] (received-ignored)succeeded for job(01) != job(02)
2024-08-27T12:45:19+01:00 WARNING - [1/_b24/02:preparing] (received-ignored)succeeded for job(01) != job(02)
2024-08-27T12:45:19+01:00 WARNING - [1/_b30/02:preparing] (received-ignored)succeeded for job(01) != job(02)
2024-08-27T12:45:19+01:00 WARNING - [1/_b26/02:preparing] (received-ignored)succeeded for job(01) != job(02)
2024-08-27T12:45:19+01:00 WARNING - [1/_b35/02:preparing] (received-ignored)succeeded for job(01) != job(02)
2024-08-27T12:45:19+01:00 WARNING - [1/_b41/02:preparing] (received-ignored)succeeded for job(01) != job(02)
2024-08-27T12:45:19+01:00 WARNING - [1/_b37/02:preparing] (received-ignored)succeeded for job(01) != job(02)
2024-08-27T12:45:19+01:00 WARNING - [1/_b27/02:preparing] (received-ignored)succeeded for job(01) != job(02)

I think it's the triggering of large numbers of tasks that leads to this bug.

...

Confirmed! Manually triggered tasks are being released via task_pool.release_queued_tasks causing duplicate submissions.
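To see both submissions for one of the affected tasks, its full history can be pulled out of the scheduler log, e.g. (using one of the tasks from the output above):

$ cylc cat-log <wid> | grep '1/_b28'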

hjoliver commented 1 month ago

Confirmed! Manually triggered tasks are being released via task_pool.release_queued_tasks causing duplicate submissions.

Confirmed here too.

oliver-sanders commented 1 month ago

It looks like the same task is coming off the queue twice. Could this be a queue-side bug?

hjoliver commented 1 month ago

That was difficult to track down, but I got it in the end. It's sort-of a queue bug, but not quite - see the PR.

ColemanTom commented 1 month ago

Just adding this here; I'm not sure if it's related or not, but someone at my workplace reported this on Friday evening.

2024-08-30T07:09:38Z INFO - [20231130T1100Z/aa_um_recon_host:waiting(runahead)] => waiting
2024-08-30T07:09:38Z INFO - [20231130T1100Z/aa_um_recon_host:waiting] => waiting(queued)
2024-08-30T07:09:39Z INFO - [20231130T1100Z/aa_um_recon_host:waiting(queued)] => waiting
2024-08-30T07:09:39Z INFO - [20231130T1100Z/aa_um_recon_host:waiting] => preparing
2024-08-30T07:09:43Z INFO - [20231130T1100Z/aa_um_recon_host/01:preparing] submitted to PLATFORM[1]
2024-08-30T07:09:43Z INFO - [20231130T1100Z/aa_um_recon_host/01:preparing] => submitted
2024-08-30T07:09:43Z INFO - [20231130T1100Z/aa_um_recon_host/01:submitted] submitted to PLATFORM[2]
2024-08-30T07:09:45Z INFO - [20231130T1100Z/aa_um_recon_host/01:submitted] reloaded task definition
2024-08-30T07:09:45Z WARNING - [20231130T1100Z/aa_um_recon_host/01:submitted] active with pre-reload settings
2024-08-30T07:09:45Z INFO - [20231130T1100Z/aa_um_recon_host/01:submitted] submitted to PLATFORM[3]
2024-08-30T07:09:45Z INFO - [20231130T1100Z/aa_um_recon_host/01:submitted] submitted to PLATFORM[4]
2024-08-30T07:09:56Z INFO - [20231130T1100Z/aa_um_recon_host/01:submitted] => running

Four submissions in the space of 3 seconds. You can see she was reloading the workflow, if that has any impact. I think this is Cylc 8.2.

sjrennie commented 1 month ago

Thanks Tom. Actually, the issue I had was in 8.3.3.

ColemanTom commented 1 month ago

I stand corrected: 8.3.3. Also, as @sjrennie pointed out, all of hers were submitted as job 01.

hjoliver commented 1 month ago

Hmm, this looks different, I think. There's no resetting of submitted or running states back to preparing as was the case in this issue. Are those PLATFORM placeholders literally 4 different platforms?

sjrennie commented 1 month ago

No, same platform, different PBS IDs.

hjoliver commented 1 month ago

So you can confirm those were all real job submissions?

sjrennie commented 1 month ago

Yes, PBS thought so! I could qstat them all, they all reported back to job.status, and they ran for 20 minutes before I decided to kill them, since they wouldn't have survived all trying to write to the same file at the end of the job.

hjoliver commented 1 month ago

OK, is that a grepped subset of the log? It might be useful to see the whole thing (you can email the relevant excerpt to me if it's large) to see exactly where the reload occurred.

sjrennie commented 1 month ago

Here is the exact bit where it started to go wrong. I can extract a larger bit to send to you if necessary:

2024-08-30T07:09:39Z INFO - [20231130T1100Z/aa_um_recon_host:waiting(queued)] => waiting
2024-08-30T07:09:39Z INFO - [20231130T1100Z/aa_um_recon_host:waiting] => preparing
2024-08-30T07:09:40Z INFO - Command "reload_workflow" received. ID=5430acf1-4fd7-40d4-b040-ff0fa20e3c68 reload_workflow()
2024-08-30T07:09:40Z INFO - Pausing the workflow: Reloading workflow
2024-08-30T07:09:43Z INFO - [20231130T1100Z/aa_um_recon_host/01:preparing] submitted to PLATFORM[1]
2024-08-30T07:09:43Z INFO - [20231130T1100Z/aa_um_recon_host/01:preparing] => submitted
2024-08-30T07:09:43Z INFO - [20231130T1100Z/aa_um_recon_host/01:submitted] submitted to PLATFORM[2]

ColemanTom commented 1 month ago

The [1], [2], etc. are just PBS job IDs. I didn't really need to mask them out. So the 1, 2, 3, 4 were really 6872518, 6872521, 6872525, 6872527.

hjoliver commented 1 month ago

OK thanks for the more detailed log. Maybe it's something to do with reloading while the task is in the preparing state. I'll see if I can reproduce this...
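For example, something along these lines might catch the preparing window (an untested sketch, not necessarily how the reproduction will end up working):

# start a workflow that submits a batch of tasks at once
cylc vip .

# fire off reloads while those tasks are passing through the preparing state
for i in $(seq 5); do
    cylc reload '<wid>'
    sleep 1
done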

hjoliver commented 1 month ago

A-ha, reproduced - thanks for the bug report! I'll open a new issue for this.

hjoliver commented 1 month ago

Closed by #6337

hjoliver commented 1 month ago

@ColemanTom @sjrennie - I backed out of my first attempt at a fix for your new duplicate job submission, but the real thing is up now: #6345