Closed: oliver-sanders closed this issue 1 month ago
I have managed to reproduce this using the following example:
```
[scheduler]
    allow implicit tasks = True

[task parameters]
    a = 1..3
    b = 1..50

[scheduling]
    [[queues]]
        [[[default]]]
            limit = 40
    [[graph]]
        R1 = """
            <a> => <b>
        """

[runtime]
    [[<a=1>]]
        script = """
            if [[ $CYLC_TASK_SUBMIT_NUMBER -lt 2 ]]; then
                false
            fi
        """
```
When you run this example, one of the first three tasks will fail. Unstick the workflow by triggering all tasks:
```bash
# start the workflow
cylc vip .

# wait for the first task to fail

# trigger all tasks
cylc trigger '<wid>//*'
```
Then look for evidence of parallel submissions in the workflow log:
```
$ cylc cat-log tmp.zu3j5ygsks | grep 'succeeded for job'
2024-08-27T12:45:19+01:00 WARNING - [1/_b28/02:preparing] (received-ignored)succeeded for job(01) != job(02)
2024-08-27T12:45:19+01:00 WARNING - [1/_b24/02:preparing] (received-ignored)succeeded for job(01) != job(02)
2024-08-27T12:45:19+01:00 WARNING - [1/_b30/02:preparing] (received-ignored)succeeded for job(01) != job(02)
2024-08-27T12:45:19+01:00 WARNING - [1/_b26/02:preparing] (received-ignored)succeeded for job(01) != job(02)
2024-08-27T12:45:19+01:00 WARNING - [1/_b35/02:preparing] (received-ignored)succeeded for job(01) != job(02)
2024-08-27T12:45:19+01:00 WARNING - [1/_b41/02:preparing] (received-ignored)succeeded for job(01) != job(02)
2024-08-27T12:45:19+01:00 WARNING - [1/_b37/02:preparing] (received-ignored)succeeded for job(01) != job(02)
2024-08-27T12:45:19+01:00 WARNING - [1/_b27/02:preparing] (received-ignored)succeeded for job(01) != job(02)
```
I think it's the triggering of large numbers of tasks that leads to this bug.
...
Confirmed! Manually triggered tasks are being released via `task_pool.release_queued_tasks`, causing duplicate submissions.
Confirmed here too.
It looks like the same task is coming off the queue twice; could this be a queue-side bug?
That was difficult to track down, but I got it in the end. It's sort of a queue bug, but not quite - see the PR.
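For anyone following along, here is a minimal sketch of the failure mode as I understand it (illustrative Python only; the class and function names are hypothetical, not the actual cylc-flow internals): if a manual trigger submits a task without taking it off the limited queue, the scheduler's next release pass will submit the same task again.

```python
# Minimal sketch of the suspected double-release. NOT real cylc-flow code;
# all names here are hypothetical.
from collections import deque


class LimitedQueue:
    """A task queue that keeps at most `limit` tasks active at once."""

    def __init__(self, limit):
        self.limit = limit
        self.queued = deque()
        self.active = set()

    def push(self, task):
        self.queued.append(task)

    def release(self):
        """The normal scheduler pass: release queued tasks up to the limit."""
        released = []
        while self.queued and len(self.active) < self.limit:
            task = self.queued.popleft()
            self.active.add(task)
            released.append(task)
        return released


def manual_trigger_buggy(queue, task):
    # BUG (sketch): the task is submitted here but left in `queue.queued`,
    # so the next release() pass will submit it a second time.
    return task


def manual_trigger_fixed(queue, task):
    # FIX (sketch): de-queue the task and mark it active before submitting,
    # so it cannot be released again.
    if task in queue.queued:
        queue.queued.remove(task)
    queue.active.add(task)
    return task


queue = LimitedQueue(limit=40)
queue.push('1/_b28')
manual_trigger_buggy(queue, '1/_b28')  # first submission
print(queue.release())                 # ['1/_b28']: the duplicate submission
```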
Just adding this here; I'm not sure if it's related or not, but someone at my workplace reported this on Friday evening.
```
2024-08-30T07:09:38Z INFO - [20231130T1100Z/aa_um_recon_host:waiting(runahead)] => waiting
2024-08-30T07:09:38Z INFO - [20231130T1100Z/aa_um_recon_host:waiting] => waiting(queued)
2024-08-30T07:09:39Z INFO - [20231130T1100Z/aa_um_recon_host:waiting(queued)] => waiting
2024-08-30T07:09:39Z INFO - [20231130T1100Z/aa_um_recon_host:waiting] => preparing
2024-08-30T07:09:43Z INFO - [20231130T1100Z/aa_um_recon_host/01:preparing] submitted to PLATFORM[1]
2024-08-30T07:09:43Z INFO - [20231130T1100Z/aa_um_recon_host/01:preparing] => submitted
2024-08-30T07:09:43Z INFO - [20231130T1100Z/aa_um_recon_host/01:submitted] submitted to PLATFORM[2]
2024-08-30T07:09:45Z INFO - [20231130T1100Z/aa_um_recon_host/01:submitted] reloaded task definition
2024-08-30T07:09:45Z WARNING - [20231130T1100Z/aa_um_recon_host/01:submitted] active with pre-reload settings
2024-08-30T07:09:45Z INFO - [20231130T1100Z/aa_um_recon_host/01:submitted] submitted to PLATFORM[3]
2024-08-30T07:09:45Z INFO - [20231130T1100Z/aa_um_recon_host/01:submitted] submitted to PLATFORM[4]
2024-08-30T07:09:56Z INFO - [20231130T1100Z/aa_um_recon_host/01:submitted] => running
```
Four submissions in the space of 3 seconds. You can see she was reloading the workflow, if that has any impact. I think this is Cylc 8.2.
Thanks Tom. Actually, it was 8.3.3 that I had the issue in.
I stand corrected: 8.3.3. Also, as @sjrennie pointed out, all of hers were submitted as job 01.
Hmm, this looks different, I think. There's no resetting of `submitted` or `running` states back to `preparing` as was the case in this issue. Are those PLATFORM placeholders literally 4 different platforms?
No, same platform, different PBS IDs.
So you can confirm those were all real job submissions?
Yes, PBS thought so! I could `qstat` them all, they all reported back to `job.status`, and they ran for 20 mins before I decided to kill them, since they wouldn't have survived all trying to write to the same file at once at the end of the job.
OK, is that a grepped subset of the log? It might be useful to see the whole thing (you can email the relevant excerpt to me if it's large) to see exactly where the reload occurred.
Here is the exact bit where it started to go wrong. I can extract a larger bit to send to you if necessary:

```
2024-08-30T07:09:39Z INFO - [20231130T1100Z/aa_um_recon_host:waiting(queued)] => waiting
2024-08-30T07:09:39Z INFO - [20231130T1100Z/aa_um_recon_host:waiting] => preparing
2024-08-30T07:09:40Z INFO - Command "reload_workflow" received. ID=5430acf1-4fd7-40d4-b040-ff0fa20e3c68
    reload_workflow()
2024-08-30T07:09:40Z INFO - Pausing the workflow: Reloading workflow
2024-08-30T07:09:43Z INFO - [20231130T1100Z/aa_um_recon_host/01:preparing] submitted to PLATFORM[1]
2024-08-30T07:09:43Z INFO - [20231130T1100Z/aa_um_recon_host/01:preparing] => submitted
2024-08-30T07:09:43Z INFO - [20231130T1100Z/aa_um_recon_host/01:submitted] submitted to PLATFORM[2]
```
The [1], [2], etc. are just PBS job IDs. I didn't really need to mask them out. So the 1, 2, 3, 4 were really 6872518, 6872521, 6872525, 6872527.
OK, thanks for the more detailed log. Maybe it's something to do with reloading while the task is in the `preparing` state. I'll see if I can reproduce this...
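If that is what's happening, the race might look something like the sketch below (hypothetical names, not the real reload code): the post-reload resume pass decides what to submit from task state alone, so a task whose submission was already in flight before the reload gets submitted again. That might also explain why all four jobs shared submit number 01.

```python
# Hypothetical sketch of the suspected reload race. NOT real cylc-flow code.
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    state: str
    submission_in_flight: bool = False  # hypothetical flag


def resume_after_reload_buggy(tasks, submit):
    for task in tasks:
        # BUG (sketch): a 'preparing' task may already have a submission in
        # flight from before the reload; submitting it again here produces a
        # duplicate job under the same submit number.
        if task.state == 'preparing':
            submit(task)


def resume_after_reload_fixed(tasks, submit):
    for task in tasks:
        # FIX (sketch): skip tasks whose submission was already handed off
        # before the reload paused the workflow.
        if task.state == 'preparing' and not task.submission_in_flight:
            submit(task)
```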
A-ha, reproduced - thanks for the bug report! I'll open a new issue for this.
Closed by #6337
@ColemanTom @sjrennie - I backed out of my first attempt at a fix for your new duplicate job submission, but the real thing is up now: #6345
A bug where two instances of the same task have been observed running at the same time.
Whilst investigating https://github.com/cylc/cylc-flow/issues/6315, I discovered an example where there were two parallel active submissions of the same task. This doesn't appear to be the cause of #6315, as that issue has been observed in the absence of this one.
## State Transitions
This line is particularly concerning:
Which is followed by this line providing some evidence of the two parallel tasks:
A task should never be able to regress from `running` to `preparing`, right?!
Looking through the log for this workflow, there are 17 instances of the first message and 11 of the second. These messages relate to various tasks, but they always involve even-numbered jobs (02, 04, 06), never odd-numbered ones.
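For what it's worth, the task states seen here normally only move forwards (waiting, then preparing, submitted, running), so a guard along these lines would catch the regression; this is a hypothetical sketch, not the cylc-flow implementation:

```python
# Hypothetical forward-only state guard. NOT the cylc-flow implementation;
# only the state names are taken from the logs above.
STATE_ORDER = ['waiting', 'preparing', 'submitted', 'running']


def check_transition(old, new):
    """Reject transitions that move a task backwards through its lifecycle."""
    if STATE_ORDER.index(new) < STATE_ORDER.index(old):
        raise ValueError(f'illegal state regression: {old} -> {new}')


check_transition('preparing', 'submitted')    # fine
# check_transition('running', 'preparing')    # would raise ValueError
```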
## Job Timings
For further evidence of the parallel jobs we can inspect the timings in the remote logs:
The task was triggered (along with many others) at 2024-07-29T08:46:51Z via `cylc trigger '20190331T0000Z/*'`.
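As a quick cross-check, duplicate submissions can also be spotted by counting 'submitted to' events per task in the scheduler log. A rough sketch (a hypothetical helper, not part of Cylc; the regex matches the log format shown in the excerpts above):

```python
# Rough sketch: flag tasks with more than one 'submitted to' event in a
# Cylc scheduler log. Hypothetical helper, not part of Cylc itself.
import re
from collections import Counter

SUBMITTED = re.compile(
    r'INFO - \[(?P<task>[^/\]]+/[^/\]]+)/\d+:\w+\] submitted to'
)


def duplicate_submissions(log_path):
    """Return {task_id: submission_count} for tasks submitted more than once."""
    counts = Counter()
    with open(log_path) as log:
        for line in log:
            match = SUBMITTED.search(line)
            if match:
                counts[match['task']] += 1
    return {task: n for task, n in counts.items() if n > 1}
```

Run over the four-submission log earlier in the thread, this would report `20231130T1100Z/aa_um_recon_host` with a count of 4.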
## Reproducible(ish) example
See https://github.com/cylc/cylc-flow/issues/6329#issuecomment-2312370587