cylc / cylc-flow

Cylc: a workflow engine for cycling systems.
https://cylc.github.io
GNU General Public License v3.0

Tasks reported as succeeded(runahead) #4868

Closed: dpmatthews closed this issue 2 years ago

dpmatthews commented 2 years ago

If I run the following workflow:

[scheduler]
    allow implicit tasks = True
[scheduling]
    initial cycle point = 20220101T0000Z
    final cycle point = 20220101T0000Z
    [[graph]]
        R1 = """
               install_cold => model_retrieve_amm15sst_cold => model_ants_amm15_to_model_cold
               install_cold => model_createbc_lbc_cold & model_install_startdata_cold
               model_install_startdata_cold => model_varbc_reset => housekeep_cold
               model_install_startdata_cold & model_ants_amm15_to_model_cold => model_recon_cold
               model_createbc_lbc_cold & model_recon_cold => model_fcst_cold => model_recon_ls_cold => housekeep_cold
               model_retrieve_lbc_cold => model_createbc_lbc_cold
            """
        T00 = "model_createbc_lbc_cold[^] => model_createbc_lbc"
[runtime]
    [[root]]
        script = true

then I get the following entries in the workflow log:

2022-05-12T10:38:25+01:00 INFO - [20220101T0000Z/model_createbc_lbc_cold waiting(runahead) job:01 flows:1] => succeeded(runahead)
2022-05-12T10:38:33+01:00 INFO - [20220101T0000Z/model_varbc_reset waiting(runahead) job:01 flows:1] => succeeded(runahead)

These are very confusing. Given that succeeded(runahead) isn't an expected state, I think this can be regarded as a bug.

See also #3647.

oliver-sanders commented 2 years ago

Possibly caused by tasks drifting out of the n=0/1 window then drifting back in again at a later time?
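
(For context: a rough sketch of the n window idea, illustrative only, not how the cylc-flow data store is actually built. n=0 is the set of active task proxies, n=1 adds everything one graph edge away, and a finished task can drop out of that window and later re-enter it when a neighbour becomes active.)

from collections import deque

def n_window(active, edges, n=1):
    """Return all tasks within n graph edges of the active set."""
    # Undirected adjacency built from (upstream, downstream) graph edges.
    adj = {}
    for up, down in edges:
        adj.setdefault(up, set()).add(down)
        adj.setdefault(down, set()).add(up)
    window = set(active)
    frontier = deque((task, 0) for task in active)
    while frontier:
        task, depth = frontier.popleft()
        if depth == n:
            continue
        for neighbour in adj.get(task, ()):
            if neighbour not in window:
                window.add(neighbour)
                frontier.append((neighbour, depth + 1))
    return window

# e.g. once housekeep_cold is active, the already-finished model_varbc_reset
# is one edge away, so it comes back into the n=1 window:
# n_window({"housekeep_cold"}, [("model_varbc_reset", "housekeep_cold")])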

hjoliver commented 2 years ago

These tasks progress as normal, then the single weird log message appears some time later (shown here for model_varbc_reset):

cylc cat-log dm | grep model_varbc_reset
2022-05-13T12:54:12+12:00 INFO - [20220101T0000Z/model_varbc_reset waiting(runahead) job:00 flows:1] spawned
2022-05-13T12:54:12+12:00 INFO - [20220101T0000Z/model_varbc_reset waiting(runahead) job:00 flows:1] => waiting
2022-05-13T12:54:12+12:00 INFO - [20220101T0000Z/model_varbc_reset waiting job:00 flows:1] => waiting(queued)
2022-05-13T12:54:13+12:00 INFO - [20220101T0000Z/model_varbc_reset waiting(queued) job:00 flows:1] => waiting
2022-05-13T12:54:13+12:00 INFO - Queue released: 20220101T0000Z/model_varbc_reset
2022-05-13T12:54:13+12:00 INFO - [20220101T0000Z/model_varbc_reset waiting job:00 flows:1] => preparing
2022-05-13T12:54:13+12:00 INFO - [20220101T0000Z/model_varbc_reset preparing job:01 flows:1] host=niwa-1007823l
2022-05-13T12:54:14+12:00 INFO - [20220101T0000Z/model_varbc_reset preparing job:01 flows:1] (internal)submitted at 2022-05-13T12:54:14+12:00
2022-05-13T12:54:14+12:00 INFO - [20220101T0000Z/model_varbc_reset preparing job:01 flows:1] submitted to localhost:background[26146]
2022-05-13T12:54:14+12:00 INFO - [20220101T0000Z/model_varbc_reset preparing job:01 flows:1] => submitted
2022-05-13T12:54:14+12:00 INFO - [20220101T0000Z/model_varbc_reset submitted job:01 flows:1] health: submission timeout=None, polling intervals=PT15M,...
2022-05-13T12:54:16+12:00 INFO - [20220101T0000Z/model_varbc_reset submitted job:01 flows:1] (received)started at 2022-05-13T12:54:15+12:00
2022-05-13T12:54:16+12:00 INFO - [20220101T0000Z/model_varbc_reset submitted job:01 flows:1] => running
2022-05-13T12:54:16+12:00 INFO - [20220101T0000Z/model_varbc_reset running job:01 flows:1] health: execution timeout=None, polling intervals=PT15M,...
2022-05-13T12:54:26+12:00 INFO - [20220101T0000Z/model_varbc_reset running job:01 flows:1] (received)succeeded at 2022-05-13T12:54:25+12:00
2022-05-13T12:54:26+12:00 INFO - [20220101T0000Z/model_varbc_reset running job:01 flows:1] => succeeded
...
2022-05-13T12:55:08+12:00 INFO - [20220101T0000Z/model_varbc_reset waiting(runahead) job:01 flows:1] => succeeded(runahead)

(Note: I put script = sleep 10 in root.)

hjoliver commented 2 years ago

> Possibly caused by tasks drifting out of the n=0/1 window then drifting back in again at a later time?

Confirmed. When loading task history from the DB, the datastore calls TaskProxy.state_reset(), which actions and logs state changes, even though this isn't a "real" state change.
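
For illustration, a minimal sketch of the pattern (not the actual cylc-flow code: the method name mirrors TaskProxy.state_reset(), everything else here is made up) showing how a reset helper that logs every call produces a phantom transition when it is also used to restore task history from the DB:

import logging

LOG = logging.getLogger("workflow")

class TaskProxy:
    """Illustrative stand-in, not the real class."""
    def __init__(self, identity, status="waiting", is_runahead=True):
        self.identity = identity
        self.status = status
        self.is_runahead = is_runahead

    def _fmt(self, status):
        # Render the runahead modifier the way the workflow log does.
        return f"{status}(runahead)" if self.is_runahead else status

    def state_reset(self, status):
        # Applies the new status AND logs it as if it were a live transition.
        LOG.info(
            "[%s %s] => %s", self.identity, self._fmt(self.status), self._fmt(status)
        )
        self.status = status

def load_task_history(itask, db_status):
    # Restoring a task's recorded state from the DB goes through the same
    # code path, so a task that finished long ago gets logged as
    #   waiting(runahead) => succeeded(runahead)
    itask.state_reset(db_status)

if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO, format="INFO - %(message)s")
    load_task_history(TaskProxy("20220101T0000Z/model_varbc_reset"), "succeeded")

Presumably the DB-load path needs to set the state without going through the logging/action side of state_reset().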

hjoliver commented 2 years ago

(Note: I took a look at this with some urgency, in case it indicated a problem with the scheduling algorithm. Fortunately it didn't.)