cylc / cylc-flow

Cylc: a workflow engine for cycling systems.
https://cylc.github.io
GNU General Public License v3.0
sod: premature shutdown for recurrence format 4 #5945

Open oliver-sanders opened 9 months ago

oliver-sanders commented 9 months ago

Cylc 8 may silently ignore some format 4 recurrences.

E.g.:

[scheduler]
    cycle point format = CCYY
    allow implicit tasks = True

[scheduling]
    initial cycle point = 2000
    final cycle point = 2010
    [[graph]]
        # cycle forwards
        R3/2000/P1Y = f[-P1Y] => f

        # cycle backwards
        R3/P1Y/2010 = b[-P1Y] => b

Cylc 8: Spawns the f chain of tasks and shuts down on cycle 2003.
Cylc 7: Spawns both the f and b chains.
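
For reference, here is a minimal Python sketch of the cycle points that the two recurrences in this example expand to. It treats years as plain integers and does not use Cylc's own recurrence parser; the helper names `expand_forward` and `expand_backward` are made up for illustration.

```python
def expand_forward(start, period, count):
    """Points of a format 3 recurrence (R<count>/<start>/P<period>Y).

    Counts forwards from the start point. Hypothetical helper, not Cylc API.
    """
    return [start + i * period for i in range(count)]


def expand_backward(end, period, count):
    """Points of a format 4 recurrence (R<count>/P<period>Y/<end>).

    Counts backwards from the end point. Hypothetical helper, not Cylc API.
    """
    return sorted(end - i * period for i in range(count))


# R3/2000/P1Y: the "forwards" chain
f_points = expand_forward(2000, 1, 3)
# R3/P1Y/2010: the "backwards" chain
b_points = expand_backward(2010, 1, 3)

print(f_points)  # [2000, 2001, 2002]
print(b_points)  # [2008, 2009, 2010]
```

So the first `b` task sits at 2008 and, through `b[-P1Y] => b`, depends on a `b` task at 2007 that the recurrence does not define.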

See also https://github.com/cylc/cylc-flow/issues/5946

hjoliver commented 9 months ago

Just tried with the "backward" recurrence alone:

INFO - Cylc version: 8.3.0.dev
INFO - Run mode: live
INFO - Initial point: 2000
INFO - Final point: 2010
INFO - Cold start from 2000
INFO - New flow: 1 (original flow from 2000) 2024-01-31 00:41:23
DEBUG - Runahead: base point 2008
DEBUG - Runahead limit: 2010

(Followed by immediate shutdown)

So this is runahead-related in the sense that at start-up, when there are no tasks in the pool, we compute the runahead limit using sequence points alone. Something seems to be going wrong there.

MetRonnie commented 9 months ago

Looks like it is a problem with the pre-initial dependency (if that's the right term). Adding R1//2007 = b or R1/2007 = b allows the sequence to start.

hjoliver commented 9 months ago

Yeah, I think I understand it ...

hjoliver commented 9 months ago

Actually, I'm kinda surprised this worked in Cylc 7. (Did it?)

The pre-initial dependency handling only ever (as I recall) applied to the initial cycle point of the workflow, not individual sequence start points.

That rings a bell ...

hjoliver commented 9 months ago

Looks like I flagged this ages ago:

https://github.com/cylc/cylc-flow/issues/1936

And see the final comment from 2 years ago (SoD):

https://github.com/cylc/cylc-flow/issues/1936#issuecomment-1031825906

MetRonnie commented 9 months ago

Ah yes, if you change the start of the format 3 recurrence (the "forwards" example) to a point beyond the ICP e.g. R3/2003/P1Y = f[-P1Y] => f, then you get the same immediate shutdown.

Likewise, if you change the end of the format 4 recurrence (the "backwards" example) to 2002, then the b jobs run.

hjoliver commented 9 months ago

Right, so as I understand it that's the expected result - hence my issue above suggesting it would be nice to apply "pre-initial" logic to individual sequences not just the workflow initial cycle point.

The "workaround" is simply to not expect magical bootstrapping into an intercycle dependency, but handle it explicitly.
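
A rough Python sketch of that rule, assuming integer years and a single `[-P1Y]` style offset: a dependency on a point before the workflow ICP is ignored, and any other dependency needs a matching task somewhere in the graph. The function `first_task_can_run` and its arguments are hypothetical and are not part of cylc-flow; this is only a model of the behaviour described in this thread.

```python
def first_task_can_run(points, offset, icp, extra_points=()):
    """Return True if the first task of a self-dependent sequence can run.

    points:       cycle points of the recurrence (integer years here)
    offset:       size of the inter-cycle offset, e.g. 1 for [-P1Y]
    icp:          workflow initial cycle point
    extra_points: points where the task is also defined with no prerequisites
    """
    upstream = min(points) - offset
    if upstream < icp:
        # Pre-initial dependency: ignored, so the chain bootstraps itself.
        return True
    # Otherwise the upstream task must actually exist in the graph.
    return upstream in set(points) | set(extra_points)


ICP = 2000

# R3/2000/P1Y: starts at the ICP, f[-P1Y] at 1999 is pre-initial.
forward_at_icp = first_task_can_run([2000, 2001, 2002], 1, ICP)

# R3/2003/P1Y: starts after the ICP, f.2002 is never defined.
forward_after_icp = first_task_can_run([2003, 2004, 2005], 1, ICP)

# R3/P1Y/2010: first point is 2008, b.2007 is never defined.
backward = first_task_can_run([2008, 2009, 2010], 1, ICP)

# R3/P1Y/2002: first point is 2000, so b[-P1Y] is pre-initial.
backward_to_icp = first_task_can_run([2000, 2001, 2002], 1, ICP)

# R3/P1Y/2010 plus a one-off b at 2007 with no prerequisites.
backward_bootstrapped = first_task_can_run(
    [2008, 2009, 2010], 1, ICP, extra_points=[2007]
)

print(forward_at_icp, forward_after_icp, backward,
      backward_to_icp, backward_bootstrapped)
```

Under these assumptions only the sequences whose first point coincides with the ICP, or that are bootstrapped explicitly, get going.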

hjoliver commented 9 months ago

Explicit works fine:

    [[graph]]
        # count backwards
        R3/P1Y/2010 = b[-P1Y] => b
        # bootstrap into the sequence
        R1/2007/P0Y = b

oliver-sanders commented 9 months ago

Actually, I'm kinda surprised this worked in Cylc 7. (Did it?)

Yes (as above)

hjoliver commented 9 months ago

Well the lone "backward" one doesn't work:

            ._.
            | |                 The Cylc Suite Engine [7.9.9]
._____._. ._| |_____.           Copyright (C) 2008-2019 NIWA
| .___| | | | | .___|   & British Crown (Met Office) & Contributors.
| !___| !_! | | !___.  _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
!_____!___. |_!_____!  This program comes with ABSOLUTELY NO WARRANTY;
      .___! |          see `cylc warranty`.  It is free software, you
      !_____!           are welcome to redistribute it under certain
2024-01-31T01:21:08+13:00 INFO - Suite server: url=http://NIWA-1022450.niwa.local:43099/ pid=19403
2024-01-31T01:21:08+13:00 INFO - Run: (re)start=0 log=1
2024-01-31T01:21:08+13:00 INFO - Cylc version: 7.9.9
2024-01-31T01:21:08+13:00 INFO - Run mode: live
2024-01-31T01:21:08+13:00 INFO - Initial point: 2000
2024-01-31T01:21:08+13:00 INFO - Final point: 2010
2024-01-31T01:21:08+13:00 INFO - Cold Start 2000
2024-01-31T01:21:09+13:00 WARNING - suite stalled
2024-01-31T01:21:09+13:00 WARNING - Unmet prerequisites for b.2008:
2024-01-31T01:21:09+13:00 WARNING -  * b.2007 succeeded

hjoliver commented 9 months ago

Neither does your full example - so I'm confused!

oliverh@NIWA-1022450:~/cylc-src/dog$ cylc run --no-detach dog
            ._.
            | |                 The Cylc Suite Engine [7.9.9]
._____._. ._| |_____.           Copyright (C) 2008-2019 NIWA
| .___| | | | | .___|   & British Crown (Met Office) & Contributors.
| !___| !_! | | !___.  _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
!_____!___. |_!_____!  This program comes with ABSOLUTELY NO WARRANTY;
      .___! |          see `cylc warranty`.  It is free software, you
      !_____!           are welcome to redistribute it under certain
2024-01-31T01:23:30+13:00 INFO - Suite server: url=http://NIWA-1022450.niwa.local:43028/ pid=19486
2024-01-31T01:23:30+13:00 INFO - Run: (re)start=0 log=1
2024-01-31T01:23:30+13:00 INFO - Cylc version: 7.9.9
2024-01-31T01:23:30+13:00 INFO - Run mode: live
2024-01-31T01:23:30+13:00 INFO - Initial point: 2000
2024-01-31T01:23:30+13:00 INFO - Final point: 2010
2024-01-31T01:23:30+13:00 INFO - Cold Start 2000
2024-01-31T01:23:30+13:00 INFO - [f.2000] -submit-num=01, owner@host=NIWA-1022450.niwa.local
2024-01-31T01:23:31+13:00 INFO - [f.2000] status=ready: (internal)submitted at 2024-01-31T01:23:31+13:00 for job(01)
2024-01-31T01:23:31+13:00 INFO - [f.2000] -health check settings: submission timeout=None
2024-01-31T01:23:31+13:00 INFO - [f.2000] status=submitted: (received)started at 2024-01-31T01:23:31+13:00 for job(01)
2024-01-31T01:23:31+13:00 INFO - [f.2000] -health check settings: execution timeout=None
2024-01-31T01:23:33+13:00 INFO - [f.2000] status=running: (received)succeeded at 2024-01-31T01:23:33+13:00 for job(01)
2024-01-31T01:23:34+13:00 INFO - [f.2001] -submit-num=01, owner@host=NIWA-1022450.niwa.local
2024-01-31T01:23:35+13:00 INFO - [f.2001] status=ready: (internal)submitted at 2024-01-31T01:23:35+13:00 for job(01)
2024-01-31T01:23:35+13:00 INFO - [f.2001] -health check settings: submission timeout=None
2024-01-31T01:23:35+13:00 INFO - [f.2001] status=submitted: (received)started at 2024-01-31T01:23:35+13:00 for job(01)
2024-01-31T01:23:35+13:00 INFO - [f.2001] -health check settings: execution timeout=None
2024-01-31T01:23:37+13:00 INFO - [f.2001] status=running: (received)succeeded at 2024-01-31T01:23:37+13:00 for job(01)
2024-01-31T01:23:38+13:00 INFO - [f.2002] -submit-num=01, owner@host=NIWA-1022450.niwa.local
2024-01-31T01:23:39+13:00 INFO - [f.2002] status=ready: (internal)submitted at 2024-01-31T01:23:39+13:00 for job(01)
2024-01-31T01:23:39+13:00 INFO - [f.2002] -health check settings: submission timeout=None
2024-01-31T01:23:39+13:00 INFO - [f.2002] status=submitted: (received)started at 2024-01-31T01:23:39+13:00 for job(01)
2024-01-31T01:23:39+13:00 INFO - [f.2002] -health check settings: execution timeout=None
2024-01-31T01:23:41+13:00 INFO - [f.2002] status=running: (received)succeeded at 2024-01-31T01:23:41+13:00 for job(01)
2024-01-31T01:23:43+13:00 WARNING - suite stalled
2024-01-31T01:23:43+13:00 WARNING - Unmet prerequisites for b.2008:
2024-01-31T01:23:43+13:00 WARNING -  * b.2007 succeeded

hjoliver commented 9 months ago

(Anyhow, I gotta bail, it's late here ... I'll check follow-up comments in the morning).

hjoliver commented 9 months ago

A Cylc 7 stall with unsatisfied pre-spawned tasks, versus a Cylc 8 shutdown with nothing to do, is expected under the circumstances. But I presume by "it worked at Cylc 7" you mean it actually ran, not that it immediately stalled.

oliver-sanders commented 9 months ago

Well the lone "backward" one doesn't work:

That's because of the pre-initial dependency which I've written up as a separate issue, see https://github.com/cylc/cylc-flow/issues/5946

This can be worked around as you've observed.

MetRonnie commented 9 months ago

So to uncross wires (as #5946 is the same issue as this one), are the real issues here #1936 and #4638?

oliver-sanders commented 9 months ago

I haven't had the time to investigate this yet to know. I suspect yes and maybe.

hjoliver commented 9 months ago

OK, sorry if I didn't read the fine print:

Cylc 7: Spawns both the f and b chains.

I guess I over-interpreted this, and that it "worked in Cylc 7", to mean both chains actually run in Cylc 7, which they don't.

We should certainly consider making this more obvious or flexible for users (hence the old issues #1936 and #4638) but technically the current Cylc 8 behaviour is correct and not a bug.

  R3/P1Y/2010 = b[-P1Y] => b  # with ICP = 2000 say

This literally says:

Plus: automatic bootstrapping into an inter-cycle dependency is a convenience (not a requirement) that only applies to the ICP, not to individual recurrences.

Therefore, under current well-defined rules of engagement the user has probably just made a configuration error such that there is literally nothing to run in that part of the graph.

So on that basis this is a "could be better" rather than a bug, and once we've decided on the approach we should consolidate this issue with one of the older ones:

oliver-sanders commented 9 months ago

Follow up question:

Should we even allow a workflow to be started if it contains unreachable sections of graph?

All functional Cylc 7 workflows will pass this test.

hjoliver commented 8 months ago

Should we even allow a workflow to be started if it contains unreachable sections of graph?

Probably not, but (obviously) achieving that requires being able to detect that at validation.
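
As a rough idea of what such a check could look like, here is a Python sketch for a small, fully expanded graph. It assumes integer cycle points, ignores any dependency on a point before the ICP, and treats every other prerequisite as required. The function `unreachable_tasks` and the dictionary layout are invented for illustration; a real validation step would also have to cope with unbounded cycling graphs, which this does not.

```python
def unreachable_tasks(prereqs, icp):
    """Return tasks that can never run, as a sorted list of (name, point).

    prereqs maps each task (name, point) to the list of (name, point)
    tasks it depends on. Dependencies on points before the ICP are
    treated as pre-initial and ignored.
    """
    runnable = set()
    changed = True
    # Iterate to a fixed point: a task becomes runnable once every
    # prerequisite is either pre-initial or already runnable.
    while changed:
        changed = False
        for task, deps in prereqs.items():
            if task in runnable:
                continue
            if all(dep[1] < icp or dep in runnable for dep in deps):
                runnable.add(task)
                changed = True
    return sorted(set(prereqs) - runnable)


ICP = 2000

# R3/2000/P1Y = f[-P1Y] => f
graph = {("f", y): [("f", y - 1)] for y in (2000, 2001, 2002)}
# R3/P1Y/2010 = b[-P1Y] => b
graph.update({("b", y): [("b", y - 1)] for y in (2008, 2009, 2010)})

broken = unreachable_tasks(graph, ICP)
print(broken)  # [('b', 2008), ('b', 2009), ('b', 2010)]

# Add the explicit bootstrap: R1/2007 = b (no prerequisites).
graph[("b", 2007)] = []
fixed = unreachable_tasks(graph, ICP)
print(fixed)  # []
```

With the bootstrap task added the whole `b` chain becomes reachable, which matches the explicit workaround earlier in this thread.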

All functional Cylc 7 workflows will pass this test.

Which test? You can certainly start a Cylc 7 workflow that contains unreachable graph: any inter-cycle dependence that isn't automatically bootstrapped by pre-initial-ignore:

[scheduling]
    cycling mode = integer
    initial cycle point = 1
    [[dependencies]]
        [[[R1]]]
            graph = "foo[-P1] => foo"  # OK
        [[[R1/2/P1]]]
            graph = "bar[-P1] => bar"  # Uh-oh, unreachable.

oliver-sanders commented 8 months ago

All functional Cylc 7 workflows will pass this test.

Which test?

The test of not having any unreachable tasks.

By functional I was ruling out workflows with broken graphs that will stall when run.