cylc / cylc-flow

Cylc: a workflow engine for cycling systems.
https://cylc.github.io
GNU General Public License v3.0

Support for on-sequence cold-starts #149

Closed hjoliver closed 10 years ago

hjoliver commented 11 years ago

cylc-dev discussion: https://groups.google.com/forum/?hl=en&fromgroups=#!topic/cylc-dev/g7dhs14sQUU

For suites with a cold-start forecast that happens to start exactly one normal cycle interval prior to the first full/normal/warm cycle, users are likely to think that the suite's "initial cycle time" should be that of the cold-start, not the first proper cycle. This does not work in general (cold-starts often are not on-sequence) but we could support it as an option for suites where it does make sense.

trwhitcomb commented 11 years ago

One way to handle the special case where the cold-start cycle graph is a subset of the normal cycle graph would be a configuration option similar to "exclude on start-up". That option selects tasks to be excluded from the suite execution tree until they are added in later in the run. An "exclude on cold-start" option would have the same net effect in the initial cycle, but would not remove the tasks from the graph in future cycles. Nor would these tasks be removed from the graph on warm-starts.

Tasks run as part of the cold-start could have an environment variable passed to them as well so something like $CYLC_TASK_COLD_START is set to true in those tasks, possibly eliminating the need to have any extra options in the [runtime] section of the configuration file that just set that variable.

A suite running in this configuration, then, might look like

[scheduling]
    [[special tasks]]
        exclude on cold-start = DA

    [[dependencies]]
        [[[0,6,12,18]]]
            graph = "Model[T-6] => DA => Model"

where the execution starts at the 0 hour cycle with a forecast only, then continues at the 6 hour cycle with data assimilation and a forecast.

Currently, when warm-starting, the cold-start tasks are set to a succeeded status. The opposite could be done here. When cold-starting, tasks in the exclude on cold-start list can have their status set to succeeded; in the scenario above, this would satisfy the DA => Model dependency, so the Model task would execute immediately. Conversely, during a warm-start no such setting is necessary: those tasks are not set to succeeded, so they have to actually execute.
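
A minimal Python sketch of this idea, not cylc's actual implementation. The task names come from the example above; `load_initial_states`, `ready` and the state strings are hypothetical names used only for illustration. It also assumes that the Model[T-6] trigger is simply ignored at the initial cycle.

```python
# Toy model of "exclude on cold-start": excluded tasks are pre-set to
# "succeeded" in the initial cycle so that downstream tasks are not held up.
# All names here are illustrative assumptions, not cylc internals.

def load_initial_states(tasks, exclude_on_cold_start, cold_start):
    """Return the state of each task in the initial cycle."""
    states = {}
    for name in tasks:
        if cold_start and name in exclude_on_cold_start:
            states[name] = "succeeded"  # pretend it already ran
        else:
            states[name] = "waiting"    # must really execute
    return states


def ready(task, prerequisites, states):
    """A waiting task is ready once all its same-cycle prerequisites succeeded."""
    return states[task] == "waiting" and all(
        states[p] == "succeeded" for p in prerequisites.get(task, [])
    )


tasks = ["DA", "Model"]
# Same-cycle part of "Model[T-6] => DA => Model"; the Model[T-6] trigger is
# assumed to be ignored at the initial cycle.
prerequisites = {"Model": ["DA"]}

cold = load_initial_states(tasks, {"DA"}, cold_start=True)
warm = load_initial_states(tasks, {"DA"}, cold_start=False)
```

With these definitions, `ready("Model", prerequisites, cold)` is true straight away, whereas in the warm case Model stays waiting until DA has really run.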

The actual task script for Model, in this case, would likely contain something like

if [[ -n "$CYLC_TASK_COLD_START" ]]; then
    # set namelist options for a cold start
fi

that would run only in cold-start mode.

Still left unresolved:

trwhitcomb commented 11 years ago

From the mailing list discussion, Hilary pointed out that "dependence on the start-up task type only applies in a cold-start". Having start-up tasks execute only in a cold start would allow us to specify that an exclude on cold-start task is not allowed to depend on a start-up task. In the scenario above, this would mean it would look like:

[scheduling]
    [[special tasks]]
        start-up = Prep
        exclude on cold-start = DA    

    [[dependencies]]
        [[[0,6,12,18]]]
            graph = """
                    Model[T-6] => DA => Model
                    Prep => Model # only runs on first cold start
                    """

trwhitcomb commented 11 years ago

The multiple inheritance capability suggested in #134 would also help out here, since the individual tasks that would be skipped on a cold start could inherit from an additional namespace and that family name could be used in the exclude on cold-start section.

hjoliver commented 11 years ago

Tim, I think your idea would work, and it may be the simplest way to cold-start a suite in which "the cold-start graph is a subset of the normal cycle graph". My only concerns at this stage are:

Another approach I'd like to think about is more like what cylc currently does: cold-start tasks have to be specified as such and they only run once (but if they happen to do exactly the same thing as a normal-cycle counterpart, just give them the exact same runtime config) ... then when cylc starts up instantiate cold-start task proxies at the initial cycle time (as now) but cycling tasks in the next cycle (instead of at the initial cycle as now). This is not quite as simple as your approach for the "cold-start graph is a subset" case, but (I think?) it handles cold-start only (non-subset) tasks better? We could support both approaches if they are each simpler and easier to understand for different types of suites.

One other thing to think about: what happens if you have several different cycling intervals in the same suite (e.g. some tasks that run every 6 hours and some every 12 hours)? Currently cylc instantiates all tasks at the initial cycle time or at the next subsequent cycle time for the particular task. This would probably still work here, but we need to check.

dpmatthews commented 11 years ago

We have some other potential requirements that may relate to this.

trwhitcomb commented 11 years ago

In our case, the basic suite is small enough that either way would work (i.e. specifying tasks to exclude or include in a cold start) - my initial suggestion was based off of something that was already included (i.e. you specify the jobs to exclude on start-up) but I'm not attached to that. One thing that may affect this is the addition of multiple inheritance (#134) - you could just specify a few parent namespaces to exclude on cold-start (or include on cold-start) and then all the relevant tasks could just include that namespace in an inc= line along with their other inheritances (really more as a group identifier than anything else).

For the second point (maintaining the current cold-start task list), I've sketched it out and I don't think it would be a problem to keep that - in my initial example, the GetICs task could be roped in to a cold-start model task (by checking an environment variable and doing a copy), but it should work in the current "regular" case as well. This would also mean that those cold-start tasks may be able to avoid doing cycle offset calculations.

We actually do have different cycling intervals in the same suite - data assimilation/short forecasts run every 6 hours, and every 12 hours we kick off a long forecast. However, since the normal course of our experiments is to cold-start the system, let the data assimilation spin up, and then start doing long forecasts after a week or two, we currently use "exclude on start-up" to handle those for our experimental suites.

Dave's suggestion of spin-down and shut-down tasks would be useful for us as well - after an experiment runs, we would like to run a detailed comparison against a reference case, and it would be great to be able to include that as part of the official suite.

One way to handle Dave's need for multiple spin-up cycles would be to have the suite specify the first full cycle relative to the start time, but this seems like it would get very complicated very quickly.

hjoliver commented 11 years ago

Tim & Dave - shutdown tasks will be easy to add to cylc. Multiple-cycle spin-up and spin-down tasks may not be too hard either. Note you can already "cylc insert" a task that continues cycling until a given stop cycle - so the internals are really there already, we just need a way to express this cleanly in the suite definition, and to think about when these temporary-duration tasks should be created. Dave, do you want to put up some new Issues for these?

Tim - before we decide on which approach to go ahead with I don't think you've commented on my paragraph above starting "Another approach I'd like to think about is more like what cylc currently does:". It seems to me this might be the easiest one to implement because it could involve nothing more than bumping the first non-coldstart tasks into the next cycle when the task pool is loaded at start-up. What do you think?

trwhitcomb commented 11 years ago

Hilary - I hadn't, since I wanted to read it over and make sure I understand what you're suggesting before I commented :)

You're saying that instead of inserting everything into the task pool at the initial time and setting success/failure flags immediately if we're cold-starting, we hold off on the tasks that are marked as being skipped in a cold start. In the example above, that would mean that the DA task wouldn't end up in the pool and be checked for dependencies until the 06Z cycle (if we were cold-starting at 00Z), right?

That actually sounds like a really nice way to do it, and conceptually it matches what our top-level run script does now (i.e. just skips straight to the later tasks). It would look something like 'a task designated as being excluded on the cold start won't be eligible for execution until the next cycle after the cold start', which would handle the case of multiple cold-start tasks with multiple offsets, i.e. the following case

[scheduling]
    [[special tasks]]
        start-up = Prep
        exclude on cold-start = DA, LongForecast

    [[dependencies]]
        [[[0,6,12,18]]]
            graph = """
                    Model[T-6] => DA => Model
                    Prep => Model # only runs on first cold start
                    """
        [[[0,12]]]
            graph = """Model => LongForecast"""

cold-started at the 00Z cycle would see the first DA task included in the 06Z cycle and the first LongForecast task included in the 12Z cycle.
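
The rule described here can be sketched in Python as follows. This is an illustration only, not cylc code: `first_cycle` is a hypothetical helper, the date is arbitrary, and cycle times are assumed to fall on whole hours.

```python
from datetime import datetime, timedelta


def first_cycle(initial, valid_hours, excluded):
    """First cycle for a task, given the cold-start cycle `initial`.

    Excluded tasks start strictly after the cold-start cycle; all other
    tasks may start at it. Assumes cycle times fall on whole hours.
    """
    cycle = initial + timedelta(hours=1) if excluded else initial
    # Step hour by hour; 24 steps are enough to reach any valid hour.
    for _ in range(24):
        if cycle.hour in valid_hours:
            return cycle
        cycle += timedelta(hours=1)
    raise ValueError("no valid hours given")


# Cycling hours taken from the [[[0,6,12,18]]] and [[[0,12]]] sections above.
hours = {"Model": {0, 6, 12, 18}, "DA": {0, 6, 12, 18}, "LongForecast": {0, 12}}
exclude_on_cold_start = {"DA", "LongForecast"}
cold_start = datetime(2013, 1, 1, 0)  # arbitrary 00Z date

first = {
    name: first_cycle(cold_start, valid, name in exclude_on_cold_start)
    for name, valid in hours.items()
}
```

Under these assumptions `first` maps Model to the 00Z cycle, DA to 06Z and LongForecast to 12Z of the same day. Calling the helper with `excluded=False` gives the "initial cycle time or next subsequent cycle time for the task" behaviour that applies to tasks with different cycling intervals.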

I like it, but I don't know what sort of changes would be required to the dependency graph (if any) if it was implemented that way.

hjoliver commented 11 years ago

Tim, I'm not sure this is as simple as you might hope. The current cold-start method fits nicely with the way cylc graphs are interpreted, namely that a trigger "X => Y" defines what the task on the right triggers off at a given cycle time T. So, from "ColdFoo | Foo[T-6] => Foo" at the initial cycle T we end up with a trigger on a task ColdFoo with the same cycle T, not on a previous (non-existent) cycle. "Cold start as first cycle" methods do not seem to fit this interpretation as well. For example, the trouble with pushing the first instances of non-cold-start tasks into the next cycle is how to express it in the graph:

graph = "ColdFoo[T-6] | Foo[T-6] => Foo"

If this graph is still interpreted in the usual way (above) then we still have a bootstrapping problem because now Foo[T] always depends on a task in the previous cycle (so what to do at a given start cycle T?). Presumably we'd have to interpret the graph in a special way at start-up, by offsetting T appropriately to get (in effect) this:

graph = "ColdFoo[T] | Foo[T] => Foo[T+6]" # and create the first Foo at T_initial + 6

or use a special cold-start graph-string:

cold-graph = "ColdFoo => Foo[T+6]" # currently we don't allow [T+/-n] on the right side of a trigger arrow
graph = "Foo[T-6] => Foo" # and create the first Foo at T_initial + 6.

But might this sort of extra complexity be worse than getting users to understand the current cold-start method? (the fact that Foo does not start cycling until T+6 is not actually apparent from the graph either).
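
The usual interpretation of a trigger, and the bootstrapping problem with the shifted graph, can be sketched in Python. This is a simplified model, not cylc's graph parser: `resolve_triggers` and `satisfiable_at_start` are hypothetical helpers, only "|" alternatives with hour offsets are handled, and a trigger is assumed to be unsatisfiable if it refers to a cycle before the initial one.

```python
import re
from datetime import datetime, timedelta

# Matches a task name with an optional [T-n] or [T+n] hour offset.
TRIGGER = re.compile(r"^(\w+)(?:\[T([+-]\d+)\])?$")


def resolve_triggers(left_side, cycle):
    """Resolve the alternatives on the left of '=>' to (task, cycle) pairs."""
    resolved = []
    for item in left_side.split("|"):
        match = TRIGGER.match(item.strip())
        if match is None:
            raise ValueError(f"cannot parse trigger: {item!r}")
        name, offset = match.group(1), int(match.group(2) or 0)
        resolved.append((name, cycle + timedelta(hours=offset)))
    return resolved


def satisfiable_at_start(left_side, cycle, initial):
    """Keep only alternatives that do not refer to a cycle before `initial`."""
    return [(n, c) for n, c in resolve_triggers(left_side, cycle) if c >= initial]


initial = datetime(2013, 1, 1, 0)  # arbitrary initial cycle time
current = satisfiable_at_start("ColdFoo | Foo[T-6]", initial, initial)
shifted = satisfiable_at_start("ColdFoo[T-6] | Foo[T-6]", initial, initial)
```

In this model `current` still contains ColdFoo at the initial cycle, so the first Foo can trigger, while `shifted` is empty: with the usual interpretation, nothing can satisfy Foo at the start cycle.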

Your "exclude on cold-start" graphs also seem to require special interpretation at start-up. Take "Model[T-6] => DA => Model" with "exclude DA at start-up": the exclusion of DA at start-up is reasonably clear because there's a special task category to state that, but I guess cylc would also have to ignore Model[T-6] at start-up because T-6 is prior to the initial cycle? Finally, this exclusion of tasks at start-up is not just a matter of omitting the excluded tasks from the task pool until the first full cycle. It also requires changing the prerequisites/triggers on the tasks that normally depend on the excluded ones (here: Model has to know not to wait on DA in the first, cold-start, cycle). (However, this is already done for start-up tasks.)
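
The prerequisite adjustment described above can be sketched in Python. This is an assumption-laden illustration rather than how cylc handles start-up tasks: `cold_start_prerequisites` is a hypothetical helper, and prerequisites are modelled as (task name, hour offset) pairs.

```python
from datetime import datetime, timedelta


def cold_start_prerequisites(prereqs, cycle, initial, excluded):
    """Drop prerequisites that can never be met in the cold-start cycle.

    `prereqs` is a list of (task name, hour offset) pairs for one task.
    """
    if cycle != initial:
        return list(prereqs)  # normal cycles keep every prerequisite
    kept = []
    for name, offset in prereqs:
        before_start = cycle + timedelta(hours=offset) < initial
        skipped = name in excluded  # excluded tasks do not exist in this cycle
        if not (before_start or skipped):
            kept.append((name, offset))
    return kept


initial = datetime(2013, 1, 1, 0)  # arbitrary cold-start cycle
model_prereqs = [("DA", 0)]        # from "DA => Model"
da_prereqs = [("Model", -6)]       # from "Model[T-6] => DA"

at_start = cold_start_prerequisites(model_prereqs, initial, initial, {"DA"})
next_cycle = cold_start_prerequisites(
    model_prereqs, initial + timedelta(hours=6), initial, {"DA"}
)
da_at_start = cold_start_prerequisites(da_prereqs, initial, initial, {"DA"})
```

Here Model has no prerequisites left in the cold-start cycle, so it can run immediately, and it waits on DA again from the next cycle onwards.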

(I hope this makes some sense .... it's getting late here!)

benfitzpatrick commented 10 years ago

This problem can be solved with the new cycling syntax of #119.