Exit status specific triggers for highly-flexible scheduling

sadielbartholomew commented 5 years ago

As well as the current failure task state resulting from any non-zero exit status from that task's script, we could support triggering off of specific exit statuses. For example (using a syntax with parentheses for illustration, though I am aware that syntax may not be viable):

    graph = """
        foo:fail => bar     # Standard failure: captures all non-zero codes
        bar:fail(1) => pub  # New: trigger pub if bar fails with exit status 1...
        bar:fail(2) => wop  # ...but trigger wop if it instead fails with exit status 2.
                            # Any other bar exit status does not trigger anything.
    """

This graph would take a different scheduling course depending on whether bar fails with exit code 1, with exit code 2, or with any other non-zero code.
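To make the intended semantics concrete, here is a minimal Python sketch of how such triggers might be resolved. Both the `fail(N)` syntax and the `downstream_tasks` helper are hypothetical and purely illustrative; nothing like this exists in Cylc at the time of writing.

```python
from typing import Dict, List, Optional

# Hypothetical trigger table for task "bar": a key of None stands for the
# plain ":fail" trigger, an integer key for a proposed ":fail(N)" trigger.
BAR_TRIGGERS: Dict[Optional[int], List[str]] = {
    1: ["pub"],
    2: ["wop"],
}


def downstream_tasks(
    exit_status: int, triggers: Dict[Optional[int], List[str]]
) -> List[str]:
    """Return the tasks triggered by a job that ended with exit_status."""
    if exit_status == 0:
        return []  # success: no failure triggers fire
    fired = list(triggers.get(None, []))    # plain ":fail" catches any non-zero code
    fired += triggers.get(exit_status, [])  # ":fail(N)" fires only for code N
    return fired
```

With this table, exit status 1 yields `["pub"]`, exit status 2 yields `["wop"]`, and any other status yields an empty list, matching the comments in the graph above.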

While users are perhaps unlikely to need to differentiate between exit codes set directly by a script, I raise this because with this feature exit codes would essentially become parameters, allowing greatly extended control over scheduling. Instead of only the standard task "final" states of succeeded, failed & submit-failed (& in a sense expired, which I understand is a final state of sorts), there would be essentially unlimited (in practice, 255) possible endpoints for users to set in their scripts and trigger off, covering a myriad of possible cases arising in them. It would need a separate specification, though (e.g. the parentheses syntax); I am not suggesting the standard failure (& success) cases should go, as users would often not need this advanced flexibility.

Illustrative example

As a superficial example, note how various end cases of interest can be used to branch the scheduling in the below. Naturally, in a real case, the code would be much more involved; imagine the sys.exit(N) calls are placed at points of interest in the script control flow each with some chosen N = 0, ..., 255.

Suite.rc snippet:

[runtime]
    [[my_task]]
        script = "failure-mode-demo.py"

Python script `bin/failure-mode-demo.py`:

import sys

# ...
# ...
# ... More involved code here! 'this' variable may get set.
# ...
# ...
if not this:
    sys.exit(1)  # endpoint 1: exit code 1, failure mode
try:
    import my_module
    my_module.some_operation(this)  # say this logically can hit a TypeError
except ImportError:
    sys.exit(2)  # endpoint 2: exit code 2, different failure mode
except TypeError:
    sys.exit(3)  # endpoint 3: exit code 3, different failure mode
# endpoint 4: exit code 0, success

hjoliver commented 5 years ago

I'll just note that we can already achieve the same thing with custom task messages - by translating (in job scripting) application return codes into meaningful messages, and triggering tasks off of those. However, for applications that do have well-defined return codes for specific error conditions, this is a good proposal (as it reduces effort - no need to use custom task messages).
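A minimal sketch of that workaround, written in Python rather than job-script Bash: run the application, map its return code to a message, and send it. The message texts, the mapping, and the `send` callable are assumptions for illustration; in a real job `send` would wrap the `cylc message` command, and the texts would have to match the task's configured outputs.

```python
import subprocess
import sys
from typing import Callable, Dict, List, Optional

# Hypothetical mapping from an application's documented return codes to
# custom task messages.
MESSAGES: Dict[int, str] = {
    1: "input data missing",
    2: "model diverged",
}


def message_for(return_code: int) -> Optional[str]:
    """Return the custom message for a return code, or None if unmapped."""
    return MESSAGES.get(return_code)


def run_and_report(command: List[str], send: Callable[[str], None]) -> int:
    """Run command, send the matching custom message, return the code."""
    return_code = subprocess.run(command).returncode
    message = message_for(return_code)
    if message is not None:
        send(message)
    return return_code
```

The caller still decides what to do with the returned code, for example exiting 0 so that the task succeeds while the message drives the graph.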

sadielbartholomew commented 5 years ago

Ah, nice, that's a good point! Thanks @hjoliver. I guess the crux of this Issue then becomes making it simpler & more explicit to set up exit-code-specific triggering, via the suite.rc instead of individual custom task messages.

It's perhaps a speculative one for the future, so I don't think there is too much more to say right now!

oliver-sanders commented 5 years ago

> we can already achieve the same thing with custom task messages

Kinda, but also kinda not as custom task messages aren't exit states so don't work particularly well as switches in workflows. They need to be combined with :succeed or whatever:

foo:succeed & foo:msg1 => bar
foo:succeed & foo:msg2 => baz
bar | baz => pub

This would definitely be a nice feature, I think we may have talked about it in a June meeting a couple of years back? I remember a discussion about the awkwardness of doing this nicely at the moment as script might not be set to a single executable but could be an inline bash-script. There could also be pre-script, init-script, env-script etc, any of which could have produced the non-zero return code.

One of the main positive uses I can imagine would be handling XCPU events.
One of the main negative uses I can imagine is using non-zero exit codes to communicate different success outcomes (as a proxy to task messages).

TomekTrzeciak commented 5 years ago

> we can already achieve the same thing with custom task messages
>
> Kinda, but also kinda not as custom task messages aren't exit states so don't work particularly well as switches in workflows. They need to be combined with :succeed or whatever:

Indeed. Custom messages allow dependent tasks to be kicked off midway through the execution of the triggering task, which is sometimes really useful (e.g. a polling task waiting for forecasts at successive lead times and kicking off their processing as they become available).

matthewrmshin commented 5 years ago

In the current setup, the main issues are:

We have a job script that can fail with a different return code at any point.
The value of *script is not a single command, but a script fragment that can run multiple commands.

What can we do?

The job script can be written in such a way that only the return value of the final statement of script gets used to determine the return code. In https://github.com/cylc/cylc-flow/blob/db8872086857fd8d4ad5dff5b6765bb9c770dcb2/cylc/flow/etc/job.sh#L137-L139 we would capture the return code only when running the script part.
An expected non-zero return code of the above will simply be recorded and the job script will continue to run to completion. On completion, the succeeded message will include the return code.
The task message API will be updated to understand the return code in a succeeded message.
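The three steps above could be modelled roughly as follows. This is a Python sketch of the idea only, not of `job.sh`: the phase names, the `expected_codes` parameter, and the `succeeded/<code>` message format are all assumptions made for illustration.

```python
import subprocess
import sys
from typing import Iterable, List, Tuple


def exit_with(code: int) -> List[str]:
    """Stand-in command that simply exits with the given code."""
    return [sys.executable, "-c", f"raise SystemExit({code})"]


def run_job(
    phases: List[Tuple[str, List[str]]], expected_codes: Iterable[int] = ()
) -> str:
    """Run (name, command) phases in order and return the final message.

    Only the "script" phase may end with an expected non-zero return code;
    that code is recorded and the job carries on. Any other non-zero code
    fails the job immediately.
    """
    expected = set(expected_codes)
    recorded = 0
    for name, command in phases:
        code = subprocess.run(command).returncode
        if code == 0:
            continue
        if name == "script" and code in expected:
            recorded = code  # remember it, but keep running
            continue
        return "failed"
    # Hypothetical format: the succeeded message carries the recorded code.
    return f"succeeded/{recorded}"
```

For example, with phases `pre-script` (exit 0), `script` (exit 3) and `post-script` (exit 0), `run_job(phases, expected_codes=[3])` returns `"succeeded/3"`, whereas the same phases with no expected codes return `"failed"`.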

hjoliver commented 5 years ago

> Kinda, but also kinda not as custom task messages aren't exit states so don't work particularly well as switches in workflows. They need to be combined with :succeed or whatever

Kinda, but also kinda not, but also more kinda than kinda not. As I suggested you would detect the underlying exit status in the script then send the custom message before exiting (immediately or later, do what you need). So for this use case the custom message is more or less as good as a task exit status, and you don't need to worry about using the actual task exit status in the graph as well.

That's not to deny that proper exit statuses would be better, however! (Just saying it's easy enough to work around with current custom messages).

@matthewrmshin's suggestion may be good,

TomekTrzeciak commented 4 years ago

(#3440 should allow the exit code from user scripts to be captured in a consistent manner.)

oliver-sanders commented 1 week ago

This question came up today in a different form, regarding the XCPU signal. This signal is used by (most?) platforms to convey "job hit its execution time limit", which can be useful information in workflow design. The Cylc "background" job runner also uses this signal.

Currently the Cylc job script traps this signal.
And sends back the failed/XCPU message.
However, the logic of task_events_mgr means that you cannot capture this as an output using [outputs]xcpu = failed/XCPU.
So you cannot pull this information into the graph.
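For readers unfamiliar with the mechanism, the trap-and-report behaviour can be sketched in Python roughly as follows. The real job script is Bash; the `report` callable and both function names here are hypothetical, and only the `failed/<SIGNAL>` message shape is taken from this thread.

```python
import signal
import sys
from typing import Callable


def failure_message(signum: int) -> str:
    """Build a failed/<NAME> message, e.g. failed/XCPU for SIGXCPU."""
    name = signal.Signals(signum).name  # e.g. "SIGXCPU"
    return "failed/" + name[len("SIG"):]


def install_trap(signum: int, report: Callable[[str], None]) -> None:
    """Report the failure message and exit non-zero when signum arrives."""
    def handler(received: int, _frame) -> None:
        report(failure_message(received))
        sys.exit(1)

    # Must be called from the main thread, as for any Python signal handler.
    signal.signal(signum, handler)
```

A sketch like this only shows how the message is produced on the job side; it does not change the scheduler-side limitation described above.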

However, with this diff:

diff --git a/cylc/flow/task_events_mgr.py b/cylc/flow/task_events_mgr.py
index f9b4d4c32..7bd5dde7e 100644
--- a/cylc/flow/task_events_mgr.py
+++ b/cylc/flow/task_events_mgr.py
@@ -788,6 +788,18 @@ class TaskEventsManager():
             ):
                 # Already failed.
                 return True
+
+            _completed_output = (
+                itask.state.outputs.set_message_complete(message, forced)
+            )
+            if _completed_output:
+                self.data_store_mgr.delta_task_output(itask, message)
+                trigger = itask.state.outputs.get_trigger(message)
+                LOG.info(f"[{itask}] completed output {trigger}")
+                self.setup_event_handlers(itask, trigger, message)
+                self.spawn_children(itask, message)
+                completed_output = completed_output or _completed_output
+
             signal = message[len(FAIL_MESSAGE_PREFIX):]
             self._db_events_insert(itask, "signaled", signal)
             self.workflow_db_mgr.put_update_task_jobs(

The following example works:

[scheduling]
    [[graph]]
        R1 = """
            foo?
            foo:xcpu? => bar
        """

[runtime]
    [[foo]]
        script = sleep 10
        execution time limit = PT5S
        completion = succeeded or (failed and xcpu)
        [[[outputs]]]
            xcpu = failed/XCPU

    [[bar]]

If we went ahead with something like this, we might want to consider the applicability of this to the other signal "prefixes" that Cylc supports:

https://github.com/cylc/cylc-flow/blob/4b1adfc5b777102a8b9d65fdaa5ba1ce896b46a4/cylc/flow/task_message.py#L47-L49

oliver-sanders commented 1 week ago

Added the question label to flag this for discussion at a future VC when we get the time:

Suggest:

Exposing existing task messages (e.g. failed/XCPU) for now (i.e. something along the lines of the above diff).
Replacing the failed message with failed/<exit-code> in the future (would close this issue).
Documenting that the exit code is a blunt tool (which part of script did this error come from? Use targeted task messages if it is important), but can be useful information for monitoring / debugging.
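As a rough illustration of how signal-style and exit-code-style failure messages could coexist, here is a small Python sketch of a classifier. The `parse_failure` function and its return values are hypothetical, and the prefix value is an assumption that mirrors the `FAIL_MESSAGE_PREFIX` name used in the diff earlier in this thread.

```python
from typing import Optional, Tuple

FAIL_MESSAGE_PREFIX = "failed/"  # assumed value, for illustration only


def parse_failure(message: str) -> Optional[Tuple[str, str]]:
    """Classify a failure message.

    Returns ("exit-code", "2") for "failed/2", ("signal", "XCPU") for
    "failed/XCPU", ("plain", "") for a bare "failed", and None for any
    message that is not a failure message.
    """
    if message == "failed":
        return ("plain", "")
    if not message.startswith(FAIL_MESSAGE_PREFIX):
        return None
    detail = message[len(FAIL_MESSAGE_PREFIX):]
    if not detail:
        return ("plain", "")
    kind = "exit-code" if detail.isdigit() else "signal"
    return (kind, detail)
```

One consequence of sharing a single prefix is that signal names and numeric exit codes have to stay distinguishable, which the digits-only check relies on here.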
