Open sadielbartholomew opened 5 years ago
I'll just note that we can already achieve the same thing with custom task messages - by translating (in job scripting) application return codes into meaningful messages, and triggering tasks off of those. However, for applications that do have well-defined return codes for specific error conditions, this is a good proposal (as it reduces effort - no need to use custom task messages).
Ah, nice, that's a good point! Thanks @hjoliver. I guess the crux of this Issue then becomes making it simpler & more explicit to set up exit-code-specific triggering, via the suite.rc instead of individual custom task messages.
It's a speculative one perhaps for future, so there isn't too much more to say right now I don't think!
we can already achieve the same thing with custom task messages
Kinda, but also kinda not as custom task messages aren't exit states so don't work particularly well as switches in workflows. They need to be combined with :succeed or whatever:
foo:succeed & foo:msg1 => bar
foo:succeed & foo:msg2 => baz
bar | baz => pub
This would definitely be a nice feature, I think we may have talked about it in a June meeting a couple of years back? I remember a discussion about the awkwardness of doing this nicely at the moment as script might not be set to a single executable but could be an inline bash-script. There could also be pre-script, init-script, env-script etc, any of which could have produced the non-zero return code.
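To make the ambiguity above concrete, here is a rough Python sketch that runs several stand-in fragments and keeps only the return code of the one named script. The fragment contents and the run_fragments helper are made up for illustration. Each fragment runs in its own shell here, whereas a real Cylc job script runs them in one shell, so this is only a model of the idea and not how the job script works.

```python
import subprocess

# Hypothetical stand-ins for the script fragments of one task. The names
# mirror the settings mentioned above; the contents are invented.
FRAGMENTS = {
    "init-script": "true",
    "env-script": "true",
    "pre-script": "exit 3",
    "script": "exit 1",
}


def run_fragments(fragments):
    """Run each fragment in its own shell and record its return code."""
    codes = {}
    for name, body in fragments.items():
        codes[name] = subprocess.run(["sh", "-c", body]).returncode
    return codes


codes = run_fragments(FRAGMENTS)
# Only the code from the "script" fragment would feed exit-code triggers.
script_code = codes["script"]
print(codes, script_code)
```

With this split, a non-zero code from pre-script is still visible but is not mistaken for the application's own exit status.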
we can already achieve the same thing with custom task messages
Kinda, but also kinda not as custom task messages aren't exit states so don't work particularly well as switches in workflows. They need to be combined with :succeed or whatever:
Indeed. Custom messages allow dependent tasks to be kicked off midway through execution of the triggering task, which is sometimes really useful (e.g. a polling task waiting for forecasts of successive lead times and kicking off their processing as they become available).
In the current set up, the main issues are:
*script is not a single command, but a script fragment that can run multiple commands.
What can we do?
script gets used to determine the return code. In https://github.com/cylc/cylc-flow/blob/db8872086857fd8d4ad5dff5b6765bb9c770dcb2/cylc/flow/etc/job.sh#L137-L139 we would capture the return code only when running the script part.
Kinda, but also kinda not as custom task messages aren't exit states so don't work particularly well as switches in workflows. They need to be combined with :succeed or whatever
Kinda, but also kinda not, but also more kinda than kinda not. As I suggested you would detect the underlying exit status in the script then send the custom message before exiting (immediately or later, do what you need). So for this use case the custom message is more or less as good as a task exit status, and you don't need to worry about using the actual task exit status in the graph as well.
That's not to deny that proper exit statuses would be better, however! (Just saying it's easy enough to work around with current custom messages).
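For what it's worth, the workaround described above could look roughly like the Python wrapper below: run the application, look up its return code, and report a matching custom message before exiting. The CODE_TO_MESSAGE mapping and the message names are hypothetical, and send_message is a placeholder that only prints; a real job would send a Cylc task message at that point.

```python
import subprocess
import sys

# Hypothetical mapping from application return codes to custom message names.
CODE_TO_MESSAGE = {1: "msg1", 2: "msg2"}


def message_for(returncode):
    """Return the custom message for a return code, or None if unmapped."""
    return CODE_TO_MESSAGE.get(returncode)


def send_message(message):
    """Placeholder: a real job would send a Cylc task message here."""
    print(f"would send custom task message: {message}")


def run_and_report(command):
    """Run the application, report a mapped message, return its code."""
    returncode = subprocess.run(command).returncode
    message = message_for(returncode)
    if message is not None:
        send_message(message)
    return returncode


# Stand-in application that exits with status 2.
rc = run_and_report([sys.executable, "-c", "import sys; sys.exit(2)"])
print(rc)
```

Whether the wrapper then exits zero or non-zero is a separate choice, which is what decides if the graph also needs :succeed or :fail alongside the message.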
@matthewrmshin's suggestion may be good.
(#3440 should allow capturing the exit code from user scripts in a consistent manner.)
This question came up today in a different form regarding the XCPU signal. This signal is used by (most?) platforms to convey "job hit its execution time limit", which can be useful information in workflow design. The Cylc "background" job runner also uses this signal.
failed/XCPU message.
task_events_mgr means that you cannot capture this as an output using [outputs]xcpu = failed/XCPU.
However, with this diff:
diff --git a/cylc/flow/task_events_mgr.py b/cylc/flow/task_events_mgr.py
index f9b4d4c32..7bd5dde7e 100644
--- a/cylc/flow/task_events_mgr.py
+++ b/cylc/flow/task_events_mgr.py
@@ -788,6 +788,18 @@ class TaskEventsManager():
             ):
                 # Already failed.
                 return True
+
+            _completed_output = (
+                itask.state.outputs.set_message_complete(message, forced)
+            )
+            if _completed_output:
+                self.data_store_mgr.delta_task_output(itask, message)
+                trigger = itask.state.outputs.get_trigger(message)
+                LOG.info(f"[{itask}] completed output {trigger}")
+                self.setup_event_handlers(itask, trigger, message)
+                self.spawn_children(itask, message)
+            completed_output = completed_output or _completed_output
+
             signal = message[len(FAIL_MESSAGE_PREFIX):]
             self._db_events_insert(itask, "signaled", signal)
             self.workflow_db_mgr.put_update_task_jobs(
The following example works:
[scheduling]
    [[graph]]
        R1 = """
            foo?
            foo:xcpu? => bar
        """
[runtime]
    [[foo]]
        script = sleep 10
        execution time limit = PT5S
        completion = succeeded or (failed and xcpu)
        [[[outputs]]]
            xcpu = failed/XCPU
    [[bar]]
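As a rough model of what the example above relies on, the Python sketch below maps an incoming failed/XCPU message to the xcpu output and checks the completion expression. This is not Cylc's implementation: the "failed/" prefix value, the OUTPUTS dict and both helper functions are assumptions made for illustration.

```python
# Assumed prefix of signal failure messages, e.g. "failed/XCPU".
FAIL_MESSAGE_PREFIX = "failed/"

# Mirrors "[[[outputs]]] xcpu = failed/XCPU" from the example above.
OUTPUTS = {"xcpu": "failed/XCPU"}


def trigger_for(message, outputs):
    """Return the output name whose message matches, or None."""
    for trigger, output_message in outputs.items():
        if output_message == message:
            return trigger
    return None


def is_complete(completed):
    """Model of: completion = succeeded or (failed and xcpu)."""
    return "succeeded" in completed or (
        "failed" in completed and "xcpu" in completed
    )


message = "failed/XCPU"
# Same slicing as in the diff: strip the prefix to get the signal name.
signal = message[len(FAIL_MESSAGE_PREFIX):]
completed = {"failed", trigger_for(message, OUTPUTS)}
print(signal, is_complete(completed))
```

In this model a plain failure without the xcpu output leaves the task incomplete, which is the distinction the graph depends on.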
If we went ahead with something like this, we might want to consider the applicability of this to the other signal "prefixes" that Cylc supports:
Added the question label to flag this for discussion at a future VC when we get the time:
Suggest:
failed/XCPU) for now (i.e. something along the lines of the above diff).
failed message with failed/<exit-code> in the future (would close this issue).
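If a failed/<exit-code> message did exist alongside failed/<SIGNAL>, telling the two apart could be as simple as the Python sketch below. The message format, the "failed/" prefix and the parse_failure helper are all assumptions, since nothing of the sort has been specified.

```python
FAIL_PREFIX = "failed/"  # assumed prefix, not a confirmed format


def parse_failure(message):
    """Classify a failure message.

    Returns ("exit", code) for a numeric detail in 1..255,
    ("signal", name) for any other non-empty detail, else None.
    """
    if not message.startswith(FAIL_PREFIX):
        return None
    detail = message[len(FAIL_PREFIX):]
    if not detail:
        return None
    if detail.isdigit() and 1 <= int(detail) <= 255:
        return ("exit", int(detail))
    return ("signal", detail)


for example in ("failed/2", "failed/XCPU", "succeeded"):
    print(example, parse_failure(example))
```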
As well as the current failure task state resulting from any non-zero exit status from that task's script, we could support triggering off of specific exit statuses. For example (using a syntax with parentheses for illustration, though I am aware that syntax may not be viable):
This graph would distinguish & take a different scheduling course depending on whether bar fails with exit code 1, or 2, or any other non-zero code.
While users are perhaps unlikely to have need to differentiate between direct script setting exit cases, I raise this because with this feature exit codes would essentially become parameters allowing for greatly extended control in scheduling. Instead of only having standard task "final" states of succeeded, failed & submit-failed (& in a sense expired, which is a final state of sorts I understand), there would be essentially unlimited (in practice, 255) possible endpoints available for users to catch in their scripts to trigger off a myriad of possible cases arising in them. Though, it would be a separate specification (e.g. the parentheses syntax); I am not suggesting the standard failure (& success) cases should go, as users would often not need this advanced flexibility.
Illustrative example
As a superficial example, note how various end cases of interest can be used to branch the scheduling in the below. Naturally, in a real case, the code would be much more involved; imagine the sys.exit(N) calls are placed at points of interest in the script control flow each with some chosen N = 0, ..., 255.
Suite.rc snippet:
Python script bin/failure-mode-demo.py
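A minimal sketch of a script of this kind, with sys.exit(N) calls at a few points of interest. The checks, the exit code constants and the exit_code_of helper are hypothetical and only illustrate the pattern. The demo at the bottom catches SystemExit so that several cases can be shown in one run.

```python
import sys

# Hypothetical exit codes for the end cases of interest.
EXIT_MISSING_INPUT = 1
EXIT_NEGATIVE_VALUE = 2


def run(values):
    """Exit with a distinct code for each end case of interest."""
    if not values:
        sys.exit(EXIT_MISSING_INPUT)
    if any(value < 0 for value in values):
        sys.exit(EXIT_NEGATIVE_VALUE)
    sys.exit(0)


def exit_code_of(values):
    """Demo helper: report the code run() would exit with."""
    try:
        run(values)
    except SystemExit as exc:
        return exc.code


for case in ([], [1.0, -2.0], [1.0, 2.0]):
    print(case, exit_code_of(case))
```

Used as a task script, it would call run() directly on values parsed from sys.argv, so the process exit status carries the chosen code.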