Open cjllanwarne opened 1 year ago
Thanks @cjllanwarne, an interesting suggestion.
However, on the task level, I'm not quite sure I understand why you think that stderr
doesn't fit. In my understanding, if a task fails, it is up to the individual tool to let the user know what went wrong, typically via the STDERR
stream. A workflow engine would have no (semantic) understanding of what exactly went wrong (and neither has the platform executing the task/container), so all they do is log that there has been an error, together with the tool's STDERR
stream and exit status. Or am I missing something?
On the WES side, in such a scenario, I believe RunLog.state
should be set to EXECUTION_ERROR
, the tool's STDERR
stream should be made available in that tool's RunLog.task_logs.$.stderr
property, and the workflow's STDERR
stream should be made available in the run's RunLog.run_log.stderr
property (and the exit_code
s set for both task and workflow).
Things may be different for workflow engine errors, as the engine should know, e.g., about a badly formed workflow or issues with the task execution engines, and could therefore report errors more directly/explicitly along the lines you mention. WES wrappers could then parse the engine logs to populate such a field (or perhaps if an engine is aware that it is behind a WES, it could inform the WES more directly). And I could totally see that being useful (certainly more user friendly than showing the last n lines of the engine's STDERR
stream).
So one example might be that a cloud executor kills a task because it ran out of memory - but the task itself did not emit any stdout, it just got killed.
In that case, the task error message could be something like "task killed by supervisor", and stderr should be the empty string.
Rather than an errors
list of strings, how about a statusMessage
optional string property? This broadens the scope of the message to handle statuses other than EXECUTOR_ERROR
at the run level or FAILED
at the task level.
A statusMessage
could work too, as long as we could reliably choose to forward it to users as an error if the workflow or task failed.
To be honest, I find APIs where fields have a single, clear purpose to be much easier to use than APIs with "cover all bases" fields, because "cover all bases" fields tend to need human understanding to use, whereas simple clear fields are easy to build automations around.
I agree that clear semantics are important, but I think if cross-referencing another field resolves any ambiguity, then I think it's fine to overload a field. And checking the workflow's or task's state would provide the necessary information to put statusMessages
in context.
Still, I'm wondering: Are we only talking about SYSTEM_ERROR
s here? I still don't quite see how, in the absence of standardized error reporting, we could have WES and TES servers understand workflow engine or tool errors semantically. And so I don't see how they could provide errors
or statusMessages
field for these errors. The examples @cjllanwarne gave all sound like SYSTEM_ERRORS
to me. Conversely, stderr
should be the right place to find out more details about EXECUTION_ERROR
s, so I suppose that the suggested change would complement this very well, and this distinction could potentially also help address #179.
Are we only talking about SYSTEM_ERRORs here?
Yes. But just because an error is a system error doesn't mean we don't want error message information for it. And having an error message from the system and a stderr print out from the tool itself are separately useful things to have.
Absolutely agree with you. Just wanted to clarify.
Anyway, I certainly would like to see a PR for this, in case you are interested to raise one @cjllanwarne :) Guess that would make any remaining discussions more focused (if any)
@cjllanwarne would #205 address some of your concerns at least at the task level?
Looking at the WES spec I cannot find a way to return an error message explaining why a run or task failed.
There is a field for
stderr
but that has different semantics than a task error or run error, so doesn't quite fit.As an example, imagine a task looks good but emits some logging information to its
stderr
. It then fails because a file cannot be delocalized successfully.The natural semantics of the
stderr
field is that it should come directly from the task, but that leaves no place to put the overall "reason the task is a failure", which is an engine level delocalization message.Another motivating use case at the run level: say a workflow script is badly formed. We want to be able to mark that run as failed, and give it an error message, but that error has nothing to do with stderr and thus that field is inappropriate.
Recommendation:
RunLog
andTaskLog
an additional, and optional,errors
field, containing a list of strings describing why the engine decided to mark that run or task as failed.