ga4gh / workflow-execution-service-schemas

The WES API is a standard way to run and manage portable workflows.
Apache License 2.0
82 stars 38 forks source link

Field for run/task error messages (beyond stderr) #195

Open cjllanwarne opened 1 year ago

cjllanwarne commented 1 year ago

Looking at the WES spec I cannot find a way to return an error message explaining why a run or task failed.

There is a field for stderr but that has different semantics than a task error or run error, so doesn't quite fit.

As an example, imagine a task looks good but emits some logging information to its stderr. It then fails because a file cannot be delocalized successfully.

The natural semantics of the stderr field is that it should come directly from the task, but that leaves no place to put the overall "reason the task is a failure", which is an engine level delocalization message.

Another motivating use case at the run level: say a workflow script is badly formed. We want to be able to mark that run as failed, and give it an error message, but that error has nothing to do with stderr and thus that field is inappropriate.

Recommendation:

uniqueg commented 1 year ago

Thanks @cjllanwarne, an interesting suggestion.

However, on the task level, I'm not quite sure I understand why you think that stderr doesn't fit. In my understanding, if a task fails, it is up to the individual tool to let the user know what went wrong, typically via the STDERR stream. A workflow engine would have no (semantic) understanding of what exactly went wrong (and neither has the platform executing the task/container), so all they do is log that there has been an error, together with the tool's STDERR stream and exit status. Or am I missing something?

On the WES side, in such a scenario, I believe RunLog.state should be set to EXECUTION_ERROR, the tool's STDERR stream should be made available in that tool's RunLog.task_logs.$.stderr property, and the workflow's STDERR stream should be made available in the run's RunLog.run_log.stderr property (and the exit_codes set for both task and workflow).

Things may be different for workflow engine errors, as the engine should know, e.g., about a badly formed workflow or issues with the task execution engines, and could therefore report errors more directly/explicitly along the lines you mention. WES wrappers could then parse the engine logs to populate such a field (or perhaps if an engine is aware that it is behind a WES, it could inform the WES more directly). And I could totally see that being useful (certainly more user friendly than showing the last n lines of the engine's STDERR stream).

cjllanwarne commented 1 year ago

So one example might be that a cloud executor kills a task because it ran out of memory - but the task itself did not emit any stdout, it just got killed.

In that case, the task error message could be something like "task killed by supervisor", and stderr should be the empty string.

wleepang commented 1 year ago

Rather than an errors list of strings, how about a statusMessage optional string property? This broadens the scope of the message to handle statuses other than EXECUTOR_ERROR at the run level or FAILED at the task level.

cjllanwarne commented 1 year ago

A statusMessage could work too, as long as we could reliably choose to forward it to users as an error if the workflow or task failed.

To be honest, I find APIs where fields have a single, clear purpose to be much easier to use than APIs with "cover all bases" fields, because "cover all bases" fields tend to need human understanding to use, whereas simple clear fields are easy to build automations around.

uniqueg commented 1 year ago

I agree that clear semantics are important, but I think if cross-referencing another field resolves any ambiguity, then I think it's fine to overload a field. And checking the workflow's or task's state would provide the necessary information to put statusMessages in context.

Still, I'm wondering: Are we only talking about SYSTEM_ERRORs here? I still don't quite see how, in the absence of standardized error reporting, we could have WES and TES servers understand workflow engine or tool errors semantically. And so I don't see how they could provide errors or statusMessages field for these errors. The examples @cjllanwarne gave all sound like SYSTEM_ERRORS to me. Conversely, stderr should be the right place to find out more details about EXECUTION_ERRORs, so I suppose that the suggested change would complement this very well, and this distinction could potentially also help address #179.

cjllanwarne commented 1 year ago

Are we only talking about SYSTEM_ERRORs here?

Yes. But just because an error is a system error doesn't mean we don't want error message information for it. And having an error message from the system and a stderr print out from the tool itself are separately useful things to have.

uniqueg commented 1 year ago

Absolutely agree with you. Just wanted to clarify.

Anyway, I certainly would like to see a PR for this, in case you are interested to raise one @cjllanwarne :) Guess that would make any remaining discussions more focused (if any)

patmagee commented 1 year ago

@cjllanwarne would #205 address some of your concerns at least at the task level?