ga4gh / workflow-execution-service-schemas

The WES API is a standard way to run and manage portable workflows.
Apache License 2.0
82 stars 38 forks source link

Semantics of error states ambiguous #179

Open uniqueg opened 2 years ago

uniqueg commented 2 years ago

Description of the problem

Running different WES implementations, we ran into the issue that the definition of state EXECUTOR_ERROR is somewhat ambiguous.

The relevant states are currently defined as:

    + COMPLETE: The task has completed running. Executors have exited without error
    and output files have been successfully uploaded.
    + EXECUTOR_ERROR: The task encountered an error in one of the Executor processes. Generally,
  this means that an Executor exited with a non-zero exit code.
    + SYSTEM_ERROR: The task was stopped due to a system error, but not from an Executor,
  for example an upload failed due to network issues, the worker's ran out of disk space, etc.

Now, the defintion of COMPLETE perhaps makes it fairly clear that it should only be assigned if a workflow completed successfully with no issues whatsoever (return_code is 0). However, what is perhaps not so clear is when an EXECUTOR_ERROR should be assigned. Would it include only failures of jobs or system errors"of the executor itself as well. The EXECUTOR_ERROR definition rather suggests the latter, the SYSTEM_ERROR definition rather the former.

The question becomes perhaps clearer in a WES/TES setting. Should WES return EXECUTOR_ERROR even if a TES job returns a SYSTEM_ERROR (note that the definitions of EXECUTOR_ERROR and SYSTEM_ERROR are the same for WES and TES)? It's arguably an error "in the executor" either way, and so by the definition of SYSTEM_ERROR in WES is not eligible to be assigned a WES SYSTEM_ERROR.

Personally, I feel that SYSTEM_ERRORs in executors should be assigned SYSTEM_ERRORs in WES as well, because of the implications this has for users: For an EXECUTOR_ERROR there is likely something wrong on the user's end, and the user should carefully check the workflow and run request before bothering with trying again. Whereas for a SYSTEM_ERROR the user might want to try again and file a bug report if the error persists.

Possible solution

Make it more clear what is and what isn't to be considered an EXECUTOR_ERROR and a SYSTEM_ERROR on the level of WES, e.g., by including what the states signify to the client/user.

patmagee commented 2 years ago

@uniqueg thanks for raising this. Ambiguity here has also lead me to likely misusing these error states in different cases as well. I have typically held to the broad definition the SYSTEM_ERROR is something related to network issues, quota (if running on a cloud), or infrastructure related problems (ie system goes down, db goes down etc), while EXECUTOR_ERROR is pretty much everything else.

To me EXECUTOR_ERROR has always been a bit strange, the name seems to hint at an application error as opposed to an error encountered during the running of a workflow. Thinking about this from a user's perspective, I think they broadly only care about whether their workflow completed successfully or not. A simpler team may have been failed.. but this tangent is a bit beside the point.

IMO the difference between a SYSTEM_ERROR happening in the WES server, or in distributed TES backend should not be exposed to the user. To the user, they are belong to a single API and so should be treated in a uniform way when conveying state. To that end I think it makese sense to return SYSTEM_ERROR when the TES backend (or any other beckend component) malfunctions

uniqueg commented 2 years ago

Thanks @patmagee. I generally agree with your interpretation, though I thought that there is perhaps some useful distinction for the user:

However, certainly this is at best a rule of thumb. Surely there will be cases in any implementation, where re-running a SYSTEM_ERROR would not have any chance of succeeding when replaying the request, as well as cases where an EXECUTOR_ERROR might still not be the user's "fault".