Open uniqueg opened 2 years ago
@uniqueg thanks for raising this. Ambiguity here has also lead me to likely misusing these error states in different cases as well. I have typically held to the broad definition the SYSTEM_ERROR
is something related to network issues, quota (if running on a cloud), or infrastructure related problems (ie system goes down, db goes down etc), while EXECUTOR_ERROR
is pretty much everything else.
To me EXECUTOR_ERROR
has always been a bit strange, the name seems to hint at an application error as opposed to an error encountered during the running of a workflow. Thinking about this from a user's perspective, I think they broadly only care about whether their workflow completed successfully or not. A simpler team may have been failed.. but this tangent is a bit beside the point.
IMO the difference between a SYSTEM_ERROR
happening in the WES server, or in distributed TES backend should not be exposed to the user. To the user, they are belong to a single API and so should be treated in a uniform way when conveying state. To that end I think it makese sense to return SYSTEM_ERROR
when the TES
backend (or any other beckend component) malfunctions
Thanks @patmagee. I generally agree with your interpretation, though I thought that there is perhaps some useful distinction for the user:
EXECUTOR_ERROR
, it will hardly make sense for the user to try to re-run the same request (likely something is wrong with the workflow/task configuration, causing the workflow engine or a tool to fail).SYSTEM_ERROR
, they might well try to re-run, because it could be due to a temporary problem of the WES/TES or similar.However, certainly this is at best a rule of thumb. Surely there will be cases in any implementation, where re-running a SYSTEM_ERROR
would not have any chance of succeeding when replaying the request, as well as cases where an EXECUTOR_ERROR
might still not be the user's "fault".
Description of the problem
Running different WES implementations, we ran into the issue that the definition of state
EXECUTOR_ERROR
is somewhat ambiguous.The relevant states are currently defined as:
Now, the defintion of
COMPLETE
perhaps makes it fairly clear that it should only be assigned if a workflow completed successfully with no issues whatsoever (return_code
is0
). However, what is perhaps not so clear is when anEXECUTOR_ERROR
should be assigned. Would it include only failures of jobs or system errors"of the executor itself as well. TheEXECUTOR_ERROR
definition rather suggests the latter, theSYSTEM_ERROR
definition rather the former.The question becomes perhaps clearer in a WES/TES setting. Should WES return
EXECUTOR_ERROR
even if a TES job returns aSYSTEM_ERROR
(note that the definitions ofEXECUTOR_ERROR
andSYSTEM_ERROR
are the same for WES and TES)? It's arguably an error "in the executor" either way, and so by the definition ofSYSTEM_ERROR
in WES is not eligible to be assigned a WESSYSTEM_ERROR
.Personally, I feel that
SYSTEM_ERROR
s in executors should be assignedSYSTEM_ERROR
s in WES as well, because of the implications this has for users: For anEXECUTOR_ERROR
there is likely something wrong on the user's end, and the user should carefully check the workflow and run request before bothering with trying again. Whereas for aSYSTEM_ERROR
the user might want to try again and file a bug report if the error persists.Possible solution
Make it more clear what is and what isn't to be considered an
EXECUTOR_ERROR
and aSYSTEM_ERROR
on the level of WES, e.g., by including what the states signify to the client/user.