DIRACGrid / DIRAC

DIRAC Grid
http://diracgrid.org
GNU General Public License v3.0

Incorrect error strings reported for end-user application failures #7776

Closed. sfayer closed this issue 2 weeks ago

sfayer commented 3 weeks ago

Hi,

We've had a long-running annoyance where the ApplicationStatus reported for failed user jobs carries the wrong error message. For example, if a user job script runs an invalid command, bash exits with code 127, and the error displayed in the ApplicationStatus field is "Key has expired ( 127 : submit Exited With Status 127)". Similarly, if the user payload returns exit code 1 for any reason, the field shows "Operation not permitted ( 1 : submit Exited With Status 1)". This confuses users ("What key has expired? What am I not permitted to do?"), but in both cases the message is wrong: the bash exit code has been passed through strerror as if it were an errno value, which it isn't.
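
The mismatch can be reproduced outside DIRAC. The following is a minimal sketch, assuming a POSIX `sh` is on the PATH; the command name is made up so that the shell cannot find it. It only shows that `os.strerror` interprets whatever integer it is given as an errno value, and the exact message text depends on the platform's C library.

```python
import os
import subprocess

# Run a command that does not exist; POSIX shells exit with status 127.
# The command name is a placeholder chosen so that lookup fails.
result = subprocess.run(
    ["sh", "-c", "definitely_not_a_real_command_xyz"],
    stderr=subprocess.DEVNULL,
)
exit_code = result.returncode

# os.strerror() treats its argument as an errno value, so the text it
# returns is unrelated to the reason the shell exited.
misleading = os.strerror(exit_code)

print(f"shell exit code: {exit_code}")
print(f"strerror({exit_code}): {misleading}")
print(f"strerror(1): {os.strerror(1)}")
```

On Linux, errno 127 is EKEYEXPIRED, which would explain the "Key has expired" text quoted above.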

I think this happens here: the application exit code is included in a RuntimeError: https://github.com/DIRACGrid/DIRAC/blob/b627e055f1e6905eafb2a8bb667ff8759a7bf969/src/DIRAC/Workflow/Modules/Script.py#L134 It is then set as the error number in S_ERROR, which runs strerror on it: https://github.com/DIRACGrid/DIRAC/blob/b627e055f1e6905eafb2a8bb667ff8759a7bf969/src/DIRAC/Workflow/Modules/ModuleBase.py#L142
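
The propagation path described here can be sketched in isolation. This is not the DIRAC code: `errno_style_error`, `run_payload` and `execute` are hypothetical stand-ins, and the message layout is simply copied from the strings quoted in this report. The sketch only illustrates how an exit status carried as an exception argument can end up being looked up as an errno.

```python
import os


def errno_style_error(code, message):
    # Hypothetical stand-in for an errno-aware error constructor: it
    # looks the code up with strerror as if it were an errno value.
    text = f"{os.strerror(code)} ( {code} : {message})"
    return {"OK": False, "Errno": code, "Message": text}


def run_payload(status):
    # Hypothetical payload runner: the exit status travels as the
    # second argument of the exception.
    if status != 0:
        raise RuntimeError(f"submit Exited With Status {status}", status)
    return {"OK": True}


def execute(status):
    try:
        return run_payload(status)
    except RuntimeError as exc:
        if len(exc.args) > 1:
            # The exit status is reused as an error number here, which
            # is where the unrelated strerror text gets attached.
            return errno_style_error(exc.args[1], exc.args[0])
        return {"OK": False, "Errno": 0, "Message": str(exc)}


result = execute(127)
print(result["Message"])
```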

Would it be possible to somehow prevent these exit codes getting processed into errno style error messages?
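
One possible direction, sketched under assumptions: build the status string from the exit code directly and never hand it to strerror. `describe_exit_status` is a hypothetical helper, not an existing DIRAC function. The hints for 126, 127 and values above 128 follow common shell conventions, which a payload is free to ignore.

```python
import signal


def describe_exit_status(name, status):
    """Describe a payload exit status without consulting errno tables."""
    if status == 0:
        return f"{name} completed successfully"
    if status == 126:
        hint = "command found but not executable"
    elif status == 127:
        hint = "command not found"
    elif status > 128:
        # Shells conventionally report 128 + N for death by signal N.
        try:
            sig_name = signal.Signals(status - 128).name
            hint = f"terminated by signal {sig_name}"
        except ValueError:
            hint = "possibly terminated by a signal"
    else:
        hint = "application-defined failure"
    return f"{name} Exited With Status {status} ({hint})"


for code in (1, 127, 137):
    print(describe_exit_status("submit", code))
```

The original "Exited With Status" wording is kept so that anything matching on that string would still see it; only the errno-derived prefix is dropped.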

Regards, Simon

fstagni commented 2 weeks ago

@arrabito do you have the same issue?

chrisburr commented 2 weeks ago

This was added for https://github.com/DIRACGrid/DIRAC/pull/3394. We probably don't need this option any more, as we no longer use it in LHCb.

arrabito commented 2 weeks ago

Yes, we have the same issue.