ExaWorks / psij-python


CANCELED status after cancel() #470

Open hategan opened 4 months ago

hategan commented 4 months ago

Some LRMs (e.g., PBS) don't have a specific way to mark jobs as canceled. Currently, the PBS executor looks for an exit code of 265 (256 + 9, i.e., the job was killed by SIGKILL), but this does not seem to be reliable.

A possible solution would be to mark a job as CANCELED whenever it is detected as ended and cancel() was previously called. This may hide corner cases in which a failure unrelated to cancel() happens after the call, and such failures may be relevant in some situations. However, in most normal usage, it should correctly mark canceled jobs as canceled. We can mitigate the corner cases by logging the actual exit status of the job.
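
A rough sketch of the idea, not the actual PSI/J code: `State`, `TrackedJob`, and `final_state` are hypothetical names made up for illustration, and the COMPLETED/FAILED mapping for jobs that were not canceled is simplified to "exit code 0 or not".

```python
import logging
from enum import Enum
from typing import Optional

logger = logging.getLogger(__name__)


class State(Enum):
    # Hypothetical stand-in for a job state enum; not PSI/J's actual class.
    ACTIVE = 'ACTIVE'
    COMPLETED = 'COMPLETED'
    FAILED = 'FAILED'
    CANCELED = 'CANCELED'


class TrackedJob:
    """Hypothetical job record that remembers whether cancel() was called."""

    def __init__(self, native_id: str) -> None:
        self.native_id = native_id
        self.cancel_requested = False
        self.state = State.ACTIVE
        self.exit_code: Optional[int] = None

    def cancel(self) -> None:
        # A real executor would also ask the LRM to delete the job here.
        self.cancel_requested = True


def final_state(job: TrackedJob, exit_code: Optional[int]) -> State:
    """Map an ended job to a final state, preferring CANCELED after cancel()."""
    job.exit_code = exit_code
    if job.cancel_requested:
        # Keep the real exit status visible so unrelated failures can be traced.
        logger.info('Job %s ended after cancel(); actual exit status: %s',
                    job.native_id, exit_code)
        job.state = State.CANCELED
    elif exit_code == 0:
        job.state = State.COMPLETED
    else:
        job.state = State.FAILED
    return job.state
```

With this approach, the exit code no longer decides whether a job counts as canceled. It is only recorded and logged, so a job that ends with any status after cancel() is reported as CANCELED, while the same status without a prior cancel() is still reported as a failure.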

hategan commented 4 months ago

@andre-merzky, thoughts?