ecmwf / pyflow

A high level Python interface to ecFlow allowing the creation of ecFlow suites in a modular and "pythonic" way
https://pyflow-workflow-generator.readthedocs.io/en/latest/
Apache License 2.0

warn the ecflow server early. before wait #33

Open floriankrb opened 9 months ago

floriankrb commented 9 months ago

@oiffrig This may be useful to prevent tasks from appearing as running on ecflow when they are killed because of a timeout. We had this issue...

floriankrb commented 7 months ago

We observed that when some processes are stuck and do not respond to kill -1, then (because of the "wait") slurm eventually stops the script and ecflow never gets notified. The task appears to be running but is not, RETRY does not trigger, and the user is left waiting. There may be other recent changes already made in the pyflow code that prevent this behaviour.

I think the issue would be solved as long as we can ensure that this does not happen.

corentincarton commented 7 months ago

@floriankrb, could you try the new version we just released? We pushed a fix for the trap mechanism, so it may have fixed your issue. We could work on the wait part and maybe do a sleep or something instead. Feel free to suggest something, but we can't notify ecflow that the job was aborted before the job is actually done, otherwise we could have conflicting jobs if the user requeues the aborted task.
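
For illustration only, here is a minimal Python sketch of the ordering discussed above: signal the child, wait for a bounded time instead of indefinitely, escalate to SIGKILL if it is still alive, and only then report the abort. This is not pyflow's trap code. `stop_then_notify`, `start_stuck_child` and the `notify` callback are hypothetical names, the grace period is an arbitrary value, and the sketch assumes a POSIX system where SIGHUP is available.

```python
import signal
import subprocess
import sys

# Stand-in for a stuck process: it ignores SIGHUP (the signal sent by `kill -1`).
STUCK_CHILD = (
    "import signal, time;"
    "signal.signal(signal.SIGHUP, signal.SIG_IGN);"
    "print('ready', flush=True);"
    "time.sleep(60)"
)


def start_stuck_child():
    """Start the stand-in child and wait until it has installed its handler."""
    proc = subprocess.Popen(
        [sys.executable, "-c", STUCK_CHILD],
        stdout=subprocess.PIPE,
        text=True,
    )
    proc.stdout.readline()  # blocks until the child prints 'ready'
    proc.stdout.close()
    return proc


def stop_then_notify(proc, notify, grace=2.0):
    """Stop `proc` and call `notify` only once it has really exited.

    `notify` is a hypothetical callback; in a real job it would wrap
    whatever call reports the abort to the server.
    """
    proc.send_signal(signal.SIGHUP)
    try:
        # Bounded wait, unlike a bare shell `wait` that can block forever.
        proc.wait(timeout=grace)
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL cannot be caught or ignored
        proc.wait()
    # The child is gone at this point, so a requeued task cannot overlap it.
    notify(proc.returncode)
    return proc.returncode


if __name__ == "__main__":
    child = start_stuck_child()
    stop_then_notify(child, lambda rc: print(f"aborted, child exit code {rc}"), grace=0.5)
```

The point of the bounded wait is that the notification still happens after the child has exited, which keeps the constraint about conflicting jobs, while the total time spent before notifying is limited by the grace period rather than by the scheduler's own kill.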