cylc / cylc-doc

Documentation (User Guide, Cheat Sheets, etc.) for the Cylc Workflow Engine.
https://cylc.github.io/cylc-doc/
GNU General Public License v3.0
9 stars 19 forks source link

polling intervals (and documentation of) #13

Open hjoliver opened 5 years ago

hjoliver commented 5 years ago

Undocumented features of task job polling:

@matthewrmshin - can you a) confirm this behaviour; and b) confirm that it is intentional? (If so, we just need to document it)

Also, the logged health-check info seems to be misleading or meaningless in light of the above behaviour. E.g. for execution time limit = PT10S and execution timeout polling intervals = 6*PT5S:

...INFO - [foo.1] -health check settings: execution timeout=PT40S, polling intervals=PT15S,5*PT5S,...

where PT40S is PT10S plus 6*PT5S. Misleading because the "execution timeout is really still PT10S (e.g. as translated to PBS wall time directive) and in any case (as explained above) polling will not actually stop at PT40S - it will keep on polling until the job polls as finished. So what does this PT40S represent exactly? (Or do we actually set the timeout event to go off at PT40S in this case?)

matthewrmshin commented 5 years ago

PT15M default: Yes. This ensures that tasks are health checked regularly - no chance of task states not being picked up on hard kill, dodgy messaging, etc.

https://github.com/cylc/cylc/blob/fa1fec873bf0a15a60a21dc3f7ca214dfbdeff3f/lib/cylc/task_events_mgr.py#L935-L963

Infinite repeat: Yes. Same here - just in case the exit sequence of a job is stuck for whatever reason.

https://github.com/cylc/cylc/blob/fa1fec873bf0a15a60a21dc3f7ca214dfbdeff3f/lib/cylc/task_events_mgr.py#L151-L174

Also correct on time out. The execution timeout duration is the sum of the execution time limit and the specified polling intervals.

Logic is pretty crazy here:

https://github.com/cylc/cylc/blob/fa1fec873bf0a15a60a21dc3f7ca214dfbdeff3f/lib/cylc/task_events_mgr.py#L917-L989

hjoliver commented 5 years ago

PT15M default: Yes...

Good (but undocumented).

Infinite repeat: Yes...

Good (but undocumented).

Also correct on time out. The execution timeout duration is the sum of the execution time limit and the specified polling intervals.

Ah, so that's literally when the timeout event goes off. It's less clear to me that this is good. What if the suite explicitly sets [events]execution timeout - should that be overridden by this computed value?

matthewrmshin commented 5 years ago

Execution time limit based value overrides execution timeout.

hjoliver commented 5 years ago

Execution time limit based value overrides execution timeout.

Hmmm. Yes, by experiment, execution time limit gets translated directly to PBS wall_time (for example), but the computed "timeout" value of execution time limit + specified execution timeout polling intervals overrides an explicitly-set [events]execution timeout value. I think that's what you've just said. But the question is, is that desirable?? If I set an execution timeout, surely that's when I want the timeout event to occur. I'm not sure that timeout should be automatically modified by polling intervals continuing beyond that.

matthewrmshin commented 5 years ago

This goes back to when we first considered the implementation of execution time limit. We argued that if execution time limit is specified, then the execution timeout setting should become meaningless.

In the current logic, time out should occur when the batch system terminates on job for exceeding its wall clock limit - but we give it some extra time with some polling intervals to allow a busy batch system to terminate the job and clean up. On exceeding the wall clock, the job should have returned a signalled message or our polling would have picked up the mess.

The system emits a time out event when the batch system is still detected to be holding on to our job X minutes after it should have terminated it - it tells the user that something is not quite right with the system and may require human intervention.

hjoliver commented 5 years ago

OK, I'm happy with that (and vaguely recall that discussion now) - a very clear explanation, thanks. But we had better document all this, as it ain't exactly obvious. (We can pretty much cut-n-paste your previous comment!).