Open hjoliver opened 5 years ago
PT15M default: Yes. This ensures that tasks are health checked regularly - no chance of task states not being picked up on hard kill, dodgy messaging, etc.
Infinite repeat: Yes. Same here - just in case the exit sequence of a job is stuck for whatever reason.
Also correct on time out. The execution timeout duration is the sum of the execution time limit and the specified polling intervals.
Logic is pretty crazy here:
PT15M default: Yes...
Good (but undocumented).
Infinite repeat: Yes...
Good (but undocumented).
Also correct on time out. The execution timeout duration is the sum of the execution time limit and the specified polling intervals.
Ah, so that's literally when the timeout event goes off. It's less clear to me that this is good. What if the suite explicitly sets [events]execution timeout
- should that be overridden by this computed value?
Execution time limit based value overrides execution timeout.
Execution time limit based value overrides execution timeout.
Hmmm. Yes, by experiment, execution time limit
gets translated directly to PBS wall_time
(for example), but the computed "timeout" value of execution time limit
+ specified execution timeout polling intervals
overrides an explicitly-set [events]execution timeout
value. I think that's what you've just said. But the question is, is that desirable?? If I set an execution timeout, surely that's when I want the timeout event to occur. I'm not sure that timeout should be automatically modified by polling intervals continuing beyond that.
This goes back to when we first considered the implementation of execution time limit
. We argued that if execution time limit
is specified, then the execution timeout
setting should become meaningless.
In the current logic, time out should occur when the batch system terminates on job for exceeding its wall clock limit - but we give it some extra time with some polling intervals to allow a busy batch system to terminate the job and clean up. On exceeding the wall clock, the job should have returned a signalled message or our polling would have picked up the mess.
The system emits a time out event when the batch system is still detected to be holding on to our job X minutes after it should have terminated it - it tells the user that something is not quite right with the system and may require human intervention.
OK, I'm happy with that (and vaguely recall that discussion now) - a very clear explanation, thanks. But we had better document all this, as it ain't exactly obvious. (We can pretty much cut-n-paste your previous comment!).
Undocumented features of task job polling:
6*PT5M
has exactly the same effect asPT5M
@matthewrmshin - can you a) confirm this behaviour; and b) confirm that it is intentional? (If so, we just need to document it)
Also, the logged health-check info seems to be misleading or meaningless in light of the above behaviour. E.g. for
execution time limit = PT10S
andexecution timeout polling intervals = 6*PT5S
:where PT40S is PT10S plus 6*PT5S. Misleading because the "execution timeout is really still PT10S (e.g. as translated to PBS wall time directive) and in any case (as explained above) polling will not actually stop at PT40S - it will keep on polling until the job polls as finished. So what does this PT40S represent exactly? (Or do we actually set the timeout event to go off at PT40S in this case?)