Open nhoening opened 6 years ago
We could also simply extend the documentation to explain what is meant by "frequency" (the frequency that the pinger regards as healthy).
But that is a bit forced; the nicest way would be to give it a more descriptive name, like "complain-after". However, I realize this would probably mean updating some production systems over at Softwear.
I am running into a problem that might stem from a semantic misconception about the `frequency` attribute, which the pinger expects from the task information URL. It seems to me that with `frequency`, the pinger is told that the monitored task runs every x minutes (in our case, ten).
However, once in a while, the pinger checks and finds that the last task run was recorded ten minutes plus a few seconds ago. It then complains that the task is outside of the acceptable range. The few seconds are probably network latency, or one forecasting job batch taking a bit longer than the batch ten minutes earlier did.
Example log entry:
2018-09-07 20:01:52,624 ERROR Error: BVP/staging is outside of the acceptable 10 minute range. Last Run 2018-09-07 19:51:16.359826+00:00 UTC with status OK
I think if the task runs every x minutes, the pinger must allow x + y minutes before it can safely conclude that there really is an out-of-range problem worth reporting.
I set the frequency in the pinger conf to 15 minutes now in our environment.
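To make the overshoot concrete, here is a small sketch that computes the elapsed time from the example log entry and compares it against the 10-minute and 15-minute settings. The variable names are made up for illustration, and treating the log line's own timestamp as UTC is an assumption (it carries no timezone).

```python
from datetime import datetime, timedelta, timezone

# Timestamps copied from the example log entry. The log line's own
# timestamp has no timezone; treating it as UTC is an assumption.
checked_at = datetime(2018, 9, 7, 20, 1, 52, 624000, tzinfo=timezone.utc)
last_run = datetime(2018, 9, 7, 19, 51, 16, 359826, tzinfo=timezone.utc)

# Time between the last recorded run and the pinger's check.
elapsed = checked_at - last_run

# Compare against the original 10-minute setting and the 15-minute workaround.
exceeds_10 = elapsed > timedelta(minutes=10)
exceeds_15 = elapsed > timedelta(minutes=15)

print(elapsed, exceeds_10, exceeds_15)
```

Under that assumption, the run is past the 10-minute limit but well within 15 minutes.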
Effectively, I propose improving the semantics of the `frequency` setting, which should improve the pinger overall. Either its name changes, say to `ping-frequency` (together with adapted documentation), or the pinger allows for ten or twenty percent extra margin (e.g. `frequency` * 1.2).
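The margin variant could look roughly like the sketch below. This is not the pinger's actual code: `is_overdue` and its parameters are hypothetical names, and the default `margin=0.2` simply mirrors the `frequency` * 1.2 suggestion.

```python
from datetime import datetime, timedelta, timezone


def is_overdue(last_run, now, frequency_minutes, margin=0.2):
    """Return True if the last run is older than frequency plus a margin.

    Hypothetical helper: margin=0.2 allows 20 percent on top of the
    configured frequency before the task counts as out of range.
    """
    allowed = timedelta(minutes=frequency_minutes * (1 + margin))
    return now - last_run > allowed


# Values from the example log entry (log timestamp assumed to be UTC).
now = datetime(2018, 9, 7, 20, 1, 52, 624000, tzinfo=timezone.utc)
last_run = datetime(2018, 9, 7, 19, 51, 16, 359826, tzinfo=timezone.utc)

with_margin = is_overdue(last_run, now, frequency_minutes=10)
without_margin = is_overdue(last_run, now, frequency_minutes=10, margin=0)
print(with_margin, without_margin)
```

With a 20 percent margin the logged run would not have triggered an error, while the strict check (margin of zero) does.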