MetPX / sarracenia

https://MetPX.github.io/sarracenia
GNU General Public License v2.0
44 stars 22 forks source link

Scheduled flows accumulate sleep overhead between long appointment delta #1214

Open andreleblanc11 opened 1 week ago

andreleblanc11 commented 1 week ago

A scheduled flow will sometimes miss appointments when there are big gaps between appointments.

# In a config
scheduled_hour 19,20
scheduled_minute 0
# In the logs
ddsr-cmc-stage01: 2024-09-06 18:55:38,766 [INFO] sarracenia.flowcb.scheduled wait_until appointment at: 2024-09-06 19:00:00+00:00, need to wait: 0:04:21.233847)
ddsr-cmc-stage01: 2024-09-06 19:00:49,033 [INFO] sarracenia.flowcb.scheduled wait_until_next sleep interrupted, returning for housekeeping.
ddsr-cmc-stage01: 2024-09-06 19:00:49,034 [INFO] sarracenia.flow _runHousekeeping on_housekeeping pid: 2150 flow/scheduled_albertaforestry_pm instance: 1
ddsr-cmc-stage01: 2024-09-06 19:00:49,036 [INFO] sarracenia.flowcb.housekeeping.resources on_housekeeping Current cpu_times: user=1.94 system=1.5
ddsr-cmc-stage01: 2024-09-06 19:00:49,036 [INFO] sarracenia.flowcb.housekeeping.resources on_housekeeping Current mem usage: 135.6 MiB, accumulating count (2 or 2/100 so far) before self-setting threshold
ddsr-cmc-stage01: 2024-09-06 19:00:49,036 [INFO] sarracenia.flowcb.log stats version: 3.00.54rc3, started: a day ago, last_housekeeping: 310.3 seconds ago
ddsr-cmc-stage01: 2024-09-06 19:00:49,036 [INFO] sarracenia.flowcb.log stats messages received: 0, accepted: 0, rejected: 0   rate accepted: 0.0% or 0.0 m/s
ddsr-cmc-stage01: 2024-09-06 19:00:49,036 [INFO] sarracenia.flowcb.log stats files transferred: 0 bytes: 0 Bytes rate: 0 Bytes/sec
ddsr-cmc-stage01: 2024-09-06 19:00:49,036 [INFO] sarracenia.flow metricsFlowReset looking for old metrics for /local/home/sarra/.cache/sr3/metrics/flow_scheduled_albertaforestry_pm_01.json
ddsr-cmc-stage01: 2024-09-06 19:00:49,041 [INFO] sarracenia.flowcb.scheduled wait_until_next already too late to 2024-09-06 19:00:00+00:00 skipping
ddsr-cmc-stage01: 2024-09-06 19:00:49,041 [INFO] sarracenia.flowcb.scheduled wait_until appointment at: 2024-09-06 20:00:00+00:00, need to wait: 0:59:10.958606)
ddsr-cmc-stage01: 2024-09-06 19:01:19,065 [INFO] sarracenia.flowcb.scheduled wait_until_next ok 0 appointments left today
ddsr-cmc-stage01: 2024-09-06 19:01:19,065 [INFO] sarracenia.flowcb.scheduled gather time to run
ddsr-cmc-stage01: 2024-09-06 19:01:19,066 [INFO] sarracenia.flowcb.log after_accept accepted: (lag: 0.00 ) https://wildfire.alberta.ca /files/pmwx.csv
ddsr-cmc-stage01: 2024-09-06 19:01:19,069 [INFO] sarracenia.flowcb.log after_post posted to exchange: xs_PROVINCIAL topic: v02.post.files a file with baseUrl: https://wildfire.alberta.ca relPath: files/pmwx.c
sv
ddsr-cmc-stage01: 2024-09-06 19:01:19,074 [INFO] sarracenia.flowcb.scheduled update_appointments for 2024-09-07 00:00:00+00:00: ["2024-09-07 19:00:00+00:00", "2024-09-07 20:00:00+00:00"]
ddsr-cmc-stage01: 2024-09-06 19:01:19,074 [INFO] sarracenia.flowcb.scheduled wait_until appointment at: 2024-09-07 19:00:00+00:00, need to wait: 23:58:40.925420)
ddsr-cmc-stage01: 2024-09-06 19:06:29,317 [INFO] sarracenia.flowcb.scheduled wait_until_next sleep interrupted, returning for housekeeping.

I tried looking at the code to find if there was an area where the sleep interval wasn't being properly adjusted, but couldn't find anything significant.

We already have logic to adjust the total sleep time for the housekeeping interval duration. We also update the sleep time at every call of wait_until which should take into account the sleep overhead accounted in the while loop of wait_until_seconds

https://github.com/MetPX/sarracenia/blob/6f25ee6b3e6872c543232e56e3a4f3403f5df30b/sarracenia/flowcb/scheduled/__init__.py#L165-L170

Regardless, @petersilva and I agreed it would be good that the code checks for a 1 minute buffer before and after the appointment time, so that we don't skip apointments for a little bit of overhead.This should cover most cases where there is some overhead.

That would likely mean changing line 188 https://github.com/MetPX/sarracenia/blob/6f25ee6b3e6872c543232e56e3a4f3403f5df30b/sarracenia/flowcb/scheduled/__init__.py#L185-L190

reidsunderland commented 6 days ago

I refactored the scheduling to not do any sleep inside the scheduled class. It relies on the sleep in flow/__init__.py, which simplifies the scheduling because it doesn't need to deal with stopping or housekeeping.

@petersilva , @MagikEh , @andreleblanc11 and I discussed on Teams whether the scheduled next time to gather should be calculated from the current time when the gather actually runs + scheduled_interval, or from the previously scheduled time + scheduled_interval.

My example:

example to demonstrate my question: 
scheduled interval is 30s

calculating next using current time:
we run at:
00:00:02 (next scheduled for 00:00:32)
00:00:34 (next scheduled for 00:01:04)
00:01:05 (next scheduled for 00:01:35)
00:01:38 (next scheduled for 00:02:08)
--> we always poll the remote server >= 30 seconds apart. 

or calculating next using the last scheduled time:
we run at:
00:00:02 (next scheduled for 00:00:32)
00:00:34 (next scheduled for 00:01:02)
00:01:03 (next scheduled for 00:01:32)
00:01:34 (next scheduled for 00:02:02)

--> we always poll the remote server <= 30 seconds apart.

Would we rather overshoot or undershoot our interval?

@MagikEh brought up good points about what would happen if polling (or other work) takes longer than the scheduled_interval. If every poll takes longer than the scheduled interval, the next poll would always happen immediately.

If the poll takes longer than the scheduled interval occasionally, then using the last scheduled time to calculate the next gather time means that every subsequent gather after the long one would happen immediately, until the lost time is made up.

Using the current time for scheduling the next gather, the next scheduled gather after the long poll would happen immediately, but then the next gather after that would be scheduled for a future time. This gives a bit of a cool down period.


@petersilva said "An argument against precise intervals is it tends to synchronize work... that we don't want synchronized... if you have 150 polls... is there a good reason for them all to start polling at exactly the same time? It's not synchronized swimming."

Calculating the next scheduled time from the last scheduled time would cause a lot of synchronization, leading to load spikes at common intervals (30s, 1m, 15m, 30m, etc.), so it seems like calculating based on the current gather time is the better option for us.

petersilva commented 6 days ago

fwiw, what james wants already works:


scheduled_minute 2,17,32,57

will run pretty close to every fifteen minutes starting at exactly those times. This is new in 3.0.55, because prior to that you would have had to also specify scheduled_hour 0,1,2,3... (all of them.) which was a bit ugly, now it just defaults to all of them if omitted.

reidsunderland commented 1 day ago

I re-read the original description from Andre, and I think the problem was partially caused by the housekeeping needing to run at roughly the same time as the appointment. It would sleep until housekeeping was required, which coincided with when the next appointment was scheduled. So it would run housekeeping, but by the time that finished and the scheduled gather was called again, the appointment was already in the past.

But I added likely fixed because we merged #1223, that refactors the scheduling so all sleeping happens in the main flow run loop. Now the scheduled code doesn't need to worry about housekeeping at all. And if it is too late to the next scheduled appointment, it will still do the poll/gather and only "future" appointments might get skipped if they're already in the past.

petersilva commented 17 hours ago

When @andreleblanc11 gets back, he can try to reproduce the issue he had with the new version. I agree that it is likely resolved... I'm just not sure.