franc6 / ics_calendar

Provides an ICS (icalendar) platform for the Home Assistant calendar
Apache License 2.0

download bursts #129

Closed agroszer closed 3 months ago

agroszer commented 7 months ago

Hi,

I have 6 calendars defined in HA, all served by the same server at different URLs. Looking at the calendar server logs, HA hammers the server: it downloads the same calendar ICS 2-6 times within the same second. The download count appears to be specific to the calendar (e.g. no1 requests 3 times, no2 requests 4 times). The interval between the bursts is 15 minutes (despite the configured 7-8 minutes, but I'm not complaining about that). I have a bunch of automations checking the calendars' events.

Any idea how to debug this on the HA side, why the bursts get triggered?

franc6 commented 7 months ago

Do your automations run simultaneously? Is a single automation allowed to run multiple times, simultaneously? How do your automations check calendar events? It seems possible that those automations are causing HA to request an update of calendar events for the same calendar from multiple automations at the same time. There's no protection against that in my code, and I'm guessing HA doesn't protect against it either, since automations tend to be "special". If the answer to either of those first two questions is yes, I'll have to figure out how HA is running them simultaneously, in order to correctly protect the data-update calls (different techniques require different protections, and the sledge-hammer approach that would handle them all is a bad idea). If you answered no to both, then the answer to the third question becomes critical. :)

Regarding the interval, it's a limitation inside HA. As a general rule, HA will not request updates (from anything) more frequently than once every 15 minutes. There are some exceptions to that, and they're difficult to explain or document (and I doubt I've encountered them all). Setting a value of less than 15 minutes will allow those exceptions to trigger actual downloads of data. For most users, that causes more harm than good. I'm curious as to why you've set it more frequently, though. If you are triggering automations on calendar entries, and updating those entries every 7-8 minutes, it seems to me that whatever is updating the entries must itself be automated, and there's probably a better solution for your HA automations; e.g. you can trigger an automation using the HA API. If you're just trying to catch some last-minute calendar item to trigger an automation, well, there still might be a better solution. :)

agroszer commented 6 months ago

Do your automations run simultaneously?

I guess so; at least nothing stops them from doing so, and there are plenty of them querying the same calendar.

Some are time_pattern triggered, some are calendar event triggered.

Is a single automation allowed to run multiple times, simultaneously?

AFAICS not, all have mode: single

How do your automations check calendar events?

Besides the triggers some actions have

  - service: calendar.list_events
    data:
      duration:
        hours: "{{ states('input_number.room_poweron_hours') }}"
        minutes: 0
        seconds: 0
    response_variable: agenda
    target:
      entity_id: !input calendar

with some sort of duration

agroszer commented 6 months ago

Here is an excerpt from the server access log, uninteresting parts removed. mbX is the calendar name.

[06/Mar/2024:17:57:02 +0000] "GET mb8 HTTP/1.1" 200 1571
[06/Mar/2024:17:57:02 +0000] "GET mb8 HTTP/1.1" 200 1571
[06/Mar/2024:17:57:04 +0000] "GET mb5 HTTP/1.1" 200 1756
[06/Mar/2024:17:57:04 +0000] "GET mb5 HTTP/1.1" 200 1756
[06/Mar/2024:17:57:04 +0000] "GET mb5 HTTP/1.1" 200 1756
[06/Mar/2024:17:57:04 +0000] "GET mb5 HTTP/1.1" 200 1756
[06/Mar/2024:17:57:04 +0000] "GET mb1 HTTP/1.0" 200 1570
[06/Mar/2024:17:57:04 +0000] "GET mb1 HTTP/1.0" 200 1570
[06/Mar/2024:17:57:04 +0000] "GET mb1 HTTP/1.0" 200 1570
[06/Mar/2024:17:57:04 +0000] "GET mb1 HTTP/1.0" 200 1570
[06/Mar/2024:17:57:04 +0000] "GET mb1 HTTP/1.0" 200 1570
[06/Mar/2024:17:57:06 +0000] "GET mb6 HTTP/1.1" 200 1756
[06/Mar/2024:17:57:06 +0000] "GET mb6 HTTP/1.1" 429 0
[06/Mar/2024:17:57:06 +0000] "GET mb6 HTTP/1.1" 429 0
[06/Mar/2024:17:57:06 +0000] "GET mb6 HTTP/1.1" 429 0
[06/Mar/2024:18:05:01 +0000] "GET mb3 HTTP/1.1" 200 1570
[06/Mar/2024:18:05:01 +0000] "GET mb3 HTTP/1.1" 200 1570
[06/Mar/2024:18:05:01 +0000] "GET mb3 HTTP/1.1" 200 1570
[06/Mar/2024:18:05:02 +0000] "GET mb8 HTTP/1.1" 200 1571
[06/Mar/2024:18:05:11 +0000] "GET mb1 HTTP/1.0" 200 1570
[06/Mar/2024:18:05:12 +0000] "GET mb6 HTTP/1.1" 200 1756
[06/Mar/2024:18:06:01 +0000] "GET mb2 HTTP/1.0" 200 1942
[06/Mar/2024:18:06:01 +0000] "GET mb2 HTTP/1.0" 200 1942
[06/Mar/2024:18:06:01 +0000] "GET mb2 HTTP/1.0" 200 1942
[06/Mar/2024:18:12:01 +0000] "GET mb3 HTTP/1.1" 200 1570
[06/Mar/2024:18:12:04 +0000] "GET mb5 HTTP/1.1" 200 1756
[06/Mar/2024:18:12:04 +0000] "GET mb5 HTTP/1.1" 200 1756
[06/Mar/2024:18:12:04 +0000] "GET mb5 HTTP/1.1" 200 1756
[06/Mar/2024:18:12:04 +0000] "GET mb5 HTTP/1.1" 200 1756
[06/Mar/2024:18:15:02 +0000] "GET mb8 HTTP/1.1" 200 1571
[06/Mar/2024:18:15:02 +0000] "GET mb8 HTTP/1.1" 200 1571
[06/Mar/2024:18:15:02 +0000] "GET mb8 HTTP/1.1" 200 1571
[06/Mar/2024:18:15:02 +0000] "GET mb8 HTTP/1.1" 200 1571
[06/Mar/2024:18:18:01 +0000] "GET mb2 HTTP/1.0" 200 1942
[06/Mar/2024:18:18:01 +0000] "GET mb2 HTTP/1.0" 200 1942
[06/Mar/2024:18:18:01 +0000] "GET mb2 HTTP/1.0" 200 1942
[06/Mar/2024:18:20:01 +0000] "GET mb3 HTTP/1.1" 200 1570
[06/Mar/2024:18:20:01 +0000] "GET mb3 HTTP/1.1" 200 1570
[06/Mar/2024:18:20:01 +0000] "GET mb3 HTTP/1.1" 200 1570
[06/Mar/2024:18:21:11 +0000] "GET mb1 HTTP/1.0" 200 1570
[06/Mar/2024:18:21:12 +0000] "GET mb6 HTTP/1.1" 200 1756
[06/Mar/2024:18:24:02 +0000] "GET mb8 HTTP/1.1" 200 1571
[06/Mar/2024:18:24:02 +0000] "GET mb8 HTTP/1.1" 200 1571
[06/Mar/2024:18:27:01 +0000] "GET mb2 HTTP/1.0" 200 1942
[06/Mar/2024:18:27:01 +0000] "GET mb2 HTTP/1.0" 200 1942
[06/Mar/2024:18:27:01 +0000] "GET mb2 HTTP/1.0" 200 1942
[06/Mar/2024:18:27:01 +0000] "GET mb3 HTTP/1.1" 200 1570
[06/Mar/2024:18:27:04 +0000] "GET mb5 HTTP/1.1" 200 1756
[06/Mar/2024:18:27:04 +0000] "GET mb5 HTTP/1.1" 200 1756
[06/Mar/2024:18:27:04 +0000] "GET mb5 HTTP/1.1" 200 1756
[06/Mar/2024:18:27:04 +0000] "GET mb5 HTTP/1.1" 200 1756
[06/Mar/2024:18:33:02 +0000] "GET mb8 HTTP/1.1" 200 1571
[06/Mar/2024:18:33:02 +0000] "GET mb8 HTTP/1.1" 200 1571
[06/Mar/2024:18:35:01 +0000] "GET mb3 HTTP/1.1" 200 1570
[06/Mar/2024:18:35:01 +0000] "GET mb3 HTTP/1.1" 200 1570
[06/Mar/2024:18:35:01 +0000] "GET mb3 HTTP/1.1" 200 1570
[06/Mar/2024:18:35:11 +0000] "GET mb2 HTTP/1.0" 200 1942
[06/Mar/2024:18:37:11 +0000] "GET mb1 HTTP/1.0" 200 1570
[06/Mar/2024:18:37:12 +0000] "GET mb6 HTTP/1.1" 200 1756
[06/Mar/2024:18:42:01 +0000] "GET mb3 HTTP/1.1" 200 1570
[06/Mar/2024:18:42:02 +0000] "GET mb8 HTTP/1.1" 200 1571
[06/Mar/2024:18:42:02 +0000] "GET mb8 HTTP/1.1" 200 1571
[06/Mar/2024:18:42:04 +0000] "GET mb5 HTTP/1.1" 200 1756
[06/Mar/2024:18:42:04 +0000] "GET mb5 HTTP/1.1" 200 1756
[06/Mar/2024:18:42:04 +0000] "GET mb5 HTTP/1.1" 200 1756
[06/Mar/2024:18:42:04 +0000] "GET mb5 HTTP/1.1" 200 1756

Looking at the times (seconds) of the requests, this sort of suggests that some timer kicks off the requests, because the same calendar always seems to trigger at the same second. The 15-minute HASS limitation seems to play out most of the time.

I have the calendars defined like this:

ics_calendar:
  calendars:
    - name: "mb1"
      url: ...
      exclude: "['BLOCKED']"
      download_interval: 7
    - name: "mb2"
      url: ...
      exclude: "['BLOCKED']"
      download_interval: 8
    - name: "mb3"
      url: ...
      exclude: "['BLOCKED']"
      download_interval: 7
    - name: "mb5"
      url: ...
      exclude: "['BLOCKED']"
      download_interval: 8
    - name: "mb6"
      url: ...
      exclude: "['BLOCKED']"
      download_interval: 7
    - name: "mb8"
      url: ...
      exclude: "['BLOCKED']"
      download_interval: 8

franc6 commented 3 months ago

OK, I'm stumped. My best guess is that something with your automations is triggering (pun intended) a race condition in HA that's causing this. That is, multiple items are trying to trigger a check of the calendar at the same time, and so the code in HA that should prevent it from running too often doesn't quite work. I might be able to add my own lock here, but I'm not sure that it won't interfere.

agroszer commented 3 months ago

Sorry, got distracted from this. No idea when I can get back to it.

agroszer commented 3 months ago

Had some thoughts.

I don't think automations trigger calendar updates; I think (hope) there's some timer that triggers the update, but that's just naive thinking. The calendar events then trigger automations. Also, can you give me some pointers on how/where HA limits access? I could also take a look, though such concurrent async stuff is mind-boggling (/me is also a coder).

thank you

franc6 commented 3 months ago

No problem on the delay, I don't always have much time to work on this project, especially when it falls under the category of "it works for me". ;)

If you look at calendar.py, you'll see a decorator right before the update method is defined:

@Throttle(MIN_TIME_BETWEEN_UPDATES)

If I've understood the code behind that decorator, it essentially tries to lock a mutex, checks the last time the method was called, and if more than MIN_TIME_BETWEEN_UPDATES has elapsed, it unlocks the mutex and executes the code. If not, it doesn't run the code. That code wasn't written by me, and I can't recall at the moment if it's from HA or a third party. That description is also from memory, so take it with a grain of salt. Also, I don't recall if there's anything to guarantee that all instances of that mutex and of the variable tracking the time since the last run are singletons. If they aren't, then it's possible that's the problem. It seems unlikely anybody would write the code with such a gaping hole, though, which is why I was wondering about a race condition -- IIRC, when I looked, I wasn't sure that whatever it locks on is a real mutex, and not some kind of shared lock.
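The mechanics described above can be illustrated with a minimal throttle decorator. This is a sketch of the general pattern, not HA's actual `Throttle` implementation; the `min_interval` parameter and the non-blocking-acquire behavior are assumptions made for illustration:

```python
import threading
from datetime import datetime, timedelta


def throttle(min_interval: timedelta):
    """Skip calls that arrive sooner than min_interval after the last run."""
    def decorator(func):
        lock = threading.Lock()
        state = {"last_run": None}  # shared by every call through this wrapper

        def wrapper(*args, **kwargs):
            # Non-blocking acquire: a concurrent caller simply skips the update.
            if not lock.acquire(blocking=False):
                return None
            try:
                now = datetime.now()
                last = state["last_run"]
                if last is not None and now - last < min_interval:
                    return None  # called again too soon; do nothing
                state["last_run"] = now
                return func(*args, **kwargs)
            finally:
                lock.release()

        return wrapper
    return decorator


@throttle(timedelta(minutes=15))
def update():
    return "downloaded"
```

Note that the lock and timestamp live per decorated function, which is exactly the "singleton" question raised above: if each instance somehow got its own copy of that state, the throttle would not prevent concurrent downloads.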

Please feel free to investigate and even provide a fix. :)

agroszer commented 3 months ago

ohhhh, quick look before falling asleep... (thought experiments, without actual debugging)

The difference I see compared to core caldav is that you're doing

        event_list = []
        if await hass.async_add_executor_job(
            self._calendar_data.download_calendar
        ):

in async def async_get_events

which seems to actually perform the HTTP request, but of course, AFAICS, without going through the Throttle on update.

I enabled debug logging, and it looks to me like HA calls async_get_events wildly often. Probably from automations? I've got plenty of actions in there querying calendars. That sometimes (no idea exactly when) ends up downloading the calendar. update, on the other hand, seems to be called in a timely manner.
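The bypass can be demonstrated in isolation: if several coroutines call an unguarded blocking download through an executor, each one fires its own request. The names here (`download_calendar_unguarded`, the counter) are hypothetical stand-ins, not the integration's actual code:

```python
import asyncio

download_count = 0


def download_calendar_unguarded():
    """Stand-in for the blocking HTTP request: no lock, no throttle."""
    global download_count
    download_count += 1
    return "BEGIN:VCALENDAR"


async def async_get_events():
    # Mirrors hass.async_add_executor_job(self._calendar_data.download_calendar)
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, download_calendar_unguarded)


async def main():
    # Six automations querying the same calendar at the same moment
    await asyncio.gather(*(async_get_events() for _ in range(6)))


asyncio.run(main())
```

Every one of the six concurrent callers performs its own download, which matches the 2-6 requests per second seen in the access log.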

franc6 commented 3 months ago

Right, async_get_events will not go through the Throttle; I did try that once a long time ago, and I don't recall if HA crashed or if HA just hung for long periods of time. If that's getting called multiple times for the same calendar, that could be the problem. I think I see what's going on now. download_calendar will limit when the calendar is actually downloaded, but there's no locking in there. A good strategy might be to place a lock around checking the time, and download the data (and update _last_download) while holding the lock. It needs to keep the lock while downloading the data, so that any other simultaneous invocations will read the updated data, not the data from before the download, or worse, read the data while it's still being downloaded. I think that'll fix the problem, assuming I've guessed correctly based on what you've posted.
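The strategy described above, holding a single lock across both the timestamp check and the download, might look roughly like this. This is a simplified sketch, not the integration's real code: `download_lock`, `_fetch`, and the constructor signature are stand-ins for illustration:

```python
import threading
from datetime import datetime, timedelta


class CalendarData:
    # One lock shared by all instances, guarding both the timestamp
    # check and the download itself.
    download_lock = threading.Lock()

    def __init__(self, min_interval: timedelta):
        self._min_interval = min_interval
        self._last_download = None
        self._calendar_data = None

    def _fetch(self):
        """Stand-in for the real HTTP request."""
        return "BEGIN:VCALENDAR"

    def download_calendar(self) -> bool:
        """Download at most once per min_interval, even under concurrency."""
        with CalendarData.download_lock:
            now = datetime.now()
            if (self._last_download is not None
                    and now - self._last_download < self._min_interval):
                # Another caller refreshed the data moments ago; reuse it.
                return False
            self._last_download = now
            self._calendar_data = self._fetch()
            return True
```

Because the lock is held through the download, a concurrent caller blocks until the data is fresh, then hits the early return instead of firing a second request.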

agroszer commented 3 months ago

Here is an excerpt from the log: ics_log.txt

I have plenty of automations querying the same calendar, fired on a regular schedule.

The fix sounds good, I could try it before you release.

agroszer commented 3 months ago

Actually, a good question is where to place the lock: around all calendar downloads, or per calendar?

Wait-a-sec:

There is a with CalendarData.opener_lock: in _download_data. That should take care of globally locking the download and the setting of self._calendar_data ... but why is it not working?

franc6 commented 3 months ago

That's to protect against multiple threads changing the opener at the same time. I suspect that works. If your calendar files are small enough and the network connection fast enough, then it could download the file a few times in less than a second, which is the resolution of the logs you posted.

For a quick test to see if that's right, move the line with CalendarData.opener_lock: from _download_data to just after the line now = hanow() in download_calendar. Adjust the indentation (I hate python) so that everything in download_calendar up to return False is under the with CalendarData.opener_lock: block, and unindent the code in _download_data.

If that resolves the problem, then I can make a proper fix later, or if you want to clean it up (variable names -- opener_lock isn't protecting just the opener anymore -- unit test(s), etc.), please feel free to submit a PR. :) If you do submit a PR, please remember to run both the formatstyle.sh and test.sh scripts before committing.

agroszer commented 3 months ago

Hi,

Running now live with https://github.com/agroszer/ics_calendar, let's see what the log shows.

agroszer commented 3 months ago

Hi, looks like the changes work fine. Not a single failure since I installed them. I need to clean up and adjust the docs, then I can provide a PR.

franc6 commented 3 months ago

Awesome! I’m glad that works. Thanks for your huge amount of help!