BeatsuDev / tibber.py

An UNOFFICIAL Python wrapper for the Nordic electricity company Tibber's API.
MIT License

[FEATURE] Improve reconnect handling #35

Open · JoshuaDodds opened this issue 1 year ago

JoshuaDodds commented 1 year ago

When running live feed stats for more than a day (usually between 16 and 36 hours), I am seeing a state where no reconnect is possible and restarting the live feed is the only thing that fixes the issue shown in the logs below.

I am not sure if this is an issue with an underlying library (gql or another) or something that could be improved in tibber.py, but an ideal scenario would be that these sorts of frame and transport exceptions are handled and a full teardown and reconnect is done on the live feed thread.

Alternatively, if you could recommend the best way I could catch such errors and initiate a reconnect myself, that would also be helpful, but I feel it would improve tibber.py if such handling were built in.
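To make that concrete, the kind of external handling I have in mind would look roughly like the sketch below. connect_live_feed is a hypothetical placeholder for whatever builds the tibber client and blocks on home.start_live_feed(...), and the sketch assumes the frame/transport exceptions from the logs below eventually propagate out of that call, which may not be the case today:

import logging
import time

from gql.transport.exceptions import TransportClosed, TransportServerError
from websockets.exceptions import ConnectionClosedError


def run_forever(connect_live_feed, retry_interval=30):
    """connect_live_feed is a hypothetical callable that builds the tibber
    client and blocks on home.start_live_feed(...) until it raises."""
    while True:
        try:
            connect_live_feed()
        except (ConnectionClosedError, TransportClosed, TransportServerError) as exc:
            # Full teardown: the next call rebuilds the client and the feed from scratch.
            logging.warning("Live feed died (%r); reconnecting in %ss", exc, retry_interval)
            time.sleep(retry_interval)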

Here are some logs of the issue. In this scenario I have an extremely high retry count set with a 30-second retry interval:

2023-01-23 10:47:10 cerbomoticzGx: Backing off run_websocket_loop(...) for 0.1s (websockets.exceptions.ConnectionClosedError: no close frame received or sent)
2023-01-23 10:47:10 cerbomoticzGx: Retrying to subscribe with backoff. Running <bound method TibberHome.run_websocket_loop of <tibber.types.home.TibberHome object at 0xffff819234c0>> in 0.1 seconds after 1 tries.
2023-01-23 10:47:10 cerbomoticzGx: Updating home information to check if real time consumption is enabled.
2023-01-23 10:47:11 cerbomoticzGx: Subscribing to websocket.
2023-01-23 10:47:11 cerbomoticzGx: Backing off run_websocket_loop(...) for 1.7s (gql.transport.exceptions.TransportClosed: Transport is not connected)
2023-01-23 10:47:11 cerbomoticzGx: Retrying to subscribe with backoff. Running <bound method TibberHome.run_websocket_loop of <tibber.types.home.TibberHome object at 0xffff819234c0>> in 1.7 seconds after 2 tries.
2023-01-23 10:47:13 cerbomoticzGx: Updating home information to check if real time consumption is enabled.
2023-01-23 10:47:14 cerbomoticzGx: Backing off run_websocket_loop(...) for 2.0s (TypeError: catching classes that do not inherit from BaseException is not allowed)
2023-01-23 10:47:14 cerbomoticzGx: Retrying to subscribe with backoff. Running <bound method TibberHome.run_websocket_loop of <tibber.types.home.TibberHome object at 0xffff819234c0>> in 2.0 seconds after 3 tries.

(And many hours later, after continuing with the retries, still no success. The issue was immediately resolved by restarting the script which starts the live feed and imports tibber.py.)

2023-01-23 13:25:50 cerbomoticzGx: Updating home information to check if real time consumption is enabled.
2023-01-23 13:25:51 cerbomoticzGx: Subscribing to websocket.
2023-01-23 13:25:51 cerbomoticzGx: Backing off run_websocket_loop(...) for 84.8s (gql.transport.exceptions.TransportClosed: Transport is not connected)
2023-01-23 13:25:51 cerbomoticzGx: Retrying to subscribe with backoff. Running > in 84.8 seconds after 187 tries.
2023-01-23 13:27:16 cerbomoticzGx: Updating home information to check if real time consumption is enabled.
2023-01-23 13:27:16 cerbomoticzGx: Subscribing to websocket.
2023-01-23 13:27:17 cerbomoticzGx: Backing off run_websocket_loop(...) for 98.6s (gql.transport.exceptions.TransportClosed: Transport is not connected)
2023-01-23 13:27:17 cerbomoticzGx: Retrying to subscribe with backoff. Running > in 98.6 seconds after 188 tries.
JoshuaDodds commented 1 year ago

Just to add a little more context to this... Here is an example of it behaving more or less as expected, with one exception that I mention later. What we are looking at in these logs is intermittent outages of network connectivity or API endpoint availability, which occurred over an 18-hour period:

2023-01-23 15:26:14 cerbomoticzGx: Updating home information to check if real time consumption is enabled.
2023-01-23 15:26:15 cerbomoticzGx: Subscribing to websocket.
2023-01-23 15:26:15 cerbomoticzGx: Backing off run_websocket_loop(...) for 1.9s (gql.transport.exceptions.TransportClosed: Transport is not connected)
2023-01-23 15:26:15 cerbomoticzGx: Retrying to subscribe with backoff. Running <bound method TibberHome.run_websocket_loop of <tibber.types.home.TibberHome object at 0xffffa74ebe20>> in 1.9 seconds after 5 tries.
2023-01-23 15:26:16 cerbomoticzGx: Updating home information to check if real time consumption is enabled.
2023-01-23 15:26:17 cerbomoticzGx: Subscribing to websocket.
ssh: Could not resolve hostname github.com: Temporary failure in name resolution
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.
error: Could not fetch origin
2023-01-23 21:06:43 cerbomoticzGx: Backing off run_websocket_loop(...) for 7.4s (gql.transport.exceptions.TransportServerError: No pong received after 10 seconds)
2023-01-23 21:06:43 cerbomoticzGx: Retrying to subscribe with backoff. Running <bound method TibberHome.run_websocket_loop of <tibber.types.home.TibberHome object at 0xffffa74ebe20>> in 7.4 seconds after 6 tries.
2023-01-23 21:06:51 cerbomoticzGx: Updating home information to check if real time consumption is enabled.
2023-01-23 21:06:51 cerbomoticzGx: Subscribing to websocket.
2023-01-23 21:06:54 cerbomoticzGx: Backing off run_websocket_loop(...) for 8.9s (websockets.exceptions.ConnectionClosedError: sent 1000 (OK); no close frame received)
2023-01-23 21:06:54 cerbomoticzGx: Retrying to subscribe with backoff. Running <bound method TibberHome.run_websocket_loop of <tibber.types.home.TibberHome object at 0xffffa74ebe20>> in 8.9 seconds after 7 tries.
2023-01-23 21:07:03 cerbomoticzGx: Updating home information to check if real time consumption is enabled.
2023-01-23 21:07:04 cerbomoticzGx: Subscribing to websocket.
2023-01-23 21:07:04 cerbomoticzGx: Backing off run_websocket_loop(...) for 14.0s (gql.transport.exceptions.TransportClosed: Transport is not connected)
2023-01-23 21:07:04 cerbomoticzGx: Retrying to subscribe with backoff. Running <bound method TibberHome.run_websocket_loop of <tibber.types.home.TibberHome object at 0xffffa74ebe20>> in 14.0 seconds after 8 tries.
2023-01-23 21:07:18 cerbomoticzGx: Updating home information to check if real time consumption is enabled.
2023-01-23 21:07:18 cerbomoticzGx: Subscribing to websocket.

At 21:07 connectivity seems to be restored and the connection and subscription to the websocket are re-established. However, what I find a bit strange (notice that the same thing happens at around 15:26 earlier in the day, and that my simple gitOps controller also had issues reaching GitHub at the same time) is that the retry count does not reset between "incidents". This means that a retry count set to int(3), for example, would only allow 3 network/API outages over any given range of time, whether that is 3 days or 3 years.

Would it make for more predictable handling if, whenever the connection is successfully restored before the retry count is exceeded, the retry counter were reset to zero?
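To illustrate the proposal, a retry loop with that reset-on-success behaviour could look something like this sketch (names are illustrative only, not tibber.py's actual implementation):

import time


def run_live_feed(subscribe_and_block, max_retries=3, retry_interval=30):
    """subscribe_and_block is a placeholder callable that connects, subscribes and
    blocks until an established connection drops (returning normally), and raises
    if it never managed to connect at all. Only consecutive failures are counted,
    so max_retries bounds a single incident rather than the lifetime of the process."""
    failed_attempts = 0
    while failed_attempts < max_retries:
        try:
            subscribe_and_block()
            failed_attempts = 0  # connection was (re)established: reset the counter
        except Exception:
            failed_attempts += 1
            time.sleep(retry_interval)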

BeatsuDev commented 1 year ago

Hi, sorry for the late response. I've been a bit busy the past few days and will continue to be quite busy. I'll try to have a look as soon as I can 😄 Thanks a lot for bringing the issues up! It motivates me to keep improving tibber.py and working on it! Your feedback is very valuable 😁

BeatsuDev commented 1 year ago

I've tested a bit and I think I'm just going to remove the backoff library and create the retry mechanism myself. Hopefully, that should help! I've created a branch that I've attached to the issue. I'll probably have more time to look at this over the weekend. For now, I've added an "on_exception" argument to the Home.start_live_feed(on_exception=lambda e: ...) method, though I doubt it will be of much help. The on_exception callable is passed as the on_backoff handler explained here: https://github.com/litl/backoff#event-handlers.
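For reference, an on_backoff event handler in the backoff library works roughly like this generic example (following the linked README, not tibber.py's internals):

import backoff


def log_backoff(details):
    # backoff calls this with a details dict containing, among others,
    # "target", "tries" and "wait".
    print("Backing off {wait:0.1f}s after {tries} tries calling {target}".format(**details))


@backoff.on_exception(backoff.expo, Exception, max_tries=10, on_backoff=log_backoff)
def run_websocket_loop():
    ...  # connect and subscribe; any exception triggers the backoff retry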

JoshuaDodds commented 1 year ago

In case it's useful info: Just caught another failure to reconnect after what appears to be a small Tibber outage... at ~22:00:

2023-02-09 22:18:17 cerbomoticzGx: Timed out when executing a query. Check your connection to the Tibber API or the Tibber API status.
2023-02-09 22:22:45 cerbomoticzGx: Timed out when executing a query. Check your connection to the Tibber API or the Tibber API status.

Then at around 08:00 in the morning I checked and it still had not reconnected, and it had not complained with any more error messages either. I restarted the "service" and all was fine. I would expect it to keep retrying until Tibber was back up and then carry on, given the ridiculously large retry number in the parameters I invoke it with:

home.start_live_feed(user_agent="cerbomoticzgx/0.0.1", retries=1800, retry_interval=30, exit_condition=last_data_is_stale)

Perhaps this specific error message is helpful...

BeatsuDev commented 1 year ago

I've finally implemented a manual retry counter. Could you test this version out and see if it helps the issue of reconnecting? @JoshuaDodds

The update is on the feature/exception-handling branch which is attached to this issue.

Note that I haven't touched the retry mechanism for querying data from the Tibber API (tibber.update() or tibber.execute_query(...)). The last error messages you provided seem to be coming from there (found in tibber/networking/query_executor.py::QueryExecutor.execute_query).

JoshuaDodds commented 1 year ago

> I've finally implemented a manual retry counter. Could you test this version out and see if it helps the issue of reconnecting? @JoshuaDodds

Nice! Will test this soon and let you know... it's been really quite stable for more than a week now... Waiting for the next failure...

> The update is on the feature/exception-handling branch which is attached to this issue.

> Note that I haven't touched the retry mechanism for querying data from the Tibber API (tibber.update() or tibber.execute_query(...)). The last error messages you provided seem to be coming from there (found in tibber/networking/query_executor.py::QueryExecutor.execute_query).

You are referencing the error I posted from when the Tibber API was down for some hours? If so: I only got two of these messages in the logs and after that nothing, but it never reconnected when the API was back... Looking at the code you referenced, I see:

def execute_query(
        self, access_token: str, query: str, max_tries: int = 1, **kwargs
    ):
        """Executes a GraphQL query to the Tibber API.
        :param access_token: The Tibber API token to use for the request.
        :param query: The query to send to the Tibber API.
        :param max_tries: The amount of attempts before giving up. Set to None for infinite tries.

This seems to indicate that on a failed attempt to query the Tibber API it will retry once and then not again (with the default settings). That would also explain why I only received two messages in the logs and why it never reconnected, right?

BeatsuDev commented 1 year ago

Yes, that's almost true. I actually named the variable max_tries to reflect the number of attempts, not retries, so by default the execute_query() method only attempts to execute the query a single time. The two error messages are therefore two separate calls to execute_query().

I think the only other method in the tibber library that calls execute_query() is the tibber.update() method, so you should catch those errors by surrounding tibber.update() with a try/except block. If you don't use tibber.update() in your code, then the error messages are probably from the live feed loop, where the data gets updated on every retry to check if real time consumption is enabled (recommended by Tibber according to the docs/guide). If the latter is true for your case, you shouldn't need to worry, because the retry mechanism in the live feed loop will catch the exception and retry the whole process again (1. update data, 2. check if real time consumption is enabled, 3. connect to websocket, and so on). I should look into making this clearer in the log messages.
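A minimal sketch of that try/except wrapper, assuming account stands for whichever object exposes .update() in your code (the exact exception class raised on a query timeout isn't pinned down here, so a broad except is used):

import logging
import time


def safe_update(account, attempts=3, delay=30):
    """Retry account.update() a few times before giving up."""
    for attempt in range(1, attempts + 1):
        try:
            account.update()
            return True
        except Exception as exc:
            logging.warning("update() failed (attempt %d/%d): %s", attempt, attempts, exc)
            time.sleep(delay)
    return False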

Note that the QueryExecutor.execute_query() method (which is invoked by the tibber.update() method) is separate from the live feed websocket. It requests data from the Tibber API through GraphQL queries and uses aiohttp, not websockets.

I thought it would make sense for it to try only once by default, because normally when you request data from a web service you either get a successful response or an error. For websockets, on the other hand, it makes sense to retry during outages because it is a continuous connection that should be kept up.