custom-components / zaptec

zaptec charger custom component for home assistant

Appropriate retry mechanism when Zaptec cloud is not working #90

Closed sveinse closed 1 month ago

sveinse commented 10 months ago

Discussion: What is the appropriate action when Zaptec cloud is not responding or returns error?

Currently the Zaptec integration logs these errors to the HA system log. This can easily be interpreted as the Zaptec integration itself being at fault. Typical observed errors are timeouts and HTTP error code 500 (Internal Server Error). When this happens, quite many errors and warnings are logged.

The Home Assistant integration quality scale states that an integration should (for Silver):

> Handles device/service unavailable. Log a warning once when unavailable, log once when reconnected.

So what should we do? Do we need to implement some way to determine the Zaptec cloud's "healthiness" so that our integration can reduce the error logging?

svenakela commented 10 months ago

Could be solved by setting an attribute on the binary_sensor when an API call fails, and then logging the problem. When the API is working again, the timestamp can be cleared and a reconnect logged. That would also make it easy to visualize when the API is down; see the sketch below.

I've had some issues the last two nights with charging stopping several times, and I suspect the API. It could have been detected with an error timestamp.
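A minimal sketch of the idea (the class and method names here are invented for illustration, not the integration's actual code; it assumes Home Assistant's `BinarySensorEntity` base class):

```python
import logging
from datetime import datetime, timezone

from homeassistant.components.binary_sensor import BinarySensorEntity

_LOGGER = logging.getLogger(__name__)


class ZaptecApiHealthSensor(BinarySensorEntity):
    """Illustrative sensor: on = the last Zaptec cloud request succeeded."""

    _attr_name = "Zaptec API connected"

    def __init__(self) -> None:
        self._attr_is_on = True  # assume healthy until a request fails
        self._attr_extra_state_attributes = {"last_api_error": None}

    def record_api_result(self, success: bool, error: str | None = None) -> None:
        """Call after every API request; log only on state transitions."""
        if success and not self._attr_is_on:
            _LOGGER.info("Zaptec cloud API is responding again")
            self._attr_extra_state_attributes["last_api_error"] = None
        elif not success and self._attr_is_on:
            _LOGGER.warning("Zaptec cloud API request failed: %s", error)
            self._attr_extra_state_attributes["last_api_error"] = datetime.now(
                timezone.utc
            ).isoformat()
        self._attr_is_on = success
        self.async_write_ha_state()  # push the new state/attributes to HA
```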

Bluhme1 commented 9 months ago

Agree. The Zaptec API has been unstable during the last few days (error 500) and has produced a lot of errors in the log.

sveinse commented 9 months ago

Interestingly, there already exist entities for this: the "* Charger", "* Circuit" and "* Installation" entities. They exist to provide a place to store the legacy attributes and currently only read "Connected". But it would be a very useful extension if they also showed the status of the API connection. In hindsight it seems obvious to have this feature.

Definitely something to put on the list! Thanks.

Bluhme1 commented 9 months ago

Still getting timeouts like:

This error originated from a custom integration.

Logger: custom_components.zaptec.api
Source: custom_components/zaptec/api.py:855
Integration: Zaptec EV charger (documentation)
First occurred: 16.52.35 (1 occurrences)
Last logged: 16.52.35

Request to https://api.zaptec.com/api/installation/bf68a41d-d2f9-44f5-bfc6-20bf42e645a8 failed (attempt 1): TimeoutError:

Perhaps the Zaptec integration should be more patient before reporting an error. After a short while the connection is re-established.

Bluhme1 commented 6 months ago

Hi @sveinse

Did you get any further in your considerations on how to solve this issue? It is a pity that a very useful integration produces so many errors.

JTT22 commented 3 months ago

How do we get someone working on this? My logs are unusable...

Hellowlol commented 3 months ago

Do you have API debug on? Anyway, I think the proper way here is to retry using a library.

JTT22 commented 3 months ago

I mostly get timeout errors, or the error below, which is flooding the log:

custom_components.zaptec.api.RequestError: GET request to https://api.zaptec.com/api/circuits/KEYREMOVED failed with status 500: <ClientResponse(https://api.zaptec.com/api/circuits/KEYREMOVED) [500 Internal Server Error]> <CIMultiDictProxy('Date': 'Tue, 30 Jul 2024 08:32:55 GMT', 'Content-Length': '0', 'Connection': 'keep-alive', 'Request-Context': 'appId=cid-KEYREMOVED', 'Strict-Transport-Security': 'max-age=15724800; includeSubDomains')>

sveinse commented 3 months ago

I'm sorry, I haven't had time to look at this.

Currently the design logs every failed API request to the Zaptec cloud, so when the connection is unstable it results in many log entries. I agree that this should be changed. The HA quality scale requires for Silver that an integration "Handles device/service unavailable. Log a warning once when unavailable, log once when reconnected."
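That log-once pattern could look something like this minimal sketch (not the integration's actual code; the injected `poll` callable stands in for the real cloud request):

```python
import logging
from collections.abc import Awaitable, Callable

_LOGGER = logging.getLogger(__name__)


class AvailabilityLogger:
    """Logs one warning when polls start failing and one info on recovery."""

    def __init__(self, poll: Callable[[], Awaitable[None]]) -> None:
        self._poll = poll  # the actual cloud request, injected for the sketch
        self._available = True

    async def async_update(self) -> None:
        try:
            await self._poll()
        except Exception as err:
            if self._available:  # log only on the up -> down transition
                _LOGGER.warning("Zaptec cloud unavailable: %s", err)
            self._available = False
            return
        if not self._available:  # down -> up transition
            _LOGGER.info("Zaptec cloud reconnected")
        self._available = True
```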

sveinse commented 3 months ago

@Bluhme1, @JTT22 Thank you for your patience. I've merged a fix into master that should reduce the logging to a minimum. Can you please test the "master" version and see if this fixes it for you?

Hellowlol commented 3 months ago

The best way to deal with this is to add a retry on 500 errors, so you swallow the exception and just retry later. Can someone post the entire blueprint?

sveinse commented 3 months ago

@Hellowlol In my experience, when Zaptec cloud starts giving 500, it will do so for quite a while. Nevertheless, I created #111, which will retry up to 5 times before giving up. What do you think? If this makes sense I can merge it to master and have it tested by the users who are most prone to this problem.
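Roughly, the idea looks like this sketch (illustrative only, assuming `aiohttp`; the function name and details are mine, not the actual code in #111):

```python
import asyncio

import aiohttp

MAX_ATTEMPTS = 5


async def get_with_retry(session: aiohttp.ClientSession, url: str) -> dict:
    """GET `url`, retrying timeouts and HTTP 5xx up to MAX_ATTEMPTS times."""
    last_exc: Exception | None = None
    for attempt in range(1, MAX_ATTEMPTS + 1):
        if attempt > 1:
            await asyncio.sleep(1)  # fixed one-second pause for this sketch
        try:
            async with session.get(url) as resp:
                if resp.status >= 500:
                    # Server-side trouble: remember it and retry.
                    last_exc = RuntimeError(f"{url} returned {resp.status}")
                    continue
                resp.raise_for_status()  # 4xx errors are not worth retrying
                return await resp.json()
        except (aiohttp.ClientConnectionError, asyncio.TimeoutError) as exc:
            last_exc = exc  # transient network trouble: retry
    raise last_exc  # still failing after MAX_ATTEMPTS attempts
```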

The main objective of this issue was to control the extensive logging the API produced when the Zaptec cloud fails. We can't really do anything about the availability of the Zaptec cloud itself. With the fix in #110, the API no longer logs by itself, but leaves it to the HA UpdateCoordinator to handle available/not-available messages.
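The coordinator-based approach is roughly the following (a sketch, not the actual #110 code; `DataUpdateCoordinator` and `UpdateFailed` are the real HA helpers, while the client object is hypothetical):

```python
import logging
from datetime import timedelta

from homeassistant.helpers.update_coordinator import (
    DataUpdateCoordinator,
    UpdateFailed,
)

_LOGGER = logging.getLogger(__name__)


class ZaptecCoordinator(DataUpdateCoordinator):
    """Polls the Zaptec cloud and lets HA handle availability logging."""

    def __init__(self, hass, account) -> None:
        super().__init__(
            hass,
            _LOGGER,
            name="zaptec",
            update_interval=timedelta(seconds=60),  # the default poll cadence
        )
        self._account = account  # hypothetical Zaptec API client

    async def _async_update_data(self):
        try:
            return await self._account.poll()  # hypothetical full-poll call
        except Exception as err:
            # The coordinator logs one warning when this starts failing and
            # one message when it recovers; the API layer stays silent.
            raise UpdateFailed(f"Zaptec cloud request failed: {err}") from err
```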

svenakela commented 3 months ago

A nice thing to do on retries is an exponential delay until the next retry. Otherwise all clients will spam the API when it least needs a flood of requests. It can be done with the backoff decorator, for example, or by hand.
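For example, with the third-party `backoff` package (just one way to do it; the integration may implement the delay by hand instead):

```python
import asyncio

import aiohttp
import backoff


@backoff.on_exception(
    backoff.expo,  # exponential waits: ~1 s, 2 s, 4 s, ...
    (aiohttp.ClientError, asyncio.TimeoutError),
    max_tries=5,  # then give up and re-raise the last exception
    jitter=backoff.full_jitter,  # randomize waits so clients don't sync up
)
async def fetch_json(session: aiohttp.ClientSession, url: str) -> dict:
    """GET `url` with exponential backoff on errors and timeouts."""
    async with session.get(url) as resp:
        resp.raise_for_status()
        return await resp.json()
```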

sveinse commented 3 months ago

Yes, that is a good idea. However, there are limits to how long these backoffs may be. A full poll from the HA Zaptec integration to the Zaptec cloud requires quite a few requests. By default this happens every 60 seconds, and any backoff that introduces delays of minutes will probably interfere with HA's update interval. I'm not sure how the HA UpdateCoordinator responds to that.
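To make the concern concrete, a quick back-of-the-envelope sketch (the numbers are illustrative): uncapped exponential delays quickly exceed the 60-second poll interval, so the per-wait delay (or the total retry budget) needs a cap.

```python
UPDATE_INTERVAL = 60  # seconds between full polls in HA


def total_retry_time(max_tries: int, base: float = 1.0, cap: float | None = None) -> float:
    """Total time spent waiting between retries (base, 2*base, 4*base, ...)."""
    waits = [base * 2**n for n in range(max_tries - 1)]
    if cap is not None:
        waits = [min(w, cap) for w in waits]
    return sum(waits)


print(total_retry_time(5))           # 1+2+4+8 = 15 s: fits within one poll
print(total_retry_time(8))           # 127 s: overruns two update intervals
print(total_retry_time(8, cap=10))   # 1+2+4+8+10+10+10 = 45 s: bounded again
```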

sveinse commented 1 month ago

The PR has now been updated with an exponential backoff mechanism.

sveinse commented 1 month ago

A fix for the issue has been pushed to master. Please install the "master" version and test whether it fixes the observed problems. @svenakela @Bluhme1 @JTT22

sveinse commented 1 month ago

It seems HACS has removed the option of downloading "master", so I've created a beta pre-release "0.7.2b1" that is available for download and testing.