briancmpbll / home_assistant_custom_envoy

176 stars 76 forks source link

Data unavailable every 5-8 minutes and comes back #76

Open juetchris83 opened 1 year ago

juetchris83 commented 1 year ago

I have an issue with the data collection. It works for 5 minutes, and then it is not working for 5 to 8/10 minutes. image I reinstalled the custom integration via HACS. No changes. It is the same with the posixx integration.

I have no clue ...

catsmanac commented 1 year ago

Anything remarkable in the log file? Did you try enabling debug to see what it reports around these outages?

juetchris83 commented 1 year ago

I got errors like : 2023-05-11 07:28:55.150 ERROR (MainThread) [custom_components.enphase_envoy] Unexpected error fetching envoy Envoy 122224078067 data: 'NoneType' object has no attribute 'contents' Traceback (most recent call last): File "/usr/src/homeassistant/homeassistant/helpers/update_coordinator.py", line 258, in _async_refresh self.data = await self._async_update_data() File "/usr/src/homeassistant/homeassistant/helpers/update_coordinator.py", line 217, in _async_update_data return await self.update_method() File "/config/custom_components/enphase_envoy/init.py", line 49, in async_update_data await envoy_reader.getData() File "/config/custom_components/enphase_envoy/envoy_reader.py", line 378, in getData await self._update() File "/config/custom_components/enphase_envoy/envoy_reader.py", line 138, in _update await self._update_from_pc_endpoint() File "/config/custom_components/enphase_envoy/envoy_reader.py", line 148, in _update_from_pc_endpoint await self._update_endpoint( File "/config/custom_components/enphase_envoy/envoy_reader.py", line 173, in _update_endpoint response = await self._async_fetch_with_retry( File "/config/custom_components/enphase_envoy/envoy_reader.py", line 197, in _async_fetch_with_retry could_refresh_cookies = await self._refresh_token_cookies() File "/config/custom_components/enphase_envoy/envoy_reader.py", line 314, in _refresh_token_cookies token_validation = soup.find("h2").contents[0] AttributeError: 'NoneType' object has no attribute 'contents' 2023-05-11 07:27:54.707 ERROR (MainThread) [custom_components.enphase_envoy] Unexpected error fetching envoy Envoy 122224078067 data: 'NoneType' object has no attribute 'contents' Traceback (most recent call last): File "/usr/src/homeassistant/homeassistant/helpers/update_coordinator.py", line 258, in _async_refresh self.data = await self._async_update_data() File "/usr/src/homeassistant/homeassistant/helpers/update_coordinator.py", line 217, in _async_update_data return await self.update_method() File "/config/custom_components/enphase_envoy/init.py", line 49, in async_update_data await envoy_reader.getData() File "/config/custom_components/enphase_envoy/envoy_reader.py", line 388, in getData response = await self._async_fetch_with_retry( File "/config/custom_components/enphase_envoy/envoy_reader.py", line 197, in _async_fetch_with_retry could_refresh_cookies = await self._refresh_token_cookies() File "/config/custom_components/enphase_envoy/envoy_reader.py", line 314, in _refresh_token_cookies token_validation = soup.find("h2").contents[0] AttributeError: 'NoneType' object has no attribute 'contents'

catsmanac commented 1 year ago

Looks like the issue is with _refresh_token_cookies() not getting a response and not checking status. Looking at the code in envoy_reader.py there's _refresh_token_cookies which calls _async_fetch_with_retry which itself calls _refresh_token_cookies in case error 401 comes back. Wonder if such a loop gets started. If you enable debug on the envoy then you can trace all what is going in in sequence and that may shed some light on what's going on.

juetchris83 commented 1 year ago

When you say to enable envoy in debug mode, you mean in home assistant ? Or the envoy directly ? If so, I don't know how to do this.

catsmanac commented 1 year ago

You can do that in home assistant. In HA go to the integrations screen. In there click the 3 vertical dots menu button in the lower right corner of the envoy and select 'enable debug logging'. That would create debug log entries in the log files. When you later disable debugging (similar as enabling it) again it will download the log file to your computer. While debug is on you can also view current content from the log window or in the home-assistant.log file itself in the config folder.

juetchris83 commented 1 year ago

Ok so this is what I alréady done. I can put the file here, but I don't see anything explaining the issue. home-assistant_enphase_envoy_2023-05-15T14-16-51.395Z.log

catsmanac commented 1 year ago

Here's the section showing what is happening. HA is requesting for the production data and gets not authorized (401) back. Then it tries to authorize again with the envoy using the token and the check_jwt page. Then HA gets a 503 service temporary unavailable status back.

[..] Checking Token value: <token>
[..] Token is populated: <token>
[..] Token expires at: 2024-05-14 12:34:09
[..] HTTP GET Attempt #1: https://192.168.9.110/production.json: Header:{'Authorization': 'Bearer <token>
[..] Received 401 from Envoy; refreshing token, attempt 1 of 2
[..] HTTP GET Attempt #1: https://192.168.9.110/auth/check_jwt: Header:{'Authorization': 'Bearer <token>
[..] Fetched from https://192.168.9.110/auth/check_jwt: <Response [503 Service Temporarily Unavailable]>: <html>

This indicates the envoy has a crashed process or is just too busy to process the authorization request. Next section of code does not get the reply data and results in the errors you are seeing.

If you let debug on until communication resumes you probably will see that the reauthorization just works and data is collected until it happens again

Has it always been like this? Is there anything else talking to the envoy that may cause it to be busy?

catsmanac commented 1 year ago

Well in the midst of all my testing the other day, my production HA also experienced an 4 min outage of the envoy. It had 4 1 minute collection cycles failing in each it tried to read the data and then refresh the token cookies and then error-ed out. Then on 5th cycle all normally resumed. My log shows same as yours.

At the time of outage I was testing with my dev environment quite a bit and I suspect that impacted the envoy side.

2023-05-14 20:58:34.960 ERROR (MainThread) [custom_components.enphase_envoy] Unexpected error fetching envoy Envoy 122302045041 data: 'NoneType' object has no attribute 'contents'
Traceback (most recent call last):
  File "/usr/src/homeassistant/homeassistant/helpers/update_coordinator.py", line 250, in _async_refresh
    self.data = await self._async_update_data()
  File "/usr/src/homeassistant/homeassistant/helpers/update_coordinator.py", line 206, in _async_update_data
    return await self.update_method()
  File "/config/custom_components/enphase_envoy/__init__.py", line 49, in async_update_data
    await envoy_reader.getData()
  File "/config/custom_components/enphase_envoy/envoy_reader.py", line 388, in getData
    response = await self._async_fetch_with_retry(
  File "/config/custom_components/enphase_envoy/envoy_reader.py", line 197, in _async_fetch_with_retry
    could_refresh_cookies = await self._refresh_token_cookies()
  File "/config/custom_components/enphase_envoy/envoy_reader.py", line 314, in _refresh_token_cookies
    token_validation = soup.find("h2").contents[0]
AttributeError: 'NoneType' object has no attribute 'contents'

The error itself is caused by the fact the code has no error handling when refreshing the token cookies and not getting a proper reply.

The root cause I think is on the envoy side not responding.

BradleyCampbell commented 1 year ago

Move to a 2.4ghz only wireless network. Ever since I did that, my connection has been rock solid. When it was on the 5ghz, it would drop out several times a day.

catsmanac commented 1 year ago

I'm running wired ethernet which is solid as well. (as long as I don't use a second or third test instance of ha and even then outage are rare)

juetchris83 commented 1 year ago

The only change from my side is the update of home assistant and the change from the jrutsky to briancmpbll integration. I will do tests with a 2.4Ghz only network.

juetchris83 commented 1 year ago

Before, I only had a point every minutes. Maybe the new timing is too agressive. But I don't know how to change this.

catsmanac commented 1 year ago

The timing is hard coded in __init__.py. Look for:

SCAN_INTERVAL = timedelta(seconds=60)

Change the file and restart HA. I think there are some pages in Envoy that are slower, but it depends on your Envoy type if these are used or not. My ENVOY-S non metered doesn't use these.

juetchris83 commented 1 year ago

ok so I did some tests et none were conclusive. Wifi 2.4 only does not get the situation better. For info, I got another instance polling realtime values from the envoy with no probleme but it's a Jeedom VM, not a home assistant one (as I can't get real time values by any integration from HA.) reducing the polling is not getting the situation better. shutting down the jeedom poller is not an option as it is key in managing my power.

catsmanac commented 1 year ago

Found something similar when using a second HA. The primary one reports 503 service temporarily unavailable for 3-6 minutes and then resumes. The second one was polling at 60, 30 and 20 second during various tests. I assume your Jeedom is collecting even faster. My conclusion is that is seems the Envoy can't handle it. Almost as when collection moments are close together one breaks. Only option on the HA side I can see trying is closing the http client connection when 503 happens and restarting it again to reset the whole session. Currently the HA integration is just trying to fetch cookies from the envoy, but that hits the same 503 status for several tries.

As for your situation you might consider getting Envoy data from the Jeedom. Looks like there are some options for getting Jeedom data to HA.

Hoffmann77 commented 1 year ago

Hey @catsmanac you seem to have some experience with home-assistant. Do you know what would be the recommended way to deal with a http error like this? What should happen with a caught exception when resolving is not possible? Raise, return None values or return an error message?

Any input is appreciated.

catsmanac commented 1 year ago

Hey @catsmanac you seem to have some experience with home-assistant. Do you know what would be the recommended way to deal with a http error like this? What should happen with a caught exception when resolving is not possible? Raise, return None values or return an error message?

Any input is appreciated.

What I generally see (only experience with these Envoy integrations) is raise to signal HA this collection cycle didn't work. Current code does not really check what came back (except for 401). That causes situations like above, a stack dump at a spot that does not represent the actual error and it always clear what the root cause is. If all runs well no issue at all, and overall I think this is a fine integration but every now and then someone runs into an Envoy quirk which turns into a bit of a search.

Using something like below after getting the response still would cause the stack dump, but now at the right spot with right reason without enabling debug.

raise RuntimeError('Envoy Service temporary unavailable (503)')
Hoffmann77 commented 1 year ago

Thank you so much for your input @catsmanac. Really appreciated.