elad-bar / ha-edgeos

Integration with EdgeOS (Ubiquiti)
133 stars 24 forks source link

daily disconnects #79

Closed MagieMalone closed 1 year ago

MagieMalone commented 2 years ago

I get daily disconnects , reload integration fixes the issue.

running on v2.0.12

This error originated from a custom integration.

Logger: custom_components.edgeos.component.api.websocket
Source: custom_components/edgeos/component/api/websocket.py:133
Integration: Ubiquiti EdgeOS Routers (documentation, issues)
First occurred: October 18, 2022 at 19:21:02 (1 occurrences)
Last logged: October 18, 2022 at 19:21:02

Failed to connect WS, Error: 503, message='Invalid response status', url=URL('wss://192.168.3.1/ws/stats'), Line: 109
October 19, 2022
ROUTER Interface ETH0 Status was connected
08:46:50 - 14 minutes ago
ROUTER Interface ETH0 Status was connected
08:34:09 - 27 minutes ago
ROUTER Interface ETH0 Status became unavailable
08:33:59 - 27 minutes ago
October 18, 2022
ROUTER Interface ETH0 Status was connected
07:21:45 - Yesterday
ROUTER Interface ETH0 Status became unavailable
07:21:35 - Yesterday
October 17, 2022
ROUTER Interface ETH0 Status was connected
07:55:17 - 2 days ago

at the same time, ping integration to google dns keeps running during the events.

I'll add debug as well to see what causes this.

elad-bar commented 2 years ago

this happens due to service disconnections by the router, the component connectivity recovers from it after 30 seconds, in old versions of component, that warning message was not written unless it failed 3 times in row, now it logs every time it happens.

it also happened for me twice over the last week, again, this is not related to the component (can happen also with the UI of EdgeOS, but again, it recovers from it as well)

elad-bar commented 2 years ago

btw, you don't need to reload as this is part of the recovery mechanism

MagieMalone commented 2 years ago

everythings stays disconnected, until I reload. for instance eth0 graph stays at 0 for hours. so this is not the only sy,mptom of the issue I face.

MagieMalone commented 2 years ago
image
elad-bar commented 2 years ago

please provide debug log level for that, I will need to understand why it's not recovering (without additional warnings)

MagieMalone commented 2 years ago

I will, once it happens again, so I expect tomorrow.

elad-bar commented 2 years ago

thanks

elad-bar commented 2 years ago

@poonab mentioned there is a log message of Update in progress, will skip the request which can cause it, please let me know if it happens to you as well, it can be a great lead

elad-bar commented 2 years ago

@poonab, are there any log messages with the following pattern before / after it happens (under the assumption log level is set to INFO / DEBUG): WS Status changed to , API Status: ?

elad-bar commented 2 years ago

released new version with improved connectivity, set the message to e displayed only when the connection was not created, in case of disconnection, based on the error, in addition, fixed the ping message which could of cause it since it was not working.

MagieMalone commented 2 years ago
image

data is missing again from the graph but in the log I see it is still being collected. Btw, the state in Developer tools is also not changing since from around 0:00, however not much around that time that would indicate an issue.

some logs: https://pastebin.com/1AAKNK1Y

struggling to paste bigger log, as debug creates a TON :-)

elad-bar commented 2 years ago

the log you have sent is from 07:43:02, i cannot understand anything with log of one second :) please upload the entire log begins before it stopped working

elad-bar commented 2 years ago

in addition, I added 3 new sensors of messages: received, ignored and error, can you please check the received messages sensor's graph?

thanks

MagieMalone commented 2 years ago

log file:

https://wetransfer.com/downloads/24ee030959579cfb47bb74f7263d9ae220221020061726/7d89821307cdde029044ff7737bd339820221020061758/b7b95a

MagieMalone commented 2 years ago

I don't have those new sensors yet, as I still ran on 0.12 when the issue occured. I will update to it now.

elad-bar commented 2 years ago

I will try getting to it by evening, with the latest version (v2.0.13) - enough to set the logs for info and above to understnad the issue

MagieMalone commented 2 years ago

Happened again last night at around 0:00, currently on 2.0.14 home-assistant.log.gz

elad-bar commented 2 years ago

Is it happening at the same time every night?

MagieMalone commented 2 years ago

the last 2 nights, yes, at midnight, and the night before at 19:20

poonab commented 2 years ago

Mine also failed again @ 21:54:40, not happening at the same time. Errors on HA are

This error originated from a custom integration.

Logger: homeassistant
Source: custom_components/edgeos/component/api/websocket.py:158
Integration: Ubiquiti EdgeOS Routers (documentation, issues)
First occurred: 22 October 2022 at 21:54:40 (1101 occurrences)
Last logged: 13:11:36

Error doing job: Task exception was never retrieved
Traceback (most recent call last):
  File "/config/custom_components/edgeos/component/managers/home_assistant.py", line 83, in async_send_heartbeat
    await self.ws.async_send_heartbeat()
  File "/config/custom_components/edgeos/component/api/websocket.py", line 158, in async_send_heartbeat
    await self._ws.send_str(data)
  File "/usr/local/lib/python3.10/site-packages/aiohttp/client_ws.py", line 151, in send_str
    await self._writer.send(data, binary=False, compress=compress)
  File "/usr/local/lib/python3.10/site-packages/aiohttp/http_websocket.py", line 690, in send
    await self._send_frame(message, WSMsgType.TEXT, compress)
  File "/usr/local/lib/python3.10/site-packages/aiohttp/http_websocket.py", line 601, in _send_frame
    raise ConnectionResetError("Cannot write to closing transport")
ConnectionResetError: Cannot write to closing transport
This error originated from a custom integration.

Logger: custom_components.edgeos.core.managers.home_assistant
Source: custom_components/edgeos/core/managers/home_assistant.py:244
Integration: Ubiquiti EdgeOS Routers (documentation, issues)
First occurred: 20 October 2022 at 04:12:41 (55 occurrences)
Last logged: 01:17:16

Update in progress, will skip the request

Skipping through the logs I cant see anything weird included but at 21:54 all the data that continues to be logged going forward is no longer updating the values

elad-bar commented 2 years ago

That is not the same error is reported in this issue, another issue, saw it as well and this one already has a fix

elad-bar commented 2 years ago

please check v2.0.15, should improve dramtically the stability, for me 3 days without a single warning / error

thanks

MagieMalone commented 2 years ago

I upgraded to v2.0.16 yesterday, happened again last night. INFO level log doesn't show much, just that after 1s. past midnight no values change anymore. 192.168.3.9-20221029-095831.log

elad-bar commented 1 year ago

According to the log you have sent, there are updates up until the last log line: 2022-10-29 00:00:03.101 INFO (SyncWorker_30) [custom_components.edgeos.core.managers.entity_manager] ROUTER Interface ETH0 Received Traffic (sensor) updated, state: 527423484.475 | updated

in addition, i don't see a single disconnection within that log, please use debug log level to get more details for the same period, the fact it happens on dialy basis around the same time is weird, happens to me in the past because lots of load on the DNS server I ran (pi-hole), could it be that you are using HASS.IO / AdGuard / PiHole?

thanks

MagieMalone commented 1 year ago

I enabled debug yesterday, included the log from 23:58-00:02 192.168.3.9-20221030-102935.zip

I run pi-hole but not on HA, it's on a seperate VM.

elad-bar commented 1 year ago

Will review the log file later today, thanks.

Additional direction, you might have the same issue I had around a year ago when the pihole was overloaded at specific time which caused many unexplained issues related to networking, Please check how to increase the number of requests in parallel to the pihole.

MagieMalone commented 1 year ago

I added some monitoring to esxi yesterday, it seems it's not busy on the cpu's at the time:

image

also not on HA:

image

though it is currious to see that it drop cpu cycles right at that time...

elad-bar commented 1 year ago

is the second screen shot is from the component of EdgeOS CPU?

thanks

MagieMalone commented 1 year ago

no, overal HA processor usage. I'm putting an snmp probe in for the edgeos as we speak. and I'm automating the reloading for now at 00:30.

elad-bar commented 1 year ago

Can you share the screen shot of the EdgeOS CPU / Memory sensors?

Thanks

MagieMalone commented 1 year ago
image image
MagieMalone commented 1 year ago
image

the busy cpu's on esxi don't happen until 00:45 and apparently my automation is not working (yet) :-)

MagieMalone commented 1 year ago

I do get some errors in the log now:

Logger: homeassistant
Source: custom_components/edgeos/component/api/websocket.py:160
Integration: Ubiquiti EdgeOS Routers ([documentation](https://github.com/elad-bar/ha-edgeos), [issues](https://github.com/elad-bar/ha-edgeos/issues))
First occurred: 00:00:48 (491 occurrences)
Last logged: 06:49:09

Error doing job: Task exception was never retrieved
Traceback (most recent call last):
  File "/config/custom_components/edgeos/component/managers/home_assistant.py", line 88, in async_send_heartbeat
    await self.ws.async_send_heartbeat()
  File "/config/custom_components/edgeos/component/api/websocket.py", line 160, in async_send_heartbeat
    await self._ws.send_str(data)
  File "/usr/local/lib/python3.10/site-packages/aiohttp/client_ws.py", line 151, in send_str
    await self._writer.send(data, binary=False, compress=compress)
  File "/usr/local/lib/python3.10/site-packages/aiohttp/http_websocket.py", line 690, in send
    await self._send_frame(message, WSMsgType.TEXT, compress)
  File "/usr/local/lib/python3.10/site-packages/aiohttp/http_websocket.py", line 601, in _send_frame
    raise ConnectionResetError("Cannot write to closing transport")
ConnectionResetError: Cannot write to closing transport
elad-bar commented 1 year ago

If the issue resulted by the error above that's good, add validation for that error and reconnect at that point, it's available in v2.0.20, please let me know if it solved the issue.

thanks

MagieMalone commented 1 year ago

it seems the issue did not occur last night while I upgraded to v2.0.20, saw this in the log:

2022-11-02 00:00:00.667 DEBUG (MainThread) [custom_components.edgeos.component.api.websocket] Handle interfaces data
2022-11-02 00:00:01.207 INFO (MainThread) [custom_components.edgeos.component.api.websocket] Message queue is empty, will try to resample in a second
2022-11-02 00:00:01.408 DEBUG (MainThread) [custom_components.edgeos.component.api.websocket] Keep alive message sent
2022-11-02 00:00:01.408 DEBUG (MainThread) [custom_components.edgeos.component.api.websocket] Keep alive data to be sent: 69{"CLIENT_PING": "", "SESSION_ID": "1c7a652b34a44b1fbc622f296284dc88"}
2022-11-02 00:00:01.408 WARNING (MainThread) [custom_components.edgeos.component.api.websocket] Connection reset error raise by the client, will try to reconnect soon, Error: Cannot write to closing transport
2022-11-02 00:00:01.411 INFO (MainThread) [custom_components.edgeos.component.managers.home_assistant] WS Status changed to Failed, API Status: Connected
2022-11-02 00:00:01.561 DEBUG (SyncWorker_22) [custom_components.edgeos.component.managers.home_assistant] Loading entities
2022-11-02 00:00:02.211 INFO (MainThread) [custom_components.edgeos.component.api.websocket] Stop listening
2022-11-02 00:00:02.211 INFO (MainThread) [custom_components.edgeos.component.managers.home_assistant] WS Status changed to NotConnected, API Status: Connected
2022-11-02 00:00:02.562 DEBUG (SyncWorker_18) [custom_components.edgeos.component.managers.home_assistant] Loading entities
2022-11-02 00:00:03.562 DEBUG (SyncWorker_19) [custom_components.edgeos.component.managers.home_assistant] Loading entities
2022-11-02 00:00:04.563 DEBUG (SyncWorker_46) [custom_components.edgeos.component.managers.home_assistant] Loading entities
2022-11-02 00:00:05.564 DEBUG (SyncWorker_28) [custom_components.edgeos.component.managers.home_assistant] Loading entities
2022-11-02 00:00:06.566 DEBUG (SyncWorker_9) [custom_components.edgeos.component.managers.home_assistant] Loading entities
2022-11-02 00:00:07.567 DEBUG (SyncWorker_26) [custom_components.edgeos.component.managers.home_assistant] Loading entities
2022-11-02 00:00:08.568 DEBUG (SyncWorker_4) [custom_components.edgeos.component.managers.home_assistant] Loading entities
2022-11-02 00:00:09.568 DEBUG (SyncWorker_6) [custom_components.edgeos.component.managers.home_assistant] Loading entities
2022-11-02 00:00:10.569 DEBUG (SyncWorker_45) [custom_components.edgeos.component.managers.home_assistant] Loading entities
2022-11-02 00:00:11.569 DEBUG (SyncWorker_0) [custom_components.edgeos.component.managers.home_assistant] Loading entities
2022-11-02 00:00:12.570 DEBUG (SyncWorker_7) [custom_components.edgeos.component.managers.home_assistant] Loading entities
2022-11-02 00:00:13.439 DEBUG (MainThread) [custom_components.edgeos.component.api.api] Loading sys_info data
2022-11-02 00:00:13.498 DEBUG (MainThread) [custom_components.edgeos.component.api.api] Loading dhcp_stats data
2022-11-02 00:00:13.572 DEBUG (SyncWorker_30) [custom_components.edgeos.component.managers.home_assistant] Loading entities
2022-11-02 00:00:13.707 DEBUG (MainThread) [custom_components.edgeos.component.api.api] Loading dhcp-leases data
2022-11-02 00:00:13.945 DEBUG (MainThread) [custom_components.edgeos.component.managers.home_assistant] Extracting API Data
2022-11-02 00:00:14.572 DEBUG (SyncWorker_21) [custom_components.edgeos.component.managers.home_assistant] Loading entities
2022-11-02 00:00:15.573 DEBUG (SyncWorker_24) [custom_components.edgeos.component.managers.home_assistant] Loading entities
2022-11-02 00:00:16.432 WARNING (MainThread) [homeassistant.components.sensor] Updating snmp sensor took longer than the scheduled update interval 0:00:10
2022-11-02 00:00:16.433 WARNING (MainThread) [homeassistant.helpers.entity] Update of sensor.edgeos_cpu_1 is taking over 10 seconds
2022-11-02 00:00:16.434 WARNING (MainThread) [homeassistant.helpers.entity] Update of sensor.edgeos_cpu_2 is taking over 10 seconds
2022-11-02 00:00:16.434 WARNING (MainThread) [homeassistant.helpers.entity] Update of sensor.edgeos_cpu_3 is taking over 10 seconds
2022-11-02 00:00:16.435 WARNING (MainThread) [homeassistant.helpers.entity] Update of sensor.edgeos_cpu_4 is taking over 10 seconds
2022-11-02 00:00:16.574 DEBUG (SyncWorker_28) [custom_components.edgeos.component.managers.home_assistant] Loading entities
2022-11-02 00:00:17.574 DEBUG (SyncWorker_31) [custom_components.edgeos.component.managers.home_assistant] Loading entities
2022-11-02 00:00:18.575 DEBUG (SyncWorker_18) [custom_components.edgeos.component.managers.home_assistant] Loading entities
2022-11-02 00:00:19.575 DEBUG (SyncWorker_19) [custom_components.edgeos.component.managers.home_assistant] Loading entities
2022-11-02 00:00:20.576 DEBUG (SyncWorker_30) [custom_components.edgeos.component.managers.home_assistant] Loading entities
2022-11-02 00:00:21.577 DEBUG (SyncWorker_21) [custom_components.edgeos.component.managers.home_assistant] Loading entities
2022-11-02 00:00:22.578 DEBUG (SyncWorker_46) [custom_components.edgeos.component.managers.home_assistant] Loading entities
2022-11-02 00:00:23.579 DEBUG (SyncWorker_5) [custom_components.edgeos.component.managers.home_assistant] Loading entities
2022-11-02 00:00:24.580 DEBUG (SyncWorker_47) [custom_components.edgeos.component.managers.home_assistant] Loading entities
2022-11-02 00:00:25.580 DEBUG (SyncWorker_3) [custom_components.edgeos.component.managers.home_assistant] Loading entities
2022-11-02 00:00:26.433 WARNING (MainThread) [homeassistant.components.sensor] Updating snmp sensor took longer than the scheduled update interval 0:00:10
2022-11-02 00:00:26.580 DEBUG (SyncWorker_37) [custom_components.edgeos.component.managers.home_assistant] Loading entities
2022-11-02 00:00:27.581 DEBUG (SyncWorker_45) [custom_components.edgeos.component.managers.home_assistant] Loading entities
2022-11-02 00:00:28.582 DEBUG (SyncWorker_2) [custom_components.edgeos.component.managers.home_assistant] Loading entities
2022-11-02 00:00:29.583 DEBUG (SyncWorker_14) [custom_components.edgeos.component.managers.home_assistant] Loading entities
2022-11-02 00:00:30.583 DEBUG (SyncWorker_12) [custom_components.edgeos.component.managers.home_assistant] Loading entities
2022-11-02 00:00:31.412 DEBUG (MainThread) [custom_components.edgeos.component.api.websocket] Reinitializing WebSocket connection
2022-11-02 00:00:31.412 INFO (MainThread) [custom_components.edgeos.component.managers.home_assistant] WS Status changed to Connecting, API Status: Connected
2022-11-02 00:00:31.526 INFO (MainThread) [custom_components.edgeos.component.managers.home_assistant] WS Status changed to Connected, API Status: Connected
2022-11-02 00:00:31.526 INFO (MainThread) [custom_components.edgeos.component.api.websocket] Starting to listen connected
2022-11-02 00:00:31.526 DEBUG (MainThread) [custom_components.edgeos.component.api.websocket] Subscription data to be sent: 164{"SUBSCRIBE":[{"name":"export"},{"name":"interfaces"},{"name":"system-stats"},{"name":"discover"}],"UNSUBSCRIBE":[],"SESSION_ID":"1c7a652b34a44b1fbc622f296284dc88"}
2022-11-02 00:00:31.527 INFO (MainThread) [custom_components.edgeos.component.api.websocket] Subscribed to WS payloads
2022-11-02 00:00:31.584 DEBUG (SyncWorker_8) [custom_components.edgeos.component.managers.home_assistant] Loading entities

so fingers crossed :-)

elad-bar commented 1 year ago

really glad to see that it might got solved with latest version, in general I see within the last log you have sent that around the time the WS connection got disconnected from EdgeOS, the SNMP sensors you have didn't manage to get response from the EdgeOS as well, meaning, as I suspected before, you have an overload on the network around that time, which can be resulted by a load on the PiHole.

PiHole load is not bound to CPU, disk or memory, it's a configuration that limits the number of requests to the DNS server, usually 1k requests per minute, Usually in rich IoT environment around 12:00 AM there are lots of requests running which can lead to that kind of problem, anyway, the solution I have added is for every ping being sent (every 55 seconds) there is a validation for connectivity, if got disconnected, the integration will try to reconnect.

thanks for the details and please update how it works in the upcoming few days before closing this issue.

MagieMalone commented 1 year ago

well, the snmp got disconnected because I forgot to save the config, and I had an unrelated crash of the edgeos router :-) I will let you know how the comming days go.

MagieMalone commented 1 year ago

Too bad, I had a failure again last night again. so, my ping to google keeps running:

image

however, the connection to the router fails to connect for some reason, even though the traffic to google is using exactly the same route:

This error originated from a custom integration.

Logger: custom_components.edgeos.component.api.api
Source: custom_components/edgeos/component/api/api.py:319
Integration: Ubiquiti EdgeOS Routers (documentation, issues)
First occurred: 00:00:03 (1 occurrences)
Last logged: 00:00:03

Request failed, URL: https://192.168.3.1/api/edge/get.json, Error: Cannot connect to host 192.168.3.1:443 ssl:False [Connect call failed ('192.168.3.1', 443)], Line: 194, Retry attempt #3

as you can see from the error, I don't use DNS for the connection to the router, so I don't think it's pihole.

cpu's of both router and esxi are not high, and at least the automated reload is working now.

image
elad-bar commented 1 year ago

This error occurred because the router web API (not WebSockets persistent connection) was inaccessible, nothing the integration can do with it, As you can see, it happened once for 3 attempts and continue regularly in the next attempt.

This issue is not related to the integration but to the router it self, CPU, RAM and network traffic doesn't matter for that case as the device HTTP server is the one that not responding (Overload or restart of the service) at the same time on daily basis.

BTW if you are using the IP instead of Hostname it rules out the DNS server.

MagieMalone commented 1 year ago

This night there was no disconnect.

MagieMalone commented 1 year ago

no disconnects so far, seems this one is solved.