BioSehnsucht / ha-elkm1

Home Assistant component, platforms for Elk M1 Gold and similar alarm / integration panels
13 stars 2 forks source link

Reconnect failed #13

Closed plouj closed 6 years ago

plouj commented 6 years ago

I was playing around with the gwww-elkm1-lib branch (83f799e34165) and ran into this error when my m1 restarted:

Jul 02 17:06:13 hassbian hass[1375]: 2018-07-02 17:06:13 INFO (MainThread) [homeassistant.core] Bus:Handling <Event state_changed[L]: old_state=<state sensor.elkm1_panel_001=Normal; friendly_name=ElkM1, Real Time Clock=4702132020718110, Elk M1 Version=05030A, Elk M1XEP Version=05030A, ElkRP=Disconnected @ 2018-07-02T13:01:57.062109-04:00>, new_state=<state sensor.elkm1_panel_001=Normal; friendly_name=ElkM1, Real Time Clock=1703132020718110, Elk M1 Version=05030A, Elk M1XEP Version=05030A, ElkRP=Disconnected @ 2018-07-02T13:01:57.062109-04:00>, entity_id=sensor.elkm1_panel_001>
Jul 02 17:06:30 hassbian hass[1375]: 2018-07-02 17:06:30 INFO (MainThread) [homeassistant.core] Bus:Handling <Event state_changed[L]: old_state=<state sun.sun=above_horizon; next_setting=2018-07-03T01:03:03+00:00, next_noon=2018-07-02T17:21:00+00:00, next_dawn=2018-07-03T09:03:40+00:00, azimuth=169.89, friendly_name=Sun, next_dusk=2018-07-03T01:38:58+00:00, next_rising=2018-07-03T09:39:32+00:00, next_midnight=2018-07-03T05:21:20+00:00, elevation=68.85 @ 2018-07-02T13:01:54.842223-04:00>, new_state=<state sun.sun=above_horizon; next_setting=2018-07-03T01:03:03+00:00, next_noon=2018-07-02T17:21:00+00:00, next_dawn=2018-07-03T09:03:40+00:00, azimuth=170.52, friendly_name=Sun, next_dusk=2018-07-03T01:38:58+00:00, next_rising=2018-07-03T09:39:32+00:00, next_midnight=2018-07-03T05:21:20+00:00, elevation=68.88 @ 2018-07-02T13:01:54.842223-04:00>, entity_id=sun.sun>
Jul 02 17:06:32 hassbian hass[1375]: 2018-07-02 17:06:32 ERROR (MainThread) [homeassistant.core] Error doing job: Exception in callback Elk._reconnect()
Jul 02 17:06:32 hassbian hass[1375]: Traceback (most recent call last):
Jul 02 17:06:32 hassbian hass[1375]:   File "/usr/lib/python3.5/asyncio/events.py", line 126, in _run
Jul 02 17:06:32 hassbian hass[1375]:     self._callback(*self._args)
Jul 02 17:06:32 hassbian hass[1375]:   File "/srv/homeassistant/lib/python3.5/site-packages/elkm1/elk.py", line 93, in _reconnect
Jul 02 17:06:32 hassbian hass[1375]:     coro = self.connect()
Jul 02 17:06:32 hassbian hass[1375]:   File "/srv/homeassistant/lib/python3.5/site-packages/elkm1/elk.py", line 90, in connect
Jul 02 17:06:32 hassbian hass[1375]:     self.loop.run_until_complete(self._connect())
Jul 02 17:06:32 hassbian hass[1375]:   File "/usr/lib/python3.5/asyncio/base_events.py", line 454, in run_until_complete
Jul 02 17:06:32 hassbian hass[1375]:     self.run_forever()
Jul 02 17:06:32 hassbian hass[1375]:   File "/usr/lib/python3.5/asyncio/base_events.py", line 408, in run_forever
Jul 02 17:06:32 hassbian hass[1375]:     raise RuntimeError('This event loop is already running')
Jul 02 17:06:32 hassbian hass[1375]: RuntimeError: This event loop is already running
Jul 02 17:06:32 hassbian hass[1375]: 2018-07-02 17:06:32 INFO (MainThread) [homeassistant.core] Bus:Handling <Event system_log_event[L]: exception=Traceback (most recent call last):
Jul 02 17:06:32 hassbian hass[1375]:   File "/usr/lib/python3.5/asyncio/events.py", line 126, in _run
Jul 02 17:06:32 hassbian hass[1375]:     self._callback(*self._args)
Jul 02 17:06:32 hassbian hass[1375]:   File "/srv/homeassistant/lib/python3.5/site-packages/elkm1/elk.py", line 93, in _reconnect
Jul 02 17:06:32 hassbian hass[1375]:     coro = self.connect()
Jul 02 17:06:32 hassbian hass[1375]:   File "/srv/homeassistant/lib/python3.5/site-packages/elkm1/elk.py", line 90, in connect
Jul 02 17:06:32 hassbian hass[1375]:     self.loop.run_until_complete(self._connect())
Jul 02 17:06:32 hassbian hass[1375]:   File "/usr/lib/python3.5/asyncio/base_events.py", line 454, in run_until_complete
Jul 02 17:06:32 hassbian hass[1375]:     self.run_forever()
Jul 02 17:06:32 hassbian hass[1375]:   File "/usr/lib/python3.5/asyncio/base_events.py", line 408, in run_forever
Jul 02 17:06:32 hassbian hass[1375]:     raise RuntimeError('This event loop is already running')
Jul 02 17:06:32 hassbian hass[1375]: RuntimeError: This event loop is already running
Jul 02 17:06:32 hassbian hass[1375]: , timestamp=1530551192.0929613, message=Error doing job: Exception in callback Elk._reconnect(), source=/srv/homeassistant/lib/python3.5/site-packages/homeassistant/core.py, level=ERROR>

A restart of HA fixed this for me, but until then the web UI wouldn't even load.

BioSehnsucht commented 6 years ago

This may be fixed in latest - there were some improvements to handling reconnection when Elk is not reachable in Gwww lib, and I only just today updated the requirements to use that version.

I'll try to test this tomorrow to see if I can reproduce the problem on the old version / test if it's fixed on the new version.

gwww commented 6 years ago

This should be fixed in elkm1 lib (quite a while ago actually). Without a elkm1 lib version it is hard to tell. I suggest closing this and deal with it again if we see it.

BioSehnsucht commented 6 years ago

I have just tested this as fixed (as far as crashing / etc) for both rebooting M1 and also the M1XEP, at least as of the current commit on master. As such, I am closing this issue.

However, if the M1XEP goes away the TCP connection may not die immediately (after 5 minutes I gave up waiting and toggled an output to force reconnect) until HASS tries to send a message to Elk at which point it will reconnect, but until that happens HASS will miss any state changes in Elk at the time they're occurring because it's no longer really connected. I created #26 to track this.

lmamakos commented 6 years ago

A TCP connection with that's idle (e.g., has no unacknowledged data in flight) has no means of detecting if the far end disappears, unless you probe the connection. Some TCP stacks have a "KEEPALIVE" option (which is really "MAKE DEAD") to probe idle connections to discover missing peers. I know that the BSD and Linux stacks can do this; don't know if the default state is on, and what the probe interval might be.

I've seen this failure myself, as I've stupidly not got my M1XEP plugged into a UPS powered outlet, and I miss events that are generated. Had I know the reconnect thing works well, I'd poke at it to provoke that rather than restarting HASS. Of course, this usually only happens when I'm traveling away..

gwww commented 6 years ago

I can use the XK message as a heartbeat. Should be able to get to that tomorrow.

/g

On Tue, 28 Aug 2018 at 20:16, Louis Mamakos notifications@github.com wrote:

A TCP connection with that's idle (e.g., has no unacknowledged data in flight) has no means of detecting if the far end disappears, unless you probe the connection. Some TCP stacks have a "KEEPALIVE" option (which is really "MAKE DEAD") to probe idle connections to discover missing peers. I know that the BSD and Linux stacks can do this; don't know if the default state is on, and what the probe interval might be.

I've seen this failure myself, as I've stupidly not got my M1XEP plugged into a UPS powered outlet, and I miss events that are generated. Had I know the reconnect thing works well, I'd poke at it to provoke that rather than restarting HASS. Of course, this usually only happens when I'm traveling away..

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/BioSehnsucht/ha-elkm1/issues/13#issuecomment-416781488, or mute the thread https://github.com/notifications/unsubscribe-auth/AC6xoPgVortnn1Uh7GcRzyVYftkIRKgtks5uVd1sgaJpZM4U_yEU .

gwww commented 6 years ago

Crazy me. I pushed heartbeat code to github and new lib on PyPi. Only change to hass side of the code is requirement for 0.7.5 elkm1-lib.

I have it running on my prod Pi. We'll see how it does overnight.

On Tue, 28 Aug 2018 at 21:59, Glenn Waters gwwaters@gmail.com wrote:

I can use the XK message as a heartbeat. Should be able to get to that tomorrow.

/g

On Tue, 28 Aug 2018 at 20:16, Louis Mamakos notifications@github.com wrote:

A TCP connection with that's idle (e.g., has no unacknowledged data in flight) has no means of detecting if the far end disappears, unless you probe the connection. Some TCP stacks have a "KEEPALIVE" option (which is really "MAKE DEAD") to probe idle connections to discover missing peers. I know that the BSD and Linux stacks can do this; don't know if the default state is on, and what the probe interval might be.

I've seen this failure myself, as I've stupidly not got my M1XEP plugged into a UPS powered outlet, and I miss events that are generated. Had I know the reconnect thing works well, I'd poke at it to provoke that rather than restarting HASS. Of course, this usually only happens when I'm traveling away..

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/BioSehnsucht/ha-elkm1/issues/13#issuecomment-416781488, or mute the thread https://github.com/notifications/unsubscribe-auth/AC6xoPgVortnn1Uh7GcRzyVYftkIRKgtks5uVd1sgaJpZM4U_yEU .