andrewsayre / pyheos

An async python library for controlling HEOS devices through the HEOS CLI Protocol
Apache License 2.0
13 stars 9 forks source link

Unstable connection with Home Assistant #20

Open olalid opened 2 years ago

olalid commented 2 years ago

pyheos version with the issue: 0.7.2

Python version and environment: 3.8 Home Assistant

Description of problem: Connection is not stable In my Home Assistant and HEOS setup the connection to the HEOS devices are not stable. Typically it will only stay connected for a period of an hour up to 24 hours, sometimes indicating that some or all devices are disconnected or still showing a connected status for all devices but actually not working. When looking at the logs I could see that an occasional command would fail (timeout) and the lib would go all haywire and create at lest 2 reconnection tasks which would eventually fail such that it thinks it is connected but actually is not, and the only way to get it to work again is to reload the integration in Home Assistant or restart Home Assistant. I came to the conclusion that this happens when a lot of commands are queued at the same time, and in this situation it seems the HEOS device will just discard a command or maybe it just has a buffer overrun. I have 11 HEOS devices in my home which means that quite a lot of commands are sometimes queued up when the HA integration tries to get status of all devices, a guess may be that it is perhaps not that common to have this many devices and that it triggers the problem more often. So I did a small patch to add a lock function such that no new command is sent before the previous one is completed. And it now looks like the problem is gone, granted that it is random so test time may need to accumulate before it is certain.

I am not sure if this is a good way to solve this, it may be a Home Assistant unique problem, so maybe it should be solved there. Also, this is a kind of a quick workaround, it may have side effects and perhaps it would be better to limit the number of commands in the queue to a low number instead of just blocking until each command is completed. So I am writing this in search of comments and discussion about this issue. Would be interesting to know if someone else has similar stability issues also...

Changes:

In connection.py:HeosConnection:__init__ function:
        self._lock = asyncio.Lock()

In connection.py:HeosConnection:command function:
        await self._lock.acquire()
        # Send command
        try:
            self._writer.write((uri + SEPARATOR).encode())
            await self._writer.drain()
            response = await asyncio.wait_for(event.wait(), self.timeout)
        except (ConnectionError, asyncio.TimeoutError, OSError) as error:
            # Occurs when the connection breaks
            asyncio.ensure_future(self._handle_connection_error(error))
            message = format_error_message(error)
            _LOGGER.debug("Command failed '%s': %s", masked_uri, message)
            raise CommandError(command, message) from error
        finally:
            self._lock.release()

Additional information:

andrewsayre commented 2 years ago

Awesome work! Do you have the logs from HA when the original problem occurs, I'd be curious to see the sequence? There seems to be a few problems here:

  1. Connection gets disconnected (why is that happening? Is the device on an unstable network? poor wifi signal?)
  2. Multiple reconnect tasks kicking off -- does this change resolve that by it self?
  3. Command flooding, which if the device's connectivity is unstable, I could see why it drops some. (Note, HA dones't poll status except during the initial connect.) I'm not sure what the result of making commands synchronous would have on perceivable performance. The ones that take a while (like pulling play lists) need to block upstream anyway.

Not opposed to merging your PR--but I want to understand the issues better. :) I have 6 HEOS devices (two amps and a drive), but all are hardwired, and haven't seen this behavior. It never looses connection in HA and always recovered during testing (or if I unplugged the device HA was connected to). What is the topology of your devices and which device is HA connected to?

olalid commented 2 years ago

I have not saved any of the original logs unfortunately, but I honestly still struggling with making it completely stable. But I think I need to revise my error description a bit since I have learned a bit more since I wrote this.

  1. The lib would typically disconnect because of that a command timed out. Since this does not anymore happen anywhere close to as often when I limit the amount of simultaneous commands I assume it is because my environment somehow triggers a bug in the HEOS device. Maybe if you have a larger number of devices. I have no reason to think it is a network problem, the device HA connects to is connected to the same wired network as HA is. All my other devices are wireless but not the one HA connects to. So I think there is a value in that part of the PR that limits the number of active commands.

  2. I am not so sure about that statement anymore, I guess I understood more of the issue after I wrote the text above. The text in the PR is more correct about the reasoning behind the changes. But when a command timed out, the lib would disconnect and reconnect (but not always). The HEOS device would accept a new connection, but it would not correctly execute the first command sent to it. Again, I assume this is because of a bug in the device, when a command fails it seems the device ends up in a state where it is just not possible to send commands to it for a few minutes, some internal service maybe crashed and is restarting or something... And this in combination with a sort of a race condition in the lib made the lib end up in a strange state causes problems. In the _connect function, the state is is set to STATE_CONNECTED just before it calls register_for_change_events(). But if register_for_change_events() fails as it would typically do in my case, the state is left as STATE_CONNECTED and the lib will not really take any further actions after that since the reconnect task believes it's work is done and the lib thinks it is connected but it has not successfully registered for changes has not started the heartbeat task and will therefore get no information about what is going on with the HEOS devices. Restarting HA, or restarting the HEOS integration would typically solve this since the HEOS device typically will accept commands again after a few minutes.

  3. I have not noticed any performance problems after my changes.

But even if things are a lot more stable with my changes, I still see problems. So I think we need to give this a bit more time...

andrewsayre commented 2 years ago

The lib would typically disconnect because of that a command timed out.

From a heartbeat command or some other command? If some other command, was it to a different player than which one the lib is connected to?

The HEOS device would accept a new connection, but it would not correctly execute the first command sent to it. Again, I assume this is because of a bug in the device, when a command fails it seems the device ends up in a state where it is just not possible to send commands to it for a few minutes, some internal service maybe crashed and is restarting or something...

Very possible. I wonder if it would make sense to introduce a reconnect delay and/or backoff to help mitigate this.

olalid commented 2 years ago

From memory I think I have seen both heartbeat and browse commands having timeouts. But I will look out for that... Problem right now is that it can take like a week before I see any issue now, so testing stuff takes a lot of time...

Yes, agree, some sort of delay seems like a good idea.

olalid commented 2 years ago

This log might be interesting, this time it did not fail, but it would have failed before I did changes, because the register_for_change_events command fails, twice before it eventually executes successfully. In this case it is a heart_beat command that times out and triggers the disconnect.

2021-11-20 01:13:59 DEBUG (MainThread) [pyheos.connection] Command executed 'heos://system/heart_beat?sequence=1165': '{'command': 'system/heart_beat', 'result': 'success', 'messa
ge': ''}'
2021-11-20 01:14:19 DEBUG (MainThread) [pyheos.connection] Command failed 'heos://system/heart_beat?sequence=1166': Command timed out
2021-11-20 01:14:19 DEBUG (MainThread) [pyheos.connection] Disconnected from 192.168.168.157: 
2021-11-20 01:14:21 DEBUG (MainThread) [pyheos.connection] Failed to reconnect to 192.168.168.157: [Errno 113] Connect call failed ('192.168.168.157', 1255)
2021-11-20 01:14:34 DEBUG (MainThread) [pyheos.connection] Failed to reconnect to 192.168.168.157: [Errno 113] Connect call failed ('192.168.168.157', 1255)
2021-11-20 01:14:46 DEBUG (MainThread) [pyheos.connection] Failed to reconnect to 192.168.168.157: [Errno 113] Connect call failed ('192.168.168.157', 1255)
2021-11-20 01:14:58 DEBUG (MainThread) [pyheos.connection] Failed to reconnect to 192.168.168.157: [Errno 113] Connect call failed ('192.168.168.157', 1255)
2021-11-20 01:15:10 DEBUG (MainThread) [pyheos.connection] Failed to reconnect to 192.168.168.157: [Errno 113] Connect call failed ('192.168.168.157', 1255)
2021-11-20 01:15:30 DEBUG (MainThread) [pyheos.connection] Command failed 'heos://system/register_for_change_events?sequence=0&enable=on': Command timed out
2021-11-20 01:15:30 DEBUG (MainThread) [pyheos.connection] Failed to reconnect to 192.168.168.157: Command timed out
2021-11-20 01:15:50 DEBUG (MainThread) [pyheos.connection] Command failed 'heos://system/register_for_change_events?sequence=0&enable=on': Command timed out
2021-11-20 01:15:50 DEBUG (MainThread) [pyheos.connection] Failed to reconnect to 192.168.168.157: Command timed out
2021-11-20 01:16:02 DEBUG (MainThread) [pyheos.connection] Command executed 'heos://system/register_for_change_events?sequence=0&enable=on': '{'command': 'system/register_for_change_events', 'result': 'success', 'message': 'sequence=0&enable=on'}'
2021-11-20 01:16:02 DEBUG (MainThread) [pyheos.connection] Connected to 192.168.168.157
2021-11-20 01:16:02 DEBUG (MainThread) [pyheos.connection] Command executed 'heos://player/get_players?sequence=1': '{'command': 'player/get_players', 'result': 'success', 'message': ''}'
olalid commented 2 years ago

Ok, I have now been running the patch in the PR for quite a while... And I have not seen any cases where connection is lost completely anymore. The only thing that occasionally happens is that when the heos device crashes and restarrts it is not anymore logged in. But this is a different problem, so from my point of view I think the PR is ok as it is.

olalid commented 2 years ago

Actually, just found and fixed a bug I introduced that could cause a dead-lock during connection process. So big improvement here. Hopefully that solves the very last issues I have been having with recovery from disconnect.

olalid commented 1 year ago

@andrewsayre any possibility to make a new release soon? For me the patch I made is really changing the stability a lot. Without it typically I get less than 24 hours before I need to restart, but with it, it typically never fails unless there is an actual problem with on of the HEOS devices. To run the Home Assistant integration I need to apply this patch every time I upgrade. So would be great to get a new release done.

Johnwulp commented 1 year ago

Hi @olalid,

I think I have a similar issue, but i'm not sure. I have two HEOS systems which i both can't connect to. During the setup of the integration I get an error that there is no connection. Both my HA installtion and wired HEOS system are on the same subnet.

If i enable the debug logging i'm only seeing these 4 log entry's during the setup of the integration:

2023-05-08 22:02:02.156 DEBUG (MainThread) [pyheos.connection] Received response with no pending command: ''
2023-05-08 22:02:12.151 DEBUG (MainThread) [pyheos.connection] Command failed 'heos://system/register_for_change_events?sequence=0&enable=on': Command timed out
2023-05-08 22:02:12.151 DEBUG (MainThread) [pyheos.connection] Disconnected from 192.168.2.66
2023-05-08 22:02:12.152 DEBUG (MainThread) [pyheos.connection] Disconnected from 192.168.2.66: 

Could your fix resolve the issue that i'm having?

olalid commented 1 year ago

I think the error you are seeing here is probably not solved by this patch... I have seen the same thing happening recently in my setup too. When it has happened it has helped me to restart (all) my HEOS devices.

DannyvdSluijs commented 7 months ago

@olalid I mentioned this issue on the HomeAssistant repo as there seems to be a link. I have the same as @Johnwulp mentioned and restarting didn't resolve the issue for me. Not sure how I can help this but perhaps you have some things you want me to check out if that is helpful.

olalid commented 7 months ago

When the issue @Johnwulp mentions happens to me I need to restart not the HEOS device that I am trying to connect to (in fact that never helps), but some other random HEOS device on my network. Usually I just reboot them all because it is impossible to figure out which one until you reboot that specific one. I do not really think this is something we can do much about, to me it looks like a bug in the HEOS devices. If you try to connect to the HEOS device manually when the system is in this state it allows connection, but it immediately disconnects again as soon as you send any command.

I made some script to reboot all my HEOS devices like this: heos_reset_all:

ssdp-search | grep -i "urn:schemas-denon-com:device:ACT-Denon:1" | gawk 'match($0, /.*location=http:..(.*):.*/, a) {print "source heos_reset " a[1]}' | bash

heos_reset:

telnet $1 1255 <<EOF
heos://system/reboot
EOF

Maybe a service to do a reboot of all devices could be added to the HA component to allow for more easy recovery from this state?

olalid commented 4 months ago

@andrewsayre if you are unable to maintain pyheos lib anymore, any chance of allowing others to continue?

andrewsayre commented 3 months ago

@andrewsayre if you are unable to maintain pyheos lib anymore, any chance of allowing others to continue?

Sure, I'd be happy to add additional maintainers. Are you up for it? 😁

olalid commented 3 months ago

That was what I was implying, yes :)

imbaczek commented 2 months ago

Having just made a very minimal hack inside my HA container I'd be ecstatic to see a new release cut out! Please please please make it happen. I have to restart HA every couple days to avoid these:

 Error executing service: <ServiceCall media_player.media_stop (c:01HW64JQZ0M3RYC2E9JE4CT989): entity_id=['media_player.bedroom']>
Traceback (most recent call last):
  File "/usr/src/homeassistant/homeassistant/core.py", line 2558, in _run_service_call_catch_exceptions
    await coro_or_task
  File "/usr/src/homeassistant/homeassistant/core.py", line 2579, in _execute_service
    return await target(service_call)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/src/homeassistant/homeassistant/helpers/service.py", line 971, in entity_service_call
    single_response = await _handle_entity_call(
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/src/homeassistant/homeassistant/helpers/service.py", line 1043, in _handle_entity_call
    result = await task
             ^^^^^^^^^^
  File "/usr/src/homeassistant/homeassistant/components/heos/media_player.py", line 104, in wrapper
    await func(*args, **kwargs)
  File "/usr/src/homeassistant/homeassistant/components/heos/media_player.py", line 202, in async_media_stop
    await self._player.stop()
  File "/usr/local/lib/python3.12/site-packages/pyheos/player.py", line 262, in stop
    await self.set_state(const.PLAY_STATE_STOP)
  File "/usr/local/lib/python3.12/site-packages/pyheos/player.py", line 250, in set_state
    await self._commands.set_player_state(self._player_id, state)
  File "/usr/local/lib/python3.12/site-packages/pyheos/command.py", line 55, in set_player_state
    await self._connection.command(const.COMMAND_SET_PLAY_STATE, params)
  File "/usr/local/lib/python3.12/site-packages/pyheos/connection.py", line 273, in command
    self._writer.write((uri + SEPARATOR).encode())
    ^^^^^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'write'