bendavid / aiopylgtv

Library to control webOS based LG Tv devices
MIT License
143 stars 47 forks source link

Connection Hangs when TV turns Off #48

Closed abates closed 2 years ago

abates commented 2 years ago

I have been using Home Assistant with the WebOS integration for awhile and it has been working well. Recently I changed how my TVs are connected to my home network and it has caused some unexpected behavior with the integration. First some background:

My TVs have builtin Google Cast capability. However, we never cast directly to the TVs so their appearance in the cast options is confusing. In an effort to have the TVs no longer appear in the list of devices to cast to I put them in their own subnet to isolate in their own broadcast domain. Since the TVs are now in a different broadcast domain than client devices, the TVs no longer show up as castable. There are no firewall rules between my Home Assistant instance (on LAN1 below) and this new subnet (LAN2 below):

┌────┐  ┌──────┐  ┌────┐
│LAN1◄──┤Router├──►LAN2│
└──┬─┘  └──────┘  └──┬─┘
   │                 │
   │                 │
┌──▼──┐            ┌─▼──┐
│ HAS │            │ TV │
└─────┘            └────┘

Now to my problem:

When the TV is turned off, either via the TV itself/remote or by calling the service from Home Assistant the underlying TCP connection that aiopylgtv opens is never closed. When the TV turns off, it never sends a FIN packet and so the TCP session is in a hung state that will never close without action. Since aiopylgtv thinks there is an active TCP connection, there is never an attempt to reconnect when the TV is turned on again by the remote control. Therefore, if the TV is operated independently from Home Assistant, Home Assistant does not reflect the correct state.

I have a hypothesis why everything works fine on the same network: The TCP session does in fact get hung always, however when on the same subnet the ARP entry will expire relatively quickly. When the ARP entry expires, the operating system automatically closes the TCP session. The next time Home Assistant polls, there is no active TCP session so one is attempted. This repeats until the TV is turned on and a new TCP session is established.

When I moved my TVs into a new subnet, ARP was no longer in play and so the operating system shows the TCP session as open (as it should since no FIN packets are sent). This problem is arguably an LG problem, since the TV should close any TCP sessions that can't ever be used again. However, I've found that if all I do is call disconnect() when the power state changes to suspend, this cleans up the underlying TCP session and everything works again as expected (see the commit following this note).

I don't know if my solution is correct, or if I'm perhaps completely missing something, but I'd love any feedback you have on my circumstances.

chros73 commented 2 years ago

Interesting, so is it happening when it subscribes to power state? If so, the solution looks good to me, if the TV signals its state change to the client. 1 more note, instead of the check you did (if "state" in self._power_state and self._power_state["state"] == "Suspend":) can you try the is_on property? (it covers more cases and it's in a central place): if not self.is_on:

It should work as well, because _power_state defaults to Unknown, so it shouldn't disconnect straight away.

Edit: and can't you just use await self.disconnect() instead of asyncio.create_task(self.disconnect())?

abates commented 2 years ago

The problem occurs whenever a connection is made. Whenever Home Assistant polls the integration, it will establish a new TCP connection if none exists (see here) . Nothing ever disconnects the session, so when the TV turns off, my Linux system shows an established TCP session in perpetuity.

The TV does send a power state change event to the client, it just never closes the TCP session. Like I said before, this really boils down to the TV not sending a FIN packet prior to powering off, but there isn't much I can do about fixing that. If there were a way to turn on TCP keepalives that might fix it as well, but my observation is that hung sessions take forever to time out from unacknowledged keep alive packets.

I've made the change to check the is_on property in my environment and will let it run a few days and let you know.

Thank you for your help.

abates commented 2 years ago

Edit: and can't you just use await self.disconnect() instead of asyncio.create_task(self.disconnect())?

So I tried this before and it causes the whole integration to lock up. Somewhere in the client code there is a race condition or deadlock that is caused by awaiting in set_power_state/disconnect. It's really weird because it was only intermittent and I haven't been able to reliably reproduce the deadlock. Async programming is not my strongest area, so I struggled to determine the cause.

chros73 commented 2 years ago

So I tried this before and it causes the whole integration to lock up. Somewhere in the client code there is a race condition or deadlock that is caused by awaiting in set_power_state/disconnect.

Can you try my fork whether the same issue exists? (Some parts were refactored, unfortunatly for you couple of methods are also changed/removed so it's not just a quick replacement).

And one thing popped into my mind: I guess ping_interval is not 0 in your case. Can ping requests go through your subnets? If so, then the same can be applied to ping_handler method as well I guess.

thecode commented 2 years ago

So I tried this before and it causes the whole integration to lock up. Somewhere in the client code there is a race condition or deadlock that is caused by awaiting in set_power_state/disconnect.

Can you try my fork whether the same issue exists? (Some parts were refactored, unfortunatly for you couple of methods are also changed/removed so it's not just a quick replacement).

And one thing popped into my mind: I guess ping_interval is not 0 in your case. Can ping requests go through your subnets? If so, then the same can be applied to ping_handler method as well I guess.

@chros73 I can publish the WebOS Home Assistant integration adjusted to your fork, I already did that and using it for few days, I had an error about StorageSqliteDict.create performing Non-thread-safe operation, I didn't give it too much importance since I intend to use another storage, but maybe we can fix this I can change the default Home Assistant library to your fork. You are correct about the ping, as far as I understand currently the ping is used to detect when the TV is off.

@abates let me know if you would like to test a library based on the bscpylgtv as a custom integration, you can install to check and remove it to get back to the official one (without need to pair the TV again), the exception above is not relevant once the TV is already paired. If you are interested I can push a version for testing to my repo.

abates commented 2 years ago

@thecode I'd be happy to test a custom integration for the WebOS integration. Let me know what repo to use.

Thanks for all the help, everyone!

chros73 commented 2 years ago

I had an error about StorageSqliteDict.create performing Non-thread-safe operation

Strange, we can look into it, thanks.

thecode commented 2 years ago

Hi @abates, I have created an updated integration based on @chros73 fork, you can find it here: https://github.com/thecode/home-assistant-custom-components/tree/main/webostv/custom_components/webostv Please install it as a custom component in Home Assistant, I have tested it on my production environment and it works good, but I am not using VLANs.

Using this as a custom component will also allow you to run using a bscpylgtv directly from @chros73 git if needed so we can change code on the fly to try to resolve the issue.

In my opinion if you still experience a problem with this version which is related to bscpylgtv lets make a new issue in the bscpylgtv repository and follow it up from there.

Many thanks to @chros73 and for you in helping to improve the Home Assistant WebOS TV integration.

chros73 commented 2 years ago

I had an error about StorageSqliteDict.create performing Non-thread-safe operation

As discussed offline, it turned out that this issue also exists with aiopylgtv and it's probably a Home Assistant issue. (Note that I don't use Home Assistant myself, I use my fork as a command line tool by itself, plus couple of small python scripts that are in the docs/scripts dir.)

@abates, let me know if your original issue still exist with my fork (I don't have subnets here). If so, then you can create a PR there (use develop branch for that) and we can carry on from there.

thecode commented 2 years ago

@abates I bumped the custom integration library to 0.2.9 which should fix the connection hang problem. Please test and let us know. https://github.com/thecode/home-assistant-custom-components/tree/main/webostv/custom_components/webostv

Thanks for the quick fix @chros73

chros73 commented 2 years ago

@abates, I implemented your fix in a slightly different way, I hope that solves the race condition you had. Also, this didn't need to be added into ping_handler, but I included the self.is_on check when sending pings.

thecode commented 2 years ago

So I have two HA instances running on the same machine, 1 is using version 0.2.9 and one is using version 0.2.8 + the reconnect fix (this was in order to test that the fix works well) unfortunately the one running 0.2.9 lost tracking of the status of the TV (meaning it shows that the TV is off when it is on).

abates commented 2 years ago

Okay, I'm running the custom integration that is using v0.2.9 of bscpylgtv, I just restarted my HA instance and will let you know what I have found. I don't know if I'll be able to play around with it tonight or not.

thecode commented 2 years ago

Okay, I'm running the custom integration that is using v0.2.9 of bscpylgtv, I just restarted my HA instance and will let you know what I have found. I don't know if I'll be able to play around with it tonight or not.

Try different scenarios, turning the TV on/off either by HA or by another device (such as the remote), there is a difference if you turn the TV via the integration or via another source.

abates commented 2 years ago

Okay, so when I turned the TV on with the remote it immediately changed state in home assistant and showed 'on'. I also saw an established TCP session between my HA instance and the TV. When I turned the TV off with the remote, the TCP connection was closed and HA state changed to off. After that, I tried turning the TV on/off through home assistant. Turning the TV on worked (using wake-on-lan for my on action) but no TCP session was established and the state never changed from 'off' to 'on'. Also, there is no exception trace in my log, so I'm really not sure what happened.

thecode commented 2 years ago

Okay, so when I turned the TV on with the remote it immediately changed state in home assistant and showed 'on'. I also saw an established TCP session between my HA instance and the TV. When I turned the TV off with the remote, the TCP connection was closed and HA state changed to off. After that, I tried turning the TV on/off through home assistant. Turning the TV on worked (using wake-on-lan for my on action) but no TCP session was established and the state never changed from 'off' to 'on'. Also, there is no exception trace in my log, so I'm really not sure what happened.

This is exactly what I have experienced, I can push a version which is based on 0.28 + a fix I did which should work

abates commented 2 years ago

Just an update from my previous comment:

Now turning on/off either through the remote or through HA doesn't seem to change anything. It seems like the integration is hung up, but just for that TV. I have a different LG TV that seems to be working as expected.

abates commented 2 years ago

This is exactly what I have experienced, I can push a version which is based on 0.28 + a fix I did which should work

Let me know when you push the commit and I will try. I think I'm home for the afternoon and have some time to work on this today if you do as well.

thecode commented 2 years ago

Updated: https://github.com/thecode/home-assistant-custom-components/tree/client-reconnect-0.2.8

abates commented 2 years ago

Ok, I checked out the 0.2.8 version and restarted. I'll let you know how it goes.

On a side note, when I resarted homeassistant I got the following message:

2021-12-27 13:58:54 WARNING (MainThread) [homeassistant.runner] Task could not be canceled and was still running after shutdown: <Task pending name='Task-775439' coro=<WebOsClient.callback_handler() running at /usr/local/lib/python3.9/site-packages/bscpylgtv/webos_client.py:314> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0xb0cc0940>()]>>
abates commented 2 years ago

I'm cautiously optimistic. I've tested it for about 10 minutes now and can't get a failure. It looks like the TCP sessions are behaving like I would expect (for the most part). It does seem like it takes awhile for sessions to be closed after the TV turns off, but it doesn't seem to affect the performance within HA.

abates commented 2 years ago

Ok, I just turned on the TV and it is no longer updating in HA. At this point I don't know if the problem is in aiopylgtv or within the WebOS integration.

Edit: It looks like the TCP session is no longer disconnecting, so I have a hung TCP session for the TV even when it is turned off.

abates commented 2 years ago

@thecode the custom component doesn't work for me, it's missing the disconnect that I proposed in set_power_state. I've added in the change that @chros73 proposed in this issue and will let you know if things are working for me.

abates commented 2 years ago

@chros73 so I'm still having a minor problem. I'm using version 0.1.3 of bscpylgtv and things still aren't quite right. I added logging to make the set_power_state look like this:

async def set_power_state(self, payload):
    self._power_state = {"state": payload.get("state", "Unknown")}

    if not self.is_on:
        _LOGGER.warning("Attempting to disconnect from %s", self.ip)
        await self.disconnect()
        _LOGGER.warning("Disconnected from %s", self.ip)
    elif self.state_update_callbacks and self.doStateUpdate:
        await self.do_state_update_callbacks()

The second logging statement is never reached. It seems that await never returns. Additionally, I get this message in my log file when I restart home assistant:

2022-01-15 16:04:10 WARNING (MainThread) [homeassistant.runner] Task could not be canceled and was still running after shutdown: <Task
 pending name='Task-2854344' coro=<WebOsClient.callback_handler() running at /usr/local/lib/python3.9/site-packages/bscpylgtv/webos_cl
ient.py:314> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0xa18939d0>()]>>

This is the same problem I had when I implemented my original fix with asyncio.create_task(self.disconnect())

I really have no explanation why the await hangs forever.

Edit: Should I close this issue and open a new one in bscpylgtv?

chros73 commented 2 years ago

Should I close this issue and open a new one ...

We can stay here if @bendavid doesn't mind or create a pull request over there (Issues are not enabled in my repo for multiple reasons).

The second logging statement is never reached. It seems that await never returns. This is the same problem I had when I implemented my original fix with asyncio.create_task(self.disconnect())

So, do you say that it doesn't matter which one we use, do we have the same outcome with both?

Can you write down the steps with which we can replicate the issue 100% of the time?

abates commented 2 years ago

The second logging statement is never reached. It seems that await never returns. This is the same problem I had when I implemented my original fix with asyncio.create_task(self.disconnect())

So, do you say that it doesn't matter which one we use, do we have the same outcome with both?

I apologize for not clearly stating my observations. When set_power_state awaits self.disconnect() it always hangs for one of my two LG tvs. When set_power_state creates a new task (asyncio.create_task(self.disconnect())) it never hangs.

Can you write down the steps with which we can replicate the issue 100% of the time?

I've created a gist with my test script and output logs.

Steps to reproduce:

1) Run the script 2) Turn the TV on with the remote control 3) Wait for script to log active state 4) Turn the TV off with the remote control

When set_power_state awaits disconnect it will hang every time, when set_power_state calls create_task it appears to disconnect and work correctly.

thecode commented 2 years ago

When you run asyncio.create_task(self.disconnect()) you run the disconnect method on another context so you don't really know if it hangs, it would be the same as if you run something on another process. I am pretty sure it acts the same and the task stay hang forever.

Note that we have switched in home assistant to use our on PyPI package because we must be able to mange issues and releases. I am getting notifications for this issue since I am subscribed to it, but no one from the Home Assistant Team follows this.

You can try this script: https://github.com/home-assistant-libs/aiowebostv/blob/main/examples/reconnect.py

Which is very similar to your test script and is very close to what we currently use in Home Assistant, but I don't expect good news for you as it was mostly focused on getting something stable and close to the current state. Once you have results with the test script we can follow up as an issue in aiowebostv and try to fix it.

The script you linked fails on my setup also and is not related to your specific problem, my TV however is on the same VLAN which is running Home Assistant.

abates commented 2 years ago

Closing this issue, will open a new issue in home-assistant-libs/aiowebostv