dieselrabbit / screenlogicpy

Interface for Pentair Screenlogic connected pool controllers over IP in Python
GNU General Public License v3.0

Error using screenlogicpy across VLANs #18

Open uvjustin opened 2 years ago

uvjustin commented 2 years ago

Coming over from https://github.com/home-assistant/core/issues/55398. The API and HA integration work fine when HA and the ScreenLogic device are on the same VLAN/subnet. When they are on different VLANs, discovery still works, but the connection seems to fail after that. Running python -m screenlogicpy.cli -i xxx.xxx.xxx.xxx -p 80 get json exits silently with no output. It looks like the TCP socket opens and the initial connection is made, and the first takeMessage() looks correct: it comes back with a (sndCode, msgCode, msgLen, message) of (0, 15, 24, some data). However, the second takeMessage() comes back with (0, 13, 0, b''), so it ends up failing here: https://github.com/dieselrabbit/screenlogicpy/blob/c1554556fcb52c8094adf314eba00fd079a7ab73/screenlogicpy/requests/utility.py#L23
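To make the tuples above easier to read, here is a minimal sketch of how such a frame might be split into (sndCode, msgCode, msgLen, message). It assumes a little-endian header of two 16-bit fields followed by a 32-bit payload length. That layout and the names take_message and make_message are assumptions for illustration, not the library's actual code.

```python
import struct

# Assumed header layout (not confirmed by this thread): sender ID (uint16),
# message code (uint16), payload length (uint32), all little-endian.
HEADER = struct.Struct("<HHI")


def make_message(sender_id: int, msg_code: int, payload: bytes = b"") -> bytes:
    """Build a frame: header followed by the raw payload (hypothetical helper)."""
    return HEADER.pack(sender_id, msg_code, len(payload)) + payload


def take_message(data: bytes):
    """Split a frame into (sender_id, msg_code, msg_len, payload) (hypothetical helper)."""
    if len(data) < HEADER.size:
        raise ValueError("frame is shorter than the header")
    sender_id, msg_code, msg_len = HEADER.unpack_from(data)
    payload = data[HEADER.size:HEADER.size + msg_len]
    if len(payload) != msg_len:
        raise ValueError("payload is shorter than the declared length")
    return sender_id, msg_code, msg_len, payload
```

Under these assumptions, the second response reported above would be a bare header with message code 13 and an empty payload.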

uvjustin commented 2 years ago

Just to follow up, the problem happens after sending the login message here: https://github.com/dieselrabbit/screenlogicpy/blob/c1554556fcb52c8094adf314eba00fd079a7ab73/screenlogicpy/requests/login.py#L73 . I actually tested it on the old pre-async library, so I'll try it again on the current library a little later. I looked at the other SL libraries and the protocol documentation at https://github.com/ceisenach/screenlogic_over_ip and everyone seems to use the same login message. Maybe the adapter just doesn't like logins from outside the subnet, and I'm not sure whether there is a login message tweak that will fix that.

dieselrabbit commented 2 years ago

Thank you for the detailed info! Unfortunately, it doesn't look like good news.

The fact that screenlogicpy is getting a response means that the TCP socket connection isn't the problem. msgCode 13, however, is effectively the protocol adapter saying "I don't understand" or "I don't accept your request".

Based on your information of the sequence of events, it appears to be the response to the LOCALLOGIN_QUERY request. This seems to indicate that the protocol adapter itself is not 'happy' with performing a local logon over VLAN or different subnet.

Looking around online, it seems that this may not be unique to this API and may even happen with the official ScreenLogic app: https://www.troublefreepool.com/threads/screenlogic2-not-showing-anything-in-the-local-systems-of-the-login-screen.171636/post-1514649 and https://community.netgear.com/t5/Orbi/Pentair-Protocol-Adaptor/m-p/1738107

That said, I want to keep this open as I'd like to spin up a vlan and run some tests, maybe trying different logon methods.

If nothing else, I'll fix the silent failure in takeMessage(). That's an actual bug.
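As a rough illustration of what "not failing silently" could look like, the sketch below raises an exception when a response carries message code 13 instead of returning nothing. The header layout, the check_response helper, and the RequestRejected class are all assumptions made for this example. The actual fix in the library may look quite different.

```python
import struct

# Assumed header layout: sender ID (uint16), message code (uint16),
# payload length (uint32), all little-endian.
HEADER = struct.Struct("<HHI")

# Message code that this thread associates with "I don't accept your request".
REJECTED_CODE = 13


class RequestRejected(Exception):
    """Hypothetical error for a response that carries the rejection code."""


def check_response(data: bytes, expected_code: int) -> bytes:
    """Return the payload, or raise if the response is rejected or unexpected."""
    sender_id, msg_code, msg_len = HEADER.unpack_from(data)
    if msg_code == REJECTED_CODE:
        # Surface the rejection loudly instead of exiting with no output.
        raise RequestRejected(f"Request explicitly rejected: {sender_id}")
    if msg_code != expected_code:
        raise ValueError(f"Unexpected message code {msg_code}, wanted {expected_code}")
    return data[HEADER.size:HEADER.size + msg_len]
```

A caller can then catch RequestRejected and report a failed logon rather than printing nothing.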

...And you just came to the same conclusion as I was writing this. 🥇

dieselrabbit commented 2 years ago

Providing an update to this, the takeMessage() function was fixed in v0.5.1. Unfortunately, I am unable to reproduce the msgCode 13 behavior.

I've finally set up a vlan/separate subnet with the Pentair protocol adapter and the following firewall rules:

Obviously, discovery doesn't work as that is a subnet broadcast, but direct connection to the protocol adapter via its IP address is working fine for me.

I was also able to confirm that Home Assistant tracked the IP change of the protocol adapter after a reload of the integration and is communicating fine across vlans/subnets.

I did add a new method to the dev branch, screenlogicpy.requests.logon.create_local_login_message(), that recreates the logon message the PC application sends, but as I'm not running into a problem with the existing implementation, I haven't tested it beyond basic functionality. But it's there for anyone to play with.

At this point, it doesn't seem like there is an issue with the library itself. If you have more information, I'll be happy to take a look. Otherwise, I can close this in a few days.

uvjustin commented 2 years ago

Thanks for the update. It's curious that you are unable to reproduce this - in addition to me it seems like a few others have experienced the same issue (@strouja has a thumbs up here, and @burntoc was the one who first reported the issue on HA core). I have the same general firewall rules, but perhaps there are some minor details which are different which end up causing the issue. I'm actually remote from the SL installation now so it will be easier for me to look into this when I am there next month.

rsumner commented 2 years ago

I'm running into a similar situation. If my client is on a different subnet/VLAN, then I actually get a "Request explicitly rejected: 1" error. If I move the client to the same subnet, then no errors are produced.

I don't have any restrictions on the pool net or the main net, but I am bothered by the amount of Internet chatter the protocol adapter makes to the outside, so I may block that. I digress, and that's for another day, so here are the details on my environment and the error:

I'm running v0.5.4, but I was on 0.4.3 prior to today. Before the upgrade, the client would just return nothing, but I'm definitely getting more info now.

% screenlogicpy -i 192.168.10.2 -p 80 -v get json
Fatal error: protocol.data_received() call failed.
protocol: <screenlogicpy.requests.protocol.ScreenLogicProtocol object at 0x1057caa30>
transport: <_SelectorSocketTransport fd=6 read=polling write=<idle, bufsize=0>>
Traceback (most recent call last):
  File "/usr/local/Cellar/python@3.9/3.9.10/Frameworks/Python.framework/Versions/3.9/lib/python3.9/asyncio/selector_events.py", line 870, in _read_ready__data_received
    self._protocol.data_received(data)
  File "/Users/rsumner/src/screenlogicpy/lib/python3.9/site-packages/screenlogicpy/requests/protocol.py", line 48, in data_received
    raise ScreenLogicError(f"Request explicitly rejected: {messageID}")
screenlogicpy.const.ScreenLogicError: Request explicitly rejected: 1
Failed to logon to gateway

I can provide packet captures on the switch if that would be helpful.

rsumner commented 2 years ago

I have a Layer 3 switch that segments my pool and main networks. To temporarily solve this problem (which is preventing Home Assistant from communicating with my Pentair protocol adapter), I configured the router/switch to SNAT the HTTP traffic destined for the protocol adapter, changing the source IP of the traffic to that of the VLAN interface on the router/switch. This way, the protocol adapter sees the traffic as coming from a device on the same network. When doing this, it works like a charm.

@dieselrabbit I'm assuming this is some sort of lame security ACL on the protocol adapter itself and not a problem with the library, but would love to hear your thoughts.

burntoc commented 2 years ago

Oh man, thank you for sharing this - it never crossed my mind! I was doing all sorts of multicast forwarding and stuff and it just didn't work. Your fix took me like 60 seconds and - bang - I'm in business!

dieselrabbit commented 2 years ago

@uvjustin Yes, I was actually surprised that I was unable to reproduce the issue. I had to double-check my settings, and that the firewall was working as intended.

@rsumner Your error with v0.5.4 is as expected and matches what others have reported: the protocol adapter is responding, but with the same message code it would return if you sent it a junk request (an invalid message code or improper data in the message). Very little of the protocol is documented, so I'm making an assumption that the response is an "Explicit rejection".

Since other online posts for other connection methods/APIs make mention of trouble connecting to a protocol adapter from a different subnet, it stands to reason that it is some sort of ACL. My next course of action was to try the create_local_login_message() method mentioned above to connect, but since I can't reproduce the issue in the first place, I never went that far.

As for why I don't get the issue, it's possible my Unifi system is by default automatically handling the traffic in a way that is acceptable to the protocol adapter, and my firewall rules to allow the desired traffic do nothing to disrupt that. That's the only thought I am left with at the moment.

rsumner commented 2 years ago

@dieselrabbit thanks for the feedback. I'm completely happy with using NAT to overcome the ACLs in the protocol adapter when operating across subnets. Thanks for the great work on this library. IMO, the only follow-up needed for this issue is a documentation update indicating that NAT is required for cross-subnet communication when using unicast.
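A quick way to tell whether a given client would be affected by this cross-subnet behavior is to check whether it shares a subnet with the protocol adapter. The sketch below uses Python's standard ipaddress module. The same_subnet helper, the example client addresses, and the /24 prefix length are assumptions for illustration.

```python
import ipaddress


def same_subnet(client_ip: str, adapter_ip: str, prefix_len: int) -> bool:
    """Return True if client_ip falls inside the adapter's subnet (hypothetical helper)."""
    # strict=False lets us pass a host address and still get its network.
    network = ipaddress.ip_network(f"{adapter_ip}/{prefix_len}", strict=False)
    return ipaddress.ip_address(client_ip) in network


# Example addresses are illustrative; a /24 prefix is assumed.
print(same_subnet("192.168.10.50", "192.168.10.2", 24))
print(same_subnet("192.168.1.50", "192.168.10.2", 24))
```

If the check returns False and logons are rejected, the SNAT workaround described in this thread may be worth trying.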

uvjustin commented 2 years ago

I ended up just doing the same as the others, using SNAT to get around the ACL issue. I would agree with the post above in just noting the behavior in the docs and suggesting SNAT as a potential workaround. BTW, I am also using a Unifi system. Perhaps the routing behavior may be different across our different routing devices, or there might also be some other key difference between our network setups. I'm using a USG, so I had to use a config.gateway.json file to implement the SNAT rule.

dieselrabbit commented 1 year ago

This has been brought back up as I am currently implementing retrying of requests and better error handling for unexpected responses. I would still like to understand this better, either to develop a workaround within screenlogicpy or to at least be able to explicitly describe the scenarios in which screenlogicpy won't work. It would be nice though, to be able to better handle the 'Login Rejected' response.

During the previous testing, I had a Layer 2 switch and believed I was separating my main and pool networks via a VLAN for the pool network. They had separate networks, separate IP ranges/subnets, and the pool network was tagged with a VLAN ID.

I recently upgraded to a Layer 3 switch, keeping and applying all the same network settings from before and I still am not having any problems connecting to the ScreenLogic protocol adapter. This is making me wonder if I am actually segmenting my networks properly.

If anyone is able and willing to assist me with investigating, I'd be very appreciative.

dieselrabbit commented 1 year ago

@uvjustin @rsumner Do you have a remote access password set on your protocol adapter?

rsumner commented 1 year ago

@dieselrabbit My protocol adapter is dead now, but when it was active it DID NOT have an access password set. I'm using https://github.com/tagyoureit/nodejs-poolController with a cheap RS485 USB adapter now.

dieselrabbit commented 1 year ago

@rsumner Doh. Sorry to hear it died, but glad you have a solution to maintain control.

Thanks for the info. To this day I've only ever been able to reproduce the login rejection when on a different subnet/vlan by setting a password on the protocol adapter.