dieselrabbit / screenlogicpy

Interface for Pentair Screenlogic connected pool controllers over IP in Python
GNU General Public License v3.0
17 stars 5 forks source link

Home Assistant automation dropping actions #31

Closed American1337 closed 1 year ago

American1337 commented 1 year ago

For HA automations, I'm able to get single light/switch automations working correctly. If I'm toggling larger # of entities I'm seeing actions get dropped.

Example below shows a script I want to run every night to ensure that everything is shutoff correctly. I also want to do a script to turn all the pool lights on/off.

The problem is if I don't put a delay, some of the actions get dropped. I find that 600ms has a good success rate. 0ms/50ms miss a few of the circuits.

This isn't a blocker for me, adding a delay is fine, just a bit awkward as I would have preferred a single light off action with a list of entities.

This is the yaml of the working automation. For broken case, delay can be changed to 0 or 50 ms.


alias: "[Pool] Turn Off All Circuits (Slow)" sequence:

dieselrabbit commented 1 year ago

Is there any commonality to which actions are getting dropped (ie. Always the last 3 in the script, or it's always the lights, etc.), or is it different ones every time?

American1337 commented 1 year ago

It's not consistent.

Attempt #1: I turned all circuits on, and then used the script I posted above with the delays removed. Note that "Pool" circuit is not in my script so it is expected to stay in on state, all other circuits should be turned off

image

Attempt #2: I turned all circuits on, and then used the script I posted above with the delays removed. Note that "Pool" circuit is not in my script so it is expected to stay in on state, all other circuits should be turned off.

The results aren't the same but script was identical.

image

Attempt #3: New script I moved lights AFTER switches. Now some lights are staying on. Like before ignore the pool circuit on

image

Script for attempt #3 (lights last)

alias: "[Pool] Turn Off All Circuits (Medium)" sequence:

(Edited to clarify the delays removed in attempt #1 and #2 similar to the script used in attempt #3)

American1337 commented 1 year ago

A few more experiments:

Attempt #4: A delay every 8 actions for 1 second. Within each group of 8 no delay

Result: still broken, some switches near the end of group don't change. Not consistent which are broken.

Attempt #5: A delay every 4 actions for 1 second. Within each group of 4 no delay

Result: all circuits updated correctly.

Script from attempt #4:

alias: "[Pool] Turn Off All Circuits (Burst)" sequence:

Script from attempt #5:

alias: "[Pool] Turn Off All Circuits (Burst)" sequence:

dieselrabbit commented 1 year ago

Interesting. Thanks for all those test cases.

Just to confirm, you verified that circuits were actually still left on in the pool equipment? Just trying to rule out this being just a delay in updating the states in HA.

American1337 commented 1 year ago

For the experiments, I've tried two ways:

First method:

  1. using Android pentair app, set all circuits ON
  2. home assistant run script
  3. exit Android pentair app
  4. restart Android pentair app
  5. view circuit status in Android pentair app

Second method:

  1. using HA, set all switches/lights on ON
  2. confirm circuits are on in Android pentair app
  3. home assistant run script
  4. exit Android pentair app
  5. restart Android pentair app
  6. view circuit status in pentair app

I've also confirmed visually. The automation without delays only turns off some of the lights. Then automation with delays turns them all off.

dieselrabbit commented 1 year ago

Well that pretty certainly rules out this being simply a delay in updating the states. Thanks for being so thorough.

The (hopefully) last thing I'd be interested in is Home Assistant logs for a not-working script being run with Debug logging turned on for the Pentair ScreenLogic integration. That should show what is actually being sent to the screenlogic protocol adapter and what is and isn't coming back. I'm wondering if the protocol adapter just can't keep up with that many calls that quickly.

You should be able to turn on Debug logging in the integration options from the Integrations page.

American1337 commented 1 year ago

Attaching logs.

I enabled debug logging then restarted HA, then used HA to turn on all the circuits, then ran the script.

After script was executed, I waited a bit, then captured log, diagnostic dump and screenshot of the dashboard.

Dashboard after script: image

American1337 commented 1 year ago

Attaching logs from a passing case. Turned out my 1 second delay every four switches wasn't enough to make it reliable, that failed a few times this morning. I added a 1 second delay per-switch and it passed a few times.

script-working-with-delays.txt passing-diagnostics-after-script-was-ran.txt *passing-home-assistant.log

Dashboard: image

dieselrabbit commented 1 year ago

This is great data. Thank you.

No solutions yet, but a few things I am seeing:

  1. HA seems to be requesting a complete updated pool status every time a circuit is switched. I don't think that should be happening. The HA integration should be "debouncing" the update requests so that only one status update happens per 2 seconds. This seems to be flooding the protocol adapter with a lot of incoming requests.
  2. The protocol adapter is responding to (I assume) the failed actions with a message code of 31, which other SL projects identify as a "bad parameter" response. If that code truly represents a bad parameter, then those circuits (and really all circuits) should never function. But as your previous testing shows, the circuits getting code 31 have worked previously. Wondering if 31 is more of a "I'm full, please wait" response.
  3. I'm surprised screenlogicpy is not throwing an error on the unexpected returned code 31. Only early versions cared heavily on the returned message code.

Like I said, no solution yet, but I have some things to investigate. Thanks again.

dieselrabbit commented 1 year ago

I don't think this is the final solution, but wondering if you could test this change in HA.

Currently the integration is using the debounce only to catch any late updates on the pool controller, and not actually debouncing refresh requests.

This change forces circuit changes (circuits and lights) to use the debounced refresh, and shortens the debounce cooldown to 0.5 seconds. This should reduce the requests to the protocol adapter dramatically when using your script, but hopefully still feel "snappy" enough when switching circuits on and off.

dieselrabbit commented 1 year ago

@American1337 Just curious, have you tried setting your script up like this:?

alias: "[Pool] Turn Off All Circuits (No Delay)"
sequence:
  - service: switch.turn_off
    data: {}
    target:
      entity_id: 
        - switch.pentair_20_17_74_jets
        - switch.pentair_20_17_74_slide
        - switch.pentair_20_17_74_spa
        - switch.pentair_20_17_74_spa_waterfall
        - switch.pentair_20_17_74_upr_pool
        - switch.pentair_20_17_74_upr_wtrfall
        - switch.pentair_20_17_74_water_feature
        - switch.pentair_20_17_74_waterfall
        - switch.pentair_20_17_74_upr_cleaner
        - switch.pentair_20_17_74_cleaner
  - delay:
      hours: 0
      minutes: 0
      seconds: 1
      milliseconds: 0
  - service: light.turn_off
    data: {}
    target:
        - light.pentair_20_17_74_color_wheel
        - light.pentair_20_17_74_color_wheel_2
        - light.pentair_20_17_74_fiber_optic
        - light.pentair_20_17_74_pool_light
        - light.pentair_20_17_74_pool_light_2
        - light.pentair_20_17_74_pool_light_3
        - light.pentair_20_17_74_pool_light_4
        - light.pentair_20_17_74_pool_light_5
        - light.pentair_20_17_74_u_pool_ligh
        - light.pentair_20_17_74_spa_light
mode: single
American1337 commented 1 year ago

The fix doesn't work. I'm 99% sure I merged in the change correctly.

This is the log with the change included. I also did a compact script. What you suggested is similar to what first tried and I only split it up to test with delays inserted.

dashboard-after-script

American1337 commented 1 year ago

FWIW, if I take out switch turn off and just do the last light action, the lights remaining all turn off.

American1337 commented 1 year ago

If I put 1 second a delay between switch.turn_off and light.turn_off, it didn't work. 5 seconds didn't work either. But 10 seconds works.

alias: "[Pool] Turn Off All Circuits (Two Statement with Delay)" sequence:

dieselrabbit commented 1 year ago

FWIW, if I take out switch turn off and just do the last light action, the lights remaining all turn off.

Well that confirms that the response it's giving (31) isn't being given to indicate a bad parameter in the request.

dieselrabbit commented 1 year ago

While I still intend to solve this for the script you created, as a work around you could try using the screenlogic.set_color_mode service to shut all lights off with one command to the protocol adapter.

service: screenlogic.set_color_mode
data:
  color_mode: all_off
target:
  device_id: [your_pentair_device_id]

You can get the device_id by using the Services Developer Tool and building the service call through the GUI.

American1337 commented 1 year ago

Nice to know about that service. tested it and screenlogic.set_color_mode works for lights all on/off.

dieselrabbit commented 1 year ago

Are you able to use that service call in your script for all the lights instead of the individual light.turn_off calls?

American1337 commented 1 year ago

Yes, I'm using color_mode for now.

I'm also keeping delays around the switch on/off actions, so it would be nice if this issue can be fixed but I'm unblocked for doing my integrations.

American1337 commented 1 year ago

Update: I used screenlogic.set_color_mode again today and it doesn't seem to work reliably. I've gone back to using individual light switch actions with 2 second delays between them.

How is screenlogic.set_color_mode implemented? Is it 1 message to the pentair system or multiple under the hood?

dieselrabbit commented 1 year ago

It's a single command to the pool controller. The different color modes represent different buttons in the Lights section. In your case, all_off should do the same as the All Off button in the Color Lights page in the app. Screenshot_20230129-220527~2.png

American1337 commented 1 year ago

Got it, I tried it again, it's always the same two lights that don't work with all_on. Even when I try with the app. Must be an issue with my pool config. I will follow-up with the pool system installers.

For now I will use all_on + two individual light actions.

image

dieselrabbit commented 1 year ago

Just released v0.7.0 of screenlogicpy which supports subscribing to push updates from the ScreenLogic system. This will reduce the request traffic and hopefully enough in your case to not overload the protocol adapter and have it return an unknown message code.

Changes to the Home Assistant integration to use this functionality are being finalized now. You can follow the progress here.

American1337 commented 1 year ago

Awesome. Thanks for the updates. I will follow other thread.

(P.S.: Note that I'm having some issues with the pentair systems unrelated to HA -> issue after power loss (circuit names missing). I am calling back the folks that installed it. Likely some latency before I will be able to re-test).

American1337 commented 1 year ago

I've installed the updates, I do see changes from the android pentair app reflected pretty much immediately in HA now, that's pretty cool.

However, the script to turn off all circuits without delays still fails. Attaching recent log.

Note that some of my sensors are missing now, but presumably it's because I just updated screenlogic and not the rest of HA. "Unable to prepare setup for platform screenlogic.binary_sensor: Platform not found (cannot import name 'EntityCategory' from 'homeassistant.const' (" shows in the logs.

dieselrabbit commented 1 year ago

Well the good news is that the amount of total requests during this sequence IS significantly reduced, as Home Assistant is not trying to refresh the state of the entire pool system with each circuit command.

The bad news is that we're still getting the same error code response (31) from the system after a number of commands, in this case it seems to be failing on the 15th command made in the span of 0.139sec. It takes a little over 6sec from the time the circuit commands are sent for an updated status message to be received from the pool.

It's possible this is some sort of rate limiting control on the protocol adapter. Assuming your ScreenLogic is set up with a wireless link between the pool controller and the protocol adapter as mine is, it's also possible that there's a rate limit to the wireless communication between them.

As for the missing sensors, yes it looks like it would be caused by a mismatch between the screenlogic integration code and the core Home Assistant code. There was a change 2 days ago that affected both portions and they would both need to be updated to stay in sync.

Despite not knowing why the protocol adapter is rejecting the commands or currently how to resolve it, this situation does highlight some improvements that can be made in the api library. For one, currently the api is not validating the response code for a request, only if the data in the response is what's expected. For commands that expect an empty response, that leaves the door open for a lot of false positives that a command was successful. Truely, Home Assistant should be throwing an error when those light commands fail, but the api think's they went fine. Secondly, if a command were to be detected as failed, there's currently no way to retry the command.

I have some ideas of how to resolve both of those, but even then I don't know if it would resolve this issue.

Curious if there was any result from the installer?

American1337 commented 1 year ago

History on the protocol adapter install:

Off topic, but I'm curious, is a battery backup for circuit names typical? Seems odd to me that they wouldn't put some simple non-volatile memory to persist names across power losses.


I've also realized the penair desktop app lets me change the circuit names myself, so in case of power lose in the future I will be able to restore the names.

Furthermore, I've fixed the all light offs command. The installer had only set the first 6 lights. If I add four lights to the second six, I'm able to get fix that other issue I had, now I can control all 10 light circuits with all_on/all_off.

image

dieselrabbit commented 1 year ago

Off topic, but I'm curious, is a battery backup for circuit names typical? Seems odd to me that they wouldn't put some simple non-volatile memory to persist names across power losses.

My EasyTouch board has a button cell battery mounted on it, similar to the CMOS battery on a motherboard. It's conceivable that the battery went dead and needed to be replaced. Happens on motherboards and the BIOS settings are lost.

I've also realized the penair desktop app lets me change the circuit names myself, so in case of power lose in the future I will be able to restore the names.

Hopefully you won't have to now that the battery was replaced.

Furthermore, I've fixed the all light offs command. The installer had only set the first 6 lights. If I add four lights to the second six, I'm able to get fix that other issue I had, now I can control all 10 light circuits with all_on/all_off.

Nice find. I had thought that as long as it was a light-type in circuit function the all_on/all_off would work, but looking in the ScreenLogic manual now, you're right. It clearly states that light circuits need to be added to the Special Lights section for lights to be affected by those commands.

dieselrabbit commented 1 year ago

FYI, the dev branch contains updates to retry failed requests if you are interested in trying it out.

American1337 commented 1 year ago

Nice, I will test it when I can. Unfortunately, my pool system is broken again. Not sure what happened but my circuit names got wiped out again, the installer is going to come review/fix.

dieselrabbit commented 1 year ago

Roger. That's a bummer. At your convenience.

American1337 commented 1 year ago

I've merged in the latest from core-dev/homeassistant/components/screenlogic and screenlogicpy-master/screenlogicpy and I'm still seeing issues when I don't put delays in the automation.

Attaching the latest log: home-assistant_2023-03-07T07-01-24.286Z.log

A few asides:

dieselrabbit commented 1 year ago

Thanks for following up on this.

Can you confirm that you are running the latest code for screenlogicpy? I'm not seeing any log output that indicates it is retrying failed (or rejected) requests. I just confirmed that there should be some output along the lines of: (code and request will be different for you)

Bad Parameter for request code: 12572, request: b'\x00\x00\x00\x00'. Will retry 1 more time(s) in 2 seconds

The dev branch has this, but master is now the absolute latest. v0.8.0 of the package on pypi should also work for testing, but does have a bug in it with keeping the connection alive via ping. I don't think you'll run into that though as you're requesting pump status at regular intervals.

American1337 commented 1 year ago

I'm not sure what happened earlier, I was pretty sure it had updated to the new version but perhaps not.

Reattaching log that does show "Will retry 1 more time(s) in 2 seconds"

home-assistant_2023-03-07T20-15-05.851Z.log

One of the pool lights were stuck on after running the off script without delays. Ignore pool circuit, the script doesn't turn it off.

image

dieselrabbit commented 1 year ago

Thanks. I do see the retry message in the logs now, and it appears that each retried request is succeeding. I'm not seeing any obvious failures of the retry mechanics. Could you send new diagnostic info now that the hardware has been updated? Would like to eliminate circuit numbers that are listed as succeeding from an updated list of your circuits.

American1337 commented 1 year ago

Attached config_entry-screenlogic-d548c3acdc7509787a8fd6744300fc1d.json.txt

dieselrabbit commented 1 year ago

Thanks. After looking at the data I have some questions:

  1. Does the log depict 2 runs of the turn-off script?
  2. If so, on the first run were there 2 lights left on? Spa light and one of the Pool Lights? Not that I see an indication of these failing, but these are the circuits that show a failure->retry->success sequence before a time-gap in the log.
American1337 commented 1 year ago

I'm not 100% sure as I redid the experiment 4-5 times, but I think the sequence for the log I posted was:

1.) first attempt -- all pass 2.) delay 3.) second attempt -- 1 light failed

dieselrabbit commented 1 year ago

Ok. It doesn't look like the fail->retry->success instances shown are of any bering to that last light remaining on. In fact, the final status update screenlogicpy received shows that the light remaining on wasn't any of the ones that went through the retry sequence.

What is really interesting is the second to last line in the log, which was also a status update received by screenlogicpy less than half a second before the final one. It shows the "problem" light as off. Yet less than half a second later it is reported as on, with no action from screenlogicpy. And both of these update messages arrive almost 7 seconds after the final command is acknowledged by the protocol adapter. In this case, it doesn't seem caused by screenlogicpy.

Ruling out you accidentally turning that circuit back on after the test, or the light actually being off and the protocol adapter just not sending the update in a timely manner, the only things I can venture a guess at is possible poor RF communication between the protocol adapter and the pool controller, or somehow the pool controller itself failed in turning off the circuit and then eventually reported correctly as such.

Either way, I don't have any evidence that screenlogicpy is not doing what it is designed to do. In fact, your log helped to verify for me that the retry mechanism is working in a real-world scenario. For that, I'm thankful. I'm sorry there isn't a better answer for you.

As such though, I'm inclined to close this issue at this time. In the future if there is evidence, in the logs or otherwise, that screenlogicpy is doing something bad to contribute to this issue, we can always revisit.

American1337 commented 1 year ago

I'm doubtful it's an RF link quality problem, the RF transmitter is about 15 ft from the receiver.

Regardless, I also expect the bug is on the pentair side and not in screenlogicpy. I'm okay closing this out as I'm unsure at this point what more you would reasonably do. If you don't have it already, you could consider a traffic stress test, i.e.: loop that sends high volume traffic to a pentair bridge and see if anything gets dropped. But even if you did that and found the link wasn't reliable I'm not sure what else would you do in screenlogicpy .

At this point, I have all the integrations I wanted working, and using all_off/all_on helped a lot. I put a few 500ms delays in the scripts to help avoid the dropped command issue but the delays are not noticeable.

Thanks for looking and getting all the fundamental issues resolved. It's has enabled some very cool integrations, for example, I have an engraved Lutron RA3 keypad for the pool features next to patio light scenes. It's much nicer than what we had before with the pentair remote outside or the phone app.

image

dieselrabbit commented 1 year ago

It's a reasonable thought about a stress test setup. I've been reluctant as I don't have very many aux type circuits that I'm comfortable with potentially turning off and on quickly, but maybe it's valid with repeated data requests.

I'm glad you have found a way to make it work. That scene setup is so clean. 😍 Thanks for sharing. Nicely done!