Expose GRACE_PERIOD_FOR_STATE_UPDATE as a service parameter

rtclauss commented 1 year ago

Checklist

[X] I have filled out the template to the best of my ability.
[X] This only contains 1 feature request (if you have multiple feature requests, open one feature request for each feature request).
[X] This issue is not a duplicate feature request of previous feature requests.

Is your feature request related to a problem? Please describe.

I would like to use retry to ensure that the lights are turned off when I start watching a TV series. I use a relatively long transition time of 30 seconds to get a nice dimming effect. However, when light.turn_off is called, the lights are immediately marked as off even though they are still transitioning to off. Due to some Zigbee flakiness in the network, not all of the lights turn all the way off. The retry code appears to check immediately whether the entities are in the expected state. Because of this immediate check and the light.turn_off behavior, the lights appear to be off when they actually are not, so retry does not try again.

Describe the solution you'd like

I'd like an optional state checking grace period on the retry.actions and retry.call services. This way I can verify that the lights are well and truly off.

Describe alternatives you've considered

I could remove the lighting transition, but that would mean losing a nice feature that I'd like to keep in my automated smart home.

Additional context

Here's some sample yaml of what I'm trying to do:

```yaml
- service: retry.actions
  data:
    sequence:
      - service: light.turn_off
        target:
          entity_id:
            - light.basement_group
            - light.living_room
            - light.office
            - light.kitchen
        data:
          transition: 30
    retries: 10
    expected_state:
      - "off"
```

amitfin commented 1 year ago

Here is the explanation of GRACE_PERIOD_FOR_STATE_UPDATE:

The expected state is verified immediately after calling the service. If the state differs from the expected one, the attempt is considered a failure, and the retry loop continues.
However, there are cases where it takes some time for the new state to propagate. In the more common case, the new state is updated as part of the service call, but this depends on the implementation.
For this edge case (when it takes some time for the new state to propagate) there is a "grace period". When the immediate check of the expected state fails, we sleep for the "grace period" and then check the state against the expected one again. If they match after the "grace period", we mark the iteration as a success and the inner service is not called again.

None of that seems to be related to the scenario described above, as the state does get updated immediately. Unless I'm missing something, I don't think that increasing GRACE_PERIOD_FOR_STATE_UPDATE (via a new service parameter) can help here.

It seems that the root cause here is related to the implementation of the "transition" parameter by the specific IoT device and/or the integration logic. It starts the dimming, but something prevents its completion. Since the light state is already "off", there is no way to know whether the dimming process completed successfully. One possible workaround is to call light.turn_off again (without the transition parameter) after 30 seconds. Lights which did not finish the transition properly will be forced off by this extra call.
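
The workaround above could be sketched roughly as follows. This is only an illustration that reuses the entity names from the original request: it dims with the 30-second transition, waits for the transition to elapse, and then sends a plain light.turn_off with no transition. The `delay` step and its exact duration are assumptions about how one might sequence the two calls, not something prescribed by retry.

```yaml
# First call: start the 30-second dimming transition.
- service: light.turn_off
  target:
    entity_id:
      - light.basement_group
      - light.living_room
      - light.office
      - light.kitchen
  data:
    transition: 30
# Assumption: wait for the transition to finish before forcing the lights off.
- delay:
    seconds: 30
# Second call: no transition, so any light that stalled mid-dim is turned off directly.
- service: light.turn_off
  target:
    entity_id:
      - light.basement_group
      - light.living_room
      - light.office
      - light.kitchen
```

Either light.turn_off call could also be wrapped in retry.actions as in the original request, but since the state already reads "off", the expected_state check would not detect a stalled transition.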

amitfin commented 12 months ago

I'm closing it based on the above explanation. @rtclauss , please feel free to re-open if you think something was misunderstood.

rawnsley commented 9 months ago

I think I may need the same feature, but for a different reason. I'm using Retry to control some TRVs in my heating system because they regularly ignore/miss instructions sent directly from HA. I don't think it is a signal issue because they sit right next to the Zigbee coordinator. Whatever the reason, Retry seems a good way to ensure the instructions eventually get through. However, another quirk of these devices seems to be that when you send a command, the reported value flips back and forth between the old and new state even if the command has not been accepted. I suspect that this is confusing the confirmation logic in Retry and that a longer grace period between setting and sampling might be needed.

I appreciate exposing this value as a configuration parameter increases the complexity for the end user, but it seems very much on-brand for Retry: keep doing this thing until it gets done.

I would also accept the ability to set this globally rather than per-automation; the current default of 200ms is more suitable for air traffic control than home heating automation.

amitfin commented 9 months ago

@rawnsley, can you point me to the specifics of your device and its HA integration? It seems that the proper way to handle this scenario is to fix the integration:

> the reported value flips back-and-forth between the old and new state even if the command has not been accepted

I'm not sure that using retry as a workaround for such a nondeterministic behavior is the best approach.

Did you try working with the integration's owner?

amitfin commented 9 months ago

@rtclauss, @rawnsley, v2.5.0 adds a state_grace parameter to control the grace period of the expected state check.
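
As a rough illustration of how the new parameter might be used, the sketch below adds state_grace to a retry.actions call. Several things here are assumptions rather than facts from this thread: that state_grace sits alongside retries and expected_state, that its value is given in seconds (which would be consistent with the 200ms default mentioned above), and the entity name `switch.example_slow_device`, which is purely hypothetical. Check the retry README for the authoritative syntax.

```yaml
- service: retry.actions
  data:
    sequence:
      - service: switch.turn_on
        target:
          entity_id: switch.example_slow_device  # hypothetical entity
    retries: 10
    expected_state:
      - "on"
    # Assumption: value in seconds. If the immediate state check fails,
    # wait this long and check once more before counting the attempt as failed.
    state_grace: 2
```

As explained earlier in the thread, the grace period only matters when the immediate check fails, so it would not change the outcome for lights that report "off" as soon as a transition starts.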

rawnsley commented 9 months ago

@amitfin thanks for the update - very much appreciated. Sorry I didn't get back to you about the misbehaving integration, but it's also something I will keep looking into.
