Behaviour of the 'retain' option, state.json file and publishing state on zigbee2mqtt restart

Koenkk / zigbee2mqtt

Zigbee 🐝 to MQTT bridge 🌉, get rid of your proprietary Zigbee bridges 🔨

https://www.zigbee2mqtt.io

GNU General Public License v3.0

12.15k stars 1.68k forks source link

Behaviour of the 'retain' option, state.json file and publishing state on zigbee2mqtt restart #3496

Closed uncle-fed closed 4 years ago

uncle-fed commented 4 years ago

I cannot understand the behaviour related to the 'retain' option and find almost no information online about how it is supposed to work to understand whether this is a bug or a feature.

The documentation states: "Retain MQTT messages of this device (default false)." and that's about as much explanation as you get. Retain it how / where? Is it about instructing MQTT broker to keep message "permanently" or is it about zigbee2mqtt's own "persistent cache"?

I am guessing (perhaps wrongly because the actual behaviour seems to contradict my guess) that if the retain option is set to true, then every time you get new message for a device, the values are sent to MQTT broker and additionally most recent values are stored in the state.json file. When you cold-start zigbee2mqtt, it reads the state.json file and basically re-publishes its contents to the broker again. Is this correct?

What should happen if I do NOT want this 'republishing on start up' happening? It seems that no matter what I do (setting 'retain: false' for each device individually, or in the _deviceoptions globally, the state.json file is still being written and updated for all of my devices (and it gets re-published to devices' topics on zigbee2mqtt restart).

I tried stopping zigbee2mqtt service, setting retain: false for all devices, erasing the 'state.json' file entirely from disk and starting zigbee2mqtt again. It is true - nothing will be republished first time you start zigbee2mqtt, when the state.json file is not there. But once new messages start arriving the state.json is getting populated again and a subsequent restart causes messages being republished for all of my devices.

Now, usually it would be not a huge deal (most sensors would get overwritten with newer data anyway), but one of the things that gets republished on the start up is the state of the Aqara occupancy sensor that for whatever reason in my case turned out to be true , triggering all kind of automation upon zigbee2mqtt restart (when there is no motion happening in reality).

So, to sum it up, it seems like in my case retain: true does not work as expected (if my expectations are correct, that is). The state.json file is getting written and updated even with retain set to false for all devices (individually or via the _deviceoptions section of the config file).

Running the latest released zigbee2mqtt on Raspbian with the Texas Instruments CC1352P-2 board:

{"version":"1.13.0","commit":"55e6283","coordinator":{"type":"zStack30x","meta":{"transportrev":2,"product":2,"majorrel":2,"minorrel":7,"maintrel":2,"revision":20190425}},"log_level":"info","permit_join":false}

Koenkk commented 4 years ago

Retain is an MQTT specific option, see https://mntolia.com/mqtt-retained-messages-explained-example/

What you are probably searching for is the cache_state option (https://www.zigbee2mqtt.io/information/configuration.html#configuration)

uncle-fed commented 4 years ago

Thank you, this is clearer now but only a bit. In fact it, it provokes some further questions.

The documentation page says: "_cachestate: MQTT message payload will contain all attributes, not only changed ones. Has to be true when integrating via Home Assistant (default: true)"

Does it follow from this that the state.json file's only purpose is to keep all attributes last values so that they can be sent together (if _cachestate is set to true)? Or is state.json used for any other feature of zigbee2mqtt?

Because if this file's purprose is only to enable the _cachestate: true, then I don't see how and why the existence of this file should force publishing its contents to the topics during zigbee2mqqt cold start. Keeping attributes together (and using state.json file for this purpose) is one thing (and is very useful in certain cases), but unconditional announcement of "some previously cached values" upon startup as "the actual current state" is a rather questionable, especially if this behaviour is only controlled by the mere existence of cached data.

Imagine (as it was in my case) that there was a power failure or network connection interruption for the device that runs zigbee2mqtt and then the services are restored after a while. We don't know for how long zigbee2mqtt was down. Could have been seconds, could have been hours. And yet the "random cached state" was published as the current one causing all sorts of wrong automation to kick in. It shouldn't happen this way just because the bridge software component (zigbee2mqtt) is being restarted. Should it?

Can we de-couple republishing of "last known state" on zigbee2mqtt startup from the _cachestate option? BTW, maybe somebody can explain to me how this publishing on startup is useful at all.

uncle-fed commented 4 years ago

Additional note about the 'retain' option since my confusion started with it.

If it is purely MQTT protocol specific option, shouldn't the global setting belong to the mqtt section of the configuration.yaml instead of the _deviceoptions ?

If the argument here that it is also a device specific option, then perhaps it should be made more obvious either in the documentation, or even better, by renaming these three options to:

mqtt_retain
mqtt_retention
mqtt_qos

or even

mqtt_retain
mqtt_v5_retention
mqtt_qos

Koenkk commented 4 years ago

Does it follow from this that the state.json file's only purpose is to keep all attributes last values so that they can be sent together (if cache_state is set to true)? Or is state.json used for any other feature of zigbee2mqtt?

It's also used for some other features (e.g. group behaviour, skipping state publish when device is already on).

Can we de-couple republishing of "last known state" on zigbee2mqtt startup from the cache_state option?

I"m not sure if this makes sense. Let's say we have an occupancy sensor with illuminance and occupancy and skip the republish on startup and still have occupancy: true cached. When the sensor sends the illuminance, the occupancy will also be republished which can be invalid at this point.

Additional note about the 'retain' option since my confusion started with it.

Retain can be set per message, so that's why it's configurable per device. In case you want to set it to true or false by default you can use device_options. https://www.zigbee2mqtt.io/information/configuration.html#changing-device-type-specific-defaults

uncle-fed commented 4 years ago

Thank you the answers to some of my questions. Somehow I feel that I am failing to convey the essence of my issue here. I apologise for touching several topics at the same time. Let me try and concentrate on the most important point for now (the rest of the questions may come later).

How is it useful to republish anything during zigbee2mqtt startup (except for maybe announcing zigbee2mqtt's own state)?

What problem does this re-publishing solve or what issue does it address?

How is it any good for a home automation system to be fed some values of undetermined age (essentially "any old values" once zigbee2mqtt is stopped and started) as if they were the actual / current state?

This is a genuine question, not a sarcasm or rhetorical question. I have not tried every home automation system that works with MQTT. Maybe there is some software that benefits from this that I am not aware of?

With my quite basic set up (Mosquitto + NodeRed + Influx/Telegraf) this feature of dumping cached data out as if it was the current state actually does "occasional harm" and only creates unwanted issues. The potential for wrongly invoked automation is basically there every time I have to stop and start zigbee2mqtt.

And I should add that of all components, zigbee2mqtt is the one that is stopped and started most often due to frequent new releases / updates (updates are great thing, thank you!). Running the update routine often takes many minutes on Raspberry Pi due to npm dependencies resolution and serial module recompilation. So when zigbee2mqtt returns to life the actual state of devices changed already multiple times (mostly sensors, of course). It makes zero sense to feed all topics with some old values in this case. Why??

I really fail to see how it can be any good to take a run-time cache (that is indeed very useful for many other different reasons) and suddenly turn its meaning upside down and treat it as the "source of truth for the system's state" during start up...

I would like to avoid getting the answer of a kind "well, if it is not useful for you, then don't use it" for two reasons:

1) currently in order to disable this 'feature' my only choice is to set cache_state: false, which will also disable another functionality that I actually find useful (even without HomeAssistant)

2) I want to understand why somebody would want to have republishing on startup enabled, because maybe I am just missing some basic understanding of "how things should work in principle".

What I am actually going to do now, to avoid this unwanted behaviour, is to use the following directive in systemd config, i.e. always force-delete state.json file on startup.

[Service]
ExecStartPre=rm -f /opt/zigbee2mqtt/data/state.json

I will still keep cache_state: true in my settings because I still find it useful for other reasons. Yes, it will take some time to rebuild the cache from events if it is gonna get emptied on restart. But so what? I would rather live with "unknown state" for a little while than with some fake/random state that may trigger unwanted behavior in my home.

Regarding this point:

Let's say we have an occupancy sensor with illuminance and occupancy and skip the republish on startup and still have occupancy: true cached. When the sensor sends the illuminance, the occupancy will also be republished which can be invalid at this point.

Sure, you're absolutely right. And that is exactly why you don't want ANY cache present during startup, even if it is enabled in the settings. It will be rebuilt eventually, just like it did when you started zigbee2mqtt the very first time on a new system. I don't suppose Home Assistant (that requires all attributes bundled together and caused the _cachestate: true option to be made for it) will explode or break if not all attributes are there from the start, will it? ;-)

Koenkk commented 4 years ago

Don't get me wrong, this is for sure a valid discussion. It puzzles me a bit that these questions haven't been asked before (because this is in from the start).

I think both persistent and non-persistent state have its pros. Hopefully this also answers some of your questions.

Persistent state

In case Zigbee2mqtt was only down for a few seconds, it's very likely that the cached values are still valid. Many users are running on Docker or Hassio (also Docker under-the-hood) where updates only take a few seconds. Having to rebuild the whole state every time this is done can become annoying. E.g. Xiaomi temperature devices only report their temperature once a hour, or even worse the battery level every 24 hours. This becomes even more annoying when you (automatically) update Zigbee2mqtt every day (through e.g. watchtower), however probably only a minority of the zigbee2mqtt users has this setup.
Home Assistant specific: errors are logged every-time when partial payloads are published (annoying for battery which can take 24 hours to be available again).

Non-persistent state

Guaranteed that the published state is correct, which can prevent automations from triggering due to outdated data.

I propose to add a new option cache_state_persistent which allows to control wether state is saved in state.json. Setting this to false will functionally do the same as you are currently doing with rm -rf. The default value should be true (because it doesn't seem to be a big deal for the majority of users).

sjorge commented 4 years ago

My 2 cents in the game, the current behavior is probably fine for a default. I'd rather have slightly out of date sensor data for example for a Xiami Temp/Humid/Pressure sensor than no data when I restart node-red.

48% humidity from 40m ago is probably still valid and if I missed an update, it's still better than having it uninitialized so it displays 0%...

Would be nice to toggle it per device though.

uncle-fed commented 4 years ago

It puzzles me a bit that these questions haven't been asked before (because this is in from the start).

It puzzles me too! :-) Before posting I really searched and searched to find anybody else with similar concern to no avail.

I think both persistent and non-persistent state have its pros.

I agree with all the pros for persistent and non-persistent cache that you've outlined very nicely.

I would note that most people who "don't suffer like I do" seem to base their argument on the fact that "it is useful to have persistent cache in case of quick restart". As if zigbee2mqtt is only "quickly restarted" and there is no other scenario when it can be down and then up again for longer than a second. What about power outage? What about system reboot due to OS patches? What about low spec systems such as Raspberry PI where it takes minutes and minutes to update zigbee2mqtt to new version even with fully automated scripts? I am not buying the argument about "quick restart" because I have a few of the systems running where from time to time it is required to bring down one component or the other for longer than a second or a minute or an hour, and when its zigbee2mqtt's turn, it sometimes results in a completely unwanted automation triggered.

If you only collect sensors data in a database to build beautiful graphs (I also do this), then of course cache / no cache / state / no state - who cares. But what if you have some "heavy" automation that depends on innocent events or sensor data? What if the automation is "open garage doors" ? Or "start a boiler" or a sprinkler outside? Or turning lights at 3am in your bedroom? Some of this may trigger really heavy actions that depend on innocent events that can get cached and then pop out in most undesired way. It is about system reliability when you start controlling more critical components with it...

So yes, please, the proposed _cache_statepersistent option would be very welcome.

However...

To keep the discussion complete, I would like to go back to one point that is still not being addressed.

OK, we have cache. Fine. Some people prefer to have it persistent. Also fine. We now even may get an option for it. Great!

But why do we want to publish cache contents during zigbee2mqtt start up? I am sorry to sound like a broken record but this question still remains unanswered...

Can we mentally separate, at least for the sake of this discussion, having persistent cache and dumping its contents into the broker when zigbee2mqtt starts. One should not automatically imply the other!

I do understand that if there is cached data and new attribute comes from a device, zigbee2mqtt will publish a message that would contain all cached attributes anyway. I do get this fully. But WHY FORCING THIS on start up? For all devices. Unconditionally.

Let zigbee2mqtt publish when new actual messages arrive (even together with cached attributes). We shall get them when we get there... But who or what requires all attributes from all devices in one go when zigbee2mqtt starts?

This potentially takes us to another lengthy discussion about state keeping in a distributed system, I know...

And here I would really argue that zigbee2mqtt is a bridge component and it should stay that because it performs its role beautifully. It should not pretend to act as a state keeper in a distributed system (exactly what it tries to do when it does "sending its cache out on startup"). Bridge software should try only to fulfil its role: to reliably deliver messages in a format that is understood by the parties on both ends. But what it does on startup is really "messing with the state". There are much better components designed for state keeping purposes with their own great persistence mechanisms in place.

I am talking about most primitive forms of state keeping, such as MQTT broker itself that can retain messages and republish them as "last known state" to each new client that subscribes to a topic (even if the message is not new, it can be kept with 'retain' option).

I am talking about more sophisticated forms too, like NodeRed that is perfectly able to keep its state in memory and persist it on disk and even in a proper database that it can connect to. (So I'm not really getting @sjorge argument about "not having any data in NodeRed when you restart NodeRed". It should not be zigbee2mqtt responsibility to keep the idea of state for NodeRed in case NodeRed is restarted, because NodeRed is able to take care about it's data persistence without any help of zigbee2mqtt).

I am talking about even more sophisticated software like OpenHab, HomeAssistant, etc. that act as a centrepiece of home automation and often have state coming not only from MQTT but also from other sources and APIs.

All those systems have been designed to act as state keepers in one way or another. And it is their responsibility to keep the complete system's state (persistent or not) the best way they see fit.

Zigbee2mqtt does not need to fill those shoes. It's role is to pass messages from raw hardware into something that is universally understood (i.e. MQTT) Let other components care about state keeping.

And if we agree on this, then the feature of publishing the contents of persistent cache on startup has no use and is in fact rather "evil" in a properly architected distributed system.

Koenkk commented 4 years ago

This is only done if the Home Assistant integration is enabled (so when homeassistant: true in configuration.yaml as Home Assistant itself does not persist the states across reboot. So an important thing to keep in mind that it's not the default behaviour but a Home Assistant specific workaround.

Similar to https://github.com/Koenkk/zigbee2mqtt/issues/3496#issuecomment-625357533 having to wait a hour before temperature values come in can be annoying. I just noticed this issue being created which is exactly the reason why this is done: #3511

Edit: I was wrong, it's not only Home Assistant specific. Actually there are points were zigbee2mqtt sends all cached states

On startup (always) -> for this one an option should be created e.g. cache_state_send_on_startup. The honest answer on why this is here is that I don't know why it has been added there (for what purpose) and can't think of a reason why it should be here.
When Home Assistant publishes to hass/status online (https://www.zigbee2mqtt.io/integration/home_assistant.html#mqtt-discovery) (only enabled when homeassistant: true)

uncle-fed commented 4 years ago

Thanks for confirming what I've been observing and tried to tell all along. It does not matter if homeassistant option is set to true or false. The cache is always sent out on startup.

Can we just discuss this a bit further before more options are created? I am not entirely sure that something like _cache_state_send_onstartup is actually needed.

I understand now that Home Assistant can benefit by getting some cached state from zigbee2mqtt, specifically when Home Assistant restarts with empty state and signals about that via hass/status topic. And it is perfect that there is already option controlling zigbee2mqtt behaviour in this case (homeassistant: true).

But this still does not address the "always publish zigbee2mqtt cache on startup". Is it maybe done because HomeAssistant running on the same host as zigbee2mqtt after OS reboot could start earlier than zigbee2mqtt and then it will never have a chance of getting last known state via hass/status? If yes, then this sounds very much like "publishing cache on startup" is yet another Home Assistant specific behaviour and thus it does not need yet another option controlling it. Sending cache out on startup should be just tucked under the already existing option homeassistant: true (just like you thought it should work in the beginning of your last comment). I think this probably was the original intention anyway but somehow it slipped through.

Would you agree to that rather than creating yet another option?

sjorge commented 4 years ago

then this sounds very much like "publishing cache on startup" is yet another Home Assistant specific behavior and thus it does not need yet another option controlling it.

People may still want this for none home assistant setups too.

I use it in node-red as well, although I could work around it in node-red if I really wanted to. But it would mean most of my flows will then need load/store logic to be added to them which would get complex very quickly.

Probably something I could fix with spending a weekend or so working on my flows though, so not the end of the world.

And sure, I could enable homeassistant integration to keep the old behavior but then I'd get a lot of non-interesting mqtt traffic that is specifically meant for homeassitant.

uncle-fed commented 4 years ago

Side note / slightly off topic / big IMHO / maybe food for thought.

It saddens me a bit that zigbee2mqtt has way too many Home Assistant specific options and related logic implemented for a software that is titled zigbee2mqtt. It is not zigbee2ha, is it? ;) And yet zigbee2mqtt it is trying quite hard to plug what I would rate as some rather basic design flaws in Home Assistant.

I appreciate that I should be careful when I say flaws because I never used Home Assistant (and now have even less desire to try it :-)) so maybe I am missing some core knowledge here. But I am merely approaching this topic from architectural point of view (something that incidentally I do as my daytime job) and I am treating our home automation world as a typical distributed system, because we have at least 3 components with asynchronous message flows (zigbee2mqtt, MQTT broker and some consumer like HA or NodeRed), but often there are even more (I am running NodeRed, InfluxDB, Grafana, Telegraf, my own web UI for it with server-side scripts, etc.).

So we do have a distributed system. Sometimes highly distributed system. We have independent message flows between the components (by nature of MQTT). We need to have state in order to act properly. Whose responsibility is to ensure state consistency and, if needed, persistence?

Surely, not the piece of software that does message conversion. This is the LAST place to look for anything state-like. And yet, this is exactly where we are looking when we are discussing that HomeAssistant specific stuff. Implementing very particular options and behaviours just to satisfy somebody else's... laziness?

Some may say - this is not laziness, the problem is there because many (most?) of the sensors' state cannot be queried directly. They only emit events (battery % reading is still an event) and state should always be constructed from those events. But the whole system design went somewhat weird when we talk about Home Assistant-like approach. It is great that zigbee2mqtt has very smart and responsive developer(s) behind it who try to "make things work" for others. But actually it feels like some border has been stepped over.

In a distributed system, it is a design flaw if a system component that is supposed to accumulate events/state and act on them (NodeRed, HomeAssistant, etc) cannot gracefully handle partial or full absence of some data attributes. And it is a design flaw to solve the above design flaw by forcing specific behaviours in another piece of software to "please" the component that is too lazy to do its job right.

This is against the core paradigms of microservice driven design. This is against well architected distributed framework, where each component does its job as a little black box without any knowledge of how other black-boxes do their job. The components speak common language (MQTT) and pass events to each other and that should be enough. No, enough is not the right word. Event should be maximum allowed level of abstraction for communication. Otherwise we are creating new application protocols on top of MQTT (bad).

When one component starts commanding very specific behaviours and implementations from another component we have what is known as tight-coupling, which is the root of all evil in a distributed system.

If Home Assistant (or NodeRed, or any other software) claims to have support for MQTT, and yet it demands that messages should come packaged in a certain way (like having all attributes together) or else it will create "errors" and "warnings", then this is not really a proper "support" for MQTT, this something like "MQTT+" (and software such as zigbee2mqtt has to conform with it too as the result).

If software needs to have "state" of other components to function properly and expects state to be served on a plate, then this is wrong design. In a distributed system there is no "state", there are only "events".

If a piece of software actually requires state - it's the job of that piece of software itself to construct state from events and hold on to it. And if it is important not to lose the state (because nobody can serve complete actual state of the entire system on demand) then again, it is the job of that piece of software that needs it, to ensure that the state is persistent enough for the needs of that component.

uncle-fed commented 4 years ago

I use it in node-red as well, although I could work around it in node-red if I really wanted to. But it would mean most of my flows will then need load/store logic to be added to them which would get complex very quickly.

You might find that with NodeRed this is much easier than you are probably thinking.

https://discourse.nodered.org/t/a-guide-to-understanding-persistent-context/4115

https://nodered.org/docs/user-guide/context#saving-context-data-to-the-file-system

Koenkk commented 4 years ago

To first clear this out, the resending of all cached states on startup is NOT something implemented for Home Assistant, actually Home Assistant even doesn't need this (given that only z2m is restarted home assistant already has the states).

Why was it implemented? I don't know. Ideally this shouldn't be the default behaviour, or this shouldn't be in at all. However given that removing it now would introduce a breaking change, an option to disable this seem to be the correct way forward. This is something that can be further addressed in Zigbee2mqtt 2.x.x.

The other republishing of all states is done when Home Assistant publishes online to hass/status. This is however not linked against the Zigbee2mqtt startup (it can happen at any time). Given that this is only done when homeassitant: true and online is published to hass/status, I think this is perfectly fine).

sjorge commented 4 years ago

You might find that with NodeRed this is much easier than you are probably thinking.

Spend about 3 hours trying to get that working, doesn't look like it works with subflows at all. So although it might be 'easy' it will result in a lot of time rewriting things especially if sub flows don't work, that would mean massive redesigns or using other complex setups with file in and out nodes.

Yes, I agree that my setup is kind of shit. But pretty sure most setup out there are not well designed and thought out. Doesn't mean it should not be removed from z2m... but it will cause a mass of issues with people with broken setups.

sjorge commented 4 years ago

I update one minor version and the behavior nor works in node-red, will still take a while to go over all my flows though.

I guess this is probably indeed a good candidate for a 2.0 release.

uncle-fed commented 4 years ago

doesn't look like it works with subflows at all.

Just a shot in the dark, since I have zero knowledge about your setup. What works easiest for persistence, including all sub-flows, is using 'global' context with persistence. You write to global just using global.<something> and you read from it using JSONata function $globalContext(variable, 'persistence'). This is if you want to avoid custom JavaScript functions in your nodes. I am able to get away with just using that for 99.9% of my rather complex logic with multiple subflows. Global context is available everywhere and you can just separate things logically in global context using sub-objects of a single 'global' object. For example, I have the following global objects: global.home.sensors.bathroom, global.home.switches.corridor, etc..

uncle-fed commented 4 years ago

Why was it implemented? I don't know.

Forgive me for lack of history knowledge. I joined the Zigbee + MQTT game not so long ago. I thought you were the one and only original author of zigbee2mqtt who might have known why it was done this way :-)

Ideally this shouldn't be the default behaviour, or this shouldn't be in at all.

Thank you! Finally we arrive to what I was trying to say in a rather complex way :)

removing it now would introduce a breaking change, an option to disable this seem to be the correct way forward.

Sure. And until then I'll just delete state.json before starting zigbee2mqtt which will give me the behaviour of the two not yet existing options: cache_state_persistent: false and cache_state_send_on_startup: false

uncle-fed commented 4 years ago

One question about how current state cache works. I have a wall switch that sends out action attribute when somebody presses the button. Same switch also sends battery reading as another independent event from time to time. Will the action attribute be "re-transmitted" together with the battery, if state cache is enabled? This would not make a whole lot of sense, would it? And if this does not happen, what and where makes sure that the action attribute is "not cached"?

sjorge commented 4 years ago

And if this does not happen, what and where makes sure that the action attribute is "not cached"?

I can answer this one: https://github.com/Koenkk/zigbee2mqtt/blob/23805c25af7b255f1094f76e8f4b6d6e67d29256/lib/state.js#L8

There are certain attributes that are not cached.

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

uncle-fed commented 4 years ago

Removing stale status. Hoping to see both discussed options cache_state_persistent and cache_state_send_on_startup implemented soon..

Koenkk commented 4 years ago

Sorry for taking so long, implemented both in the dev branch, docs: https://github.com/Koenkk/zigbee2mqtt.io/blob/develop/docs/information/configuration.md

Can you confirm & close the issue?

uncle-fed commented 4 years ago

I can confirm that both options function as expected. Thanks a lot! The only thing I would adjust slightly is the documentation.

I would change this:

# Optional: send cached state on startup, only used when cache_state: true (default: true)
  cache_state_send_on_startup: true

to:

# Optional: send cached state on startup, only used when cache_state_persistent: true (default: true)
  cache_state_send_on_startup: true

Because the enabled option cache_state_send_on_startup only make sense when both other cache options are enabled.

Koenkk commented 4 years ago

Done, thanks! Assuming this can be closed now.