Open Kodemikkel opened 2 years ago
If having one system in production with light then user real Zigbee light switches and binding them to the Zigbee Light groups so they is always working if the host system or internal internet is having problem. Implanting light "HA way" you can always getting problem and all is not working and with Light groups its only one device that is falling and not 100% of the system.
The reason you need repower the device and having joining enable is that the coordinator is have its frame counter for the network key is out off sync or the system have restored one old backup of its after coming back after the power problem => all devices in the network is blocking then they thing its one replay attack (normal Zigbee security).
But how come if I manually cut the power to the host, everything works fine? I've never had that issue before when my host loses and restores power.
And also, HA does not really have anything to do with this, as the Z2M docker is running completely separately and it was the Z2M docker that had the issue.
Edit: Added some more information in the reply.
I had something similar happen to my docker-based instance 3 days ago after rebooting my server. Zigbee2MQTT version 1.24.0-dev (commit #c49f546) zigbee-herdsman (0.14.20)
Had the same issue today :(
Really a pitty that the system is not self-healing.
Zigbee2MQTT version
1.24.0-dev commit: [f7c6207](https://github.com/Koenkk/zigbee2mqtt/commit/f7c6207)
Coordinator type
ConBee2/RaspBee2
Coordinator revision
0x26720700
Coordinator IEEE Address
0x00212effff0656b9
Frontend version
0.6.77
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 7 days
Same issue today. Power loss on both Z2M and lights / devices and a lot of devices gone when Z2M reboots. I have manually cut the power before without consecuences.
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 7 days
is this really stale? afaik there was no fix yet?
just had the same issue. Almost all devices gone after power outage, only 2 were present in z2m, both Hue motion sensors, the rest of 20 devices gone. I managed to get most of them back by resetting them but I have 4 Philips Hue outdoor lights that don't want to rejoin by themselves and I can't reset them. I'll need to remove them from the wall, get the serial number, add them to Philips Hue app and I think I'll keep them there, don't want to get the ladder out if a power outage happens again.
Using zzh stick. Also tried updating to latest coordinator firmware and keeping "Allow join" on all the time with the hope that some devices will rejoin by themselves but those 4 lights never did. Some IKEA buttons did rejoin when I pressed them but not all.
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 7 days
AFAIK not stale
What do you mean with "gone"?
What do you mean with "gone"?
- Are the devices missing from the z2m frontend OR
- Are non of the devices controllable?
Both
Yep for me it was also the same.. nearly all devices were not visible in Z2M frontend nor controllable in Homeassistent. I had also need to repair them.
Maybe there is some unwritten state in Z2M which creates an inconsistency when the Z2M system is killed hardly?
Did something special happen around during crash? Z2M won't just empty its data/database.db
file by itself
hmm i can not remember any special expect the power outage..
just checked what the database.db is.. and afaik this is just a json file? maybe this file just gets corrupted from time to time?
as no proper database is used the "database.db" is lacking corruption prevention. Maybe it would make sense to use something more robust here like a sqlite database?
further the last_seen is also stored in this file. so this file is going to have a lot write operations which could lead to a corruption while a power outage
What do you mean with "gone"?
- Are the devices missing from the z2m frontend OR
- Are non of the devices controllable?
Hi, sorry for the lack of clarity in my last response. The devices were present in z2m frontend but unreachable (all of them, 91 devices). None of the devices or groups were controllable. Coordinator were reflashed with same version but with no effect. Repairing all devices was necessary.
Just for your info, 80/91 devices are power line operated (not battery devices). Just in case that helps. I'm going to search if i have Database.db to share it in this conversation.
The workaround i have deployed is to connect Zigbee2mqtt machine to an UPS š¤£. At this time i think that z2m was writting just in the right moment and got corrupted. Only if i find the file we will know.
Thank you for your time
@eloo
as no proper database is used the "database.db" is lacking corruption prevention.
there is some corruption prevention, the db is first written to a temp path and then renamed (https://github.com/Koenkk/zigbee-herdsman/blob/f1c6a3887e9d7a763e9ec981543881716c75c5ff/src/controller/database.ts#L75). I agree that sqlite may be a better option but its also more complicated (and we are not sure yet this causes the issue).
further the last_seen is also stored in this file. so this file is going to have a lot write operations which could lead to a corruption while a power outage
not every last_seen state will rewrite the db, this is done occasionally
@adelaiglesia what did you see in the log when sending messages to the devices?
I don't know if it's the same but I describe my situation: stop ( maybe ungraceful ) of the container then all sensor are still paired but they receive no signal. I have only aquara devices: windows, temperature and water. Only the windows sensors when toggled works again; for the temperature/water sensors to work again I need to press the button ( not re-pair )
I just had the same problem after a short power outage Z2M no longer showed any devices in the web interface, however the log looked normal. The new database.db only contained 3 lines, 1 for the coordinator and 2 empty groups:
{"id":1,"type":"Coordinator","ieeeAddr":"0x...." ... }
{"id":2,"type":"Group","groupID":1,"members":[],"meta":{}}
{"id":3,"type":"Group","groupID":2,"members":[],"meta":{}}
All devices were gone. Luckily I was able to restore a backup from Home Assistant which contained the database.db. After restoring and restarting the addon all worked again no need for a lengthy repairing of all devices.
@tripplet okay.. that makes is more clear that the problem seems to be related to the database.db as restoring will fix it.
@Koenkk maybe as a quickfix the database.db can be duplicated every time? so maybe the old version will be just renamed with .bak or something like this? so we can easily restore every time
I will check if I can come up with an easy recovery solution. Something like:
on save of db:
database.db.bak
as you suggesteddatabase.db
(if this only completes partially we get this issue)database.db.bak
file if present@Koenkk sounds like a good solution. i also like the idea of the self healing check š
Checking the code again, I do not think corruption is the issue here.
database.db.tmp
(this is kinda the backup already)database.db.tmp
-> database.db
)If I read https://stackoverflow.com/questions/7054844/is-rename-atomic:
But
rename()
is still atomic in a very important sense: if you use it to overwrite a file, then you will end up with either the old or the new version and nothing else.
So either the old db is there or the new one, at least not a partial one like in https://github.com/Koenkk/zigbee2mqtt/issues/11759?notification_referrer_id=NT_kwDOACwkNbIzMjMzOTcyNjQ0OjI4OTI4NTM¬ifications_after=Y3Vyc29yOjI1#issuecomment-1217976444 .
I think the bug here is that z2m clears the database for some reason. But to debug this I need the herdsman debug logging of the run where the db is cleared
See https://www.zigbee2mqtt.io/guide/usage/debug.html on how to enable the herdsman debug logging. Note that this is only logged to STDOUT and not to log files.
Interesting seems to be a very rare problem as I have been running Z2M for years now without problems on two system and this was the first time it appeared. Admittedly there were only a handful times where there was a power outage or I had to hard reboot the system.
Could it be that the previous write in https://github.com/Koenkk/zigbee-herdsman/blob/6d43a67eefc5dd97641014c11a5295c3b6e518be/src/controller/database.ts#L82 was not synced to disk (still in some write cache) but the rename made it to the disk journal before the power outage as it is atomic. See https://stackoverflow.com/a/22823144 Not quite sure how this is handled in filesystems :shrug: (For reference Iām using XFS).
Maybe a https://nodejs.org/api/fs.html#fsfsyncfd-callback is necessary before the rename is executed? Not sure if this would slow down some execution?
So either the old db is there or the new one, at least not a partial one like in ...
The new database.db looked syntactically Ok It was like Z2M decided for some reason start from scratch, but no idea why, could be because of a corrupted file but might have other causes.
With a large enough user base even the rares errors start to appear at some point :smile: Thanks for looking into the problem
@Koenkk okay.. but maybe having a persistent backup would also be a good a idea so we could restore our instances if the bug happens agains. at least renaming a file is much easier than repair all devices :D
Note that this is only logged to STDOUT and not to log files.
i guess is going to be problem.. as we see the issue when an power outage happens. then the stdout is also lost
@eloo in that case there should be the database.db.tmp
file
i guess is going to be problem.. as we see the issue when an power outage happens. then the stdout is also lost
With this I mean the logging of the startup after the power outage (not before)
Same issue just happened to me. running unraid with z2m docker. I had to hard reboot my server and lost all my devices. DB is empty and z2m is showing 0 devices connected.
I just ran into the same issue. I'm also running the zigbee2mqtt docker container on UnRaid as the person above me and a couple other people (don't want to unnecessary tag anyone) have also mentioned.
I actually only recently seen this behavior (a couple times now), since I started to use the CC2652RB instead of the CC2531 which I switched to about a month ago. I'm not sure if that is related at all.
The last line in my log before the outage is the following (so nothing out of the ordinary):
debug 2022-09-10 04:18:55: Saving state to file /app/data/state.json
After the outage I noticed that zigbee2mqtt wasn't running at all, because I had renamed a device in the UI yesterday, but forgot to adjust its name in the availability_passlist
, so it didn't run for about 5 hours. Now obviously this isn't related to the issue at all, this was just my mistake, but maybe the length of the outage could be related (I honestly have no idea, just putting this out there).
After I changed its name in the availability_passlist
in my configuration.yaml
zigbee2mqtt booted up as expected, but no devices were to be found. There is something interesting in the log though:
Using '/app/data' as data directory
Zigbee2MQTT:debug 2022-09-10 09:35:53: Loaded state from file /app/data/state.json
Zigbee2MQTT:info 2022-09-10 09:35:53: Logging to console and directory: '/app/data/log/2022-09-10.09-35-53' filename: log.txt
Zigbee2MQTT:debug 2022-09-10 09:35:53: Removing old log directory '/app/data/log/2022-09-03.15-12-32'
Zigbee2MQTT:info 2022-09-10 09:35:54: Starting Zigbee2MQTT version 1.27.2 (commit #7dc48fb)
Zigbee2MQTT:info 2022-09-10 09:35:54: Starting zigbee-herdsman (0.14.53)
Zigbee2MQTT:debug 2022-09-10 09:35:54: Using zigbee-herdsman with settings: '{"adapter":{"concurrent":null,"delay":null,"disableLED":false},"backupPath":"/app/data/coordinator_backup.json","databaseBackupPath":"/app/data/database.db.backup","databasePath":"/app/data/database.db","network":{"channelList":[11],"extendedPanID":FOO,"networkKey":"HIDDEN","panID":BAR},"serialPort":{"path":"/dev/serial/by-id/usb-Silicon_Labs_slae.sh_cc2652rb_stick_-_slaesh_s_iot_stuff_00_12_4B_00_25_9B_66_57-if00-port0"}}'
Zigbee2MQTT:info 2022-09-10 09:36:04: zigbee-herdsman started (resumed)
Zigbee2MQTT:info 2022-09-10 09:36:04: Coordinator firmware version: '{"meta":{"maintrel":1,"majorrel":2,"minorrel":7,"product":1,"revision":20220219,"transportrev":2},"type":"zStack3x0"}'
Zigbee2MQTT:debug 2022-09-10 09:36:04: Zigbee network parameters: {"channel":11,"extendedPanID":"FOO","panID":BAR}
Zigbee2MQTT:info 2022-09-10 09:36:04: Currently 0 devices are joined:
Zigbee2MQTT:warn 2022-09-10 09:36:04: `permit_join` set to `true` in configuration.yaml.
Zigbee2MQTT:warn 2022-09-10 09:36:04: Allowing new devices to join.
Zigbee2MQTT:warn 2022-09-10 09:36:04: Set `permit_join` to `false` once you joined all devices.
Zigbee2MQTT:info 2022-09-10 09:36:04: Zigbee: allowing new devices to join.
Zigbee2MQTT:info 2022-09-10 09:36:05: Connecting to MQTT server at mqtt://192.168.1.10:1883
Zigbee2MQTT:debug 2022-09-10 09:36:05: Using MQTT anonymous login
Zigbee2MQTT:info 2022-09-10 09:36:05: Connected to MQTT server
Zigbee2MQTT:info 2022-09-10 09:36:05: MQTT publish: topic 'zigbee2mqtt/bridge/state', payload '{"state":"online"}'
Zigbee2MQTT:info 2022-09-10 09:36:05: Started frontend on port 0.0.0.0:9442
Zigbee2MQTT:info 2022-09-10 09:36:05: MQTT publish: topic 'zigbee2mqtt/bridge/state', payload '{"state":"online"}'
Zigbee2MQTT:debug 2022-09-10 09:40:53: Saving state to file /app/data/state.json
It mentions this file: /app/data/database.db.backup
as the databaseBackupPath
which isn't there at all. The coordinator_backup.json
also mentioned in the log does exist however.
After the startup my database.db
looked like the following:
{
"id": 1,
"type": "Coordinator",
"ieeeAddr": "0x00124b00259b6657",
"nwkAddr": 0,
"manufId": 0,
"epList": [
2,
1
],
"endpoints": {
"1": {
"profId": 260,
"epId": 1,
"devId": 5,
"inClusterList": [],
"outClusterList": [],
"clusters": {},
"binds": [],
"configuredReportings": [],
"meta": {}
},
"2": {
"profId": 257,
"epId": 2,
"devId": 5,
"inClusterList": [],
"outClusterList": [],
"clusters": {},
"binds": [],
"configuredReportings": [],
"meta": {}
}
},
"interviewCompleted": true,
"meta": {},
"lastSeen": null,
"defaultSendRequestWhen": "immediate"
}
Note: the epList
and endpoints
contain more, but I removed them to not make it too long.
Also... there is no database.db.tmp
at all. It would've been helpful if there was an actual backup file (e.g. database.db.backup
as mentioned in the log or the database.db.tmp
file as mentioned in this thread), which I could simply rename as some people have mentioned previously.
I now have to pair all of my devices again which is the third time in about a month (since I've gotten the CC2652RB). I might even wait for a possible fix to do that as it is quite annoying to get to some of the devices.
Actually, this just happened to me after updating from 1.26 to 1.27.2, using Docker containers and no power outage or reboots/hangs/anything else.
In fact, server is connected to UPS so this isn't even a possibility.
Somehow, it just happened with two devices at this home (Tuya WHD02/TS0001)
But it happened to another home where I have more Zigbee devices when updating from 1.25 to 1.26, one dimmer was working just perfect and right after the update, it was just gone, as said, with no power cuts or anything else.
I was thinking about some kind of syncing/changes made on the DB at the moment of updating the Docker container. :shrug:
Had the same issue the other night. Short power outage made my server reboot and only my Philips Hue motion sensors appeared, after they had detected motion.
Rolled back VM snapshot and everything was back to normal and I could finally turn off all the lights that turned on when the power came back. šµāš«
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 7 days
not stale AFAIK
Just got the issue after power outage too, so stressing, luckcly i had a backup from a few days back.
Note : my database.db.backup was only 4ko, when my backuped database.db was 43ko !
Happened yet again. Sigh.
Just had a power outage too. In my case the database was ok, but all z2m devices were offline and couldn't be controlled.
Happened to me now twice too. Database was corrupted, and configuration.yaml isn't used to restore anything.
@skinkie can you provide me an example of how the data/database.db looked?
@skinkie can you provide me an example of how the data/database.db looked?
For me ever time it looked brand new !
I've added the fsync
call before the rename as suggested by @tripplet. Let's see if it still occurs after this.
Changes will be available in the dev branch in a few hours from now. (https://www.zigbee2mqtt.io/advanced/more/switch-to-dev-branch.html)
I saw this behaviour with 1.28.4 yesterday. Migrated from ZHA and had issues, zigbee2mqtt crashed many times during device pairing (the web front end stopped to acknowledging to pairing attempts and finally noticed the backend was down.
At one point, I had around 5 devices paired and backend crashed - when restarted, the devices were gone. Thus the service automatic restart function is not working either - had to manually start.
I had a very similar situation yesterday and while I had no power outage as far as I'm aware, I start to wonder if it's not somehow related : https://github.com/Koenkk/zigbee2mqtt/issues/15868
Hi, I just ran into a similar issue on 1.28.1 running in docker. I restarted the container and the WebUI was now depopulated of all devices and most other info (version numbers in about, map not working, .etc). Though looking at the log, it appears that devices are still paired and transmitting state and commands correctly. Here's what I tried to no avail:
Current version of Z2M: 1.31.2 commit: 21f51258 Conbee II, Firmware: 0x26580700
After a restart (requested through either the webinterface, or systemd restart for the Z2M process) Most devices are off-line. - not forgotten... just off-line and the need to be repaired. Devices that report are more likely to return than others. Devices that were unavailable (mains device offline by being turned off from the mains) also have a better chance of returning.
44 devices, 31 using mains, 12 on battery, one never bothered to report either battery/mains (does run on battery).
Is it possible that a single status request that for some device gets lost in the traffic during startup causing disabling the device somehow? Also observed (once noticed, no complete trackrecord) there is a failed poll in the log files BEFORE the stick has been registerd.
info 2023-06-27 10:29:17: Logging to console and directory: '/opt/zigbee2mqtt/data/log/2023-06-27.10-29-17' filename: log.txt
warn 2023-06-27 10:29:17: Failed to ping 'innr_plug1' (attempt 1/1, Read 0x18fc260000051121/1 genBasic(["zclVersion"], {"sendWhen":"immediate","timeout":10000,"disableResponse":false,"disableRecovery":true,"disableDefaultResponse":true,"direction":0,"srcEndpoint":null,"reservedBits":0,"manufacturerCode":null,"transactionSequenceNumber":null,"writeUndiv":false}) failed (no response received (17)))
info 2023-06-27 10:29:17: Starting Zigbee2MQTT version 1.31.2 (commit #21f51258)
info 2023-06-27 10:29:17: Starting zigbee-herdsman (0.14.117)
info 2023-06-27 10:29:17: zigbee-herdsman started (resumed)
info 2023-06-27 10:29:17: Coordinator firmware version: '{"meta":{"maintrel":0,"majorrel":38,"minorrel":88,"product":0,"revision":"0x26580700","transportrev":0},"type":"ConBee2/RaspBee2"}'
Additional: database.db entry for such a device that stubornly doesn't come back...
{
"id": 22,
"type": "Router",
"ieeeAddr": "0x847127fffea9ccab",
"nwkAddr": 10923,
"manufId": 4644,
"manufName": "ROBB smarrt",
"powerSource": "Mains (single phase)",
"modelId": "ROB_200-004-0",
"epList": [
1,
242
],
"endpoints": {
"1": {
"profId": 260,
"epId": 1,
"devId": 257,
"inClusterList": [
0,
3,
4,
5,
6,
8,
2821,
4096
],
"outClusterList": [
25
],
"clusters": {
"genBasic": {
"attributes": {
"modelId": "ROB_200-004-0",
"manufacturerName": "ROBB smarrt",
"powerSource": 1,
"zclVersion": 3,
"appVersion": 0,
"stackVersion": 0,
"hwVersion": 1,
"dateCode": "NULL",
"swBuildId": "2.5.3_r51"
}
},
"genOta": {
"attributes": {
"currentFileVersion": 51
}
},
"genOnOff": {
"attributes": {
"onOff": 0
}
},
"genLevelCtrl": {
"attributes": {
"currentLevel": 69,
"onLevel": 255
}
}
},
"binds": [
{
"cluster": 6,
"type": "endpoint",
"deviceIeeeAddress": "0x00212effff06747e",
"endpointID": 1
},
{
"cluster": 8,
"type": "endpoint",
"deviceIeeeAddress": "0x00212effff06747e",
"endpointID": 1
}
],
"configuredReportings": [
{
"cluster": 6,
"attrId": 0,
"minRepIntval": 0,
"maxRepIntval": 3600,
"repChange": 0
}
],
"meta": {}
},
"242": {
"profId": 41440,
"epId": 242,
"devId": 102,
"inClusterList": [
33
],
"outClusterList": [
33
],
"clusters": {},
"binds": [],
"configuredReportings": [],
"meta": {}
}
},
"appVersion": 0,
"stackVersion": 0,
"hwVersion": 1,
"dateCode": "NULL",
"swBuildId": "2.5.3_r51",
"zclVersion": 3,
"interviewCompleted": true,
"meta": {
"configured": 1461352984
},
"lastSeen": 1687851057212,
"defaultSendRequestWhen": "immediate"
}
Given the lack of new reports I think the fix works and this can be closed?
I avoided all updates until now, i will check next weekend, report or close whatever is appropriate.
No power outage...., update through update.sh
After update: (& waiting half an hour): All devices that send measurements (temperature, power usage, motion) are Online All switchable devices (lamps, relais) are gone. (switches that give a power reading are present).
Repairing of some devices has issues... (from 5 lamps 4 would re-connect after reset, 5th lamp no reconnect). One hour later no change.
Plugs/Switches are reported to be online, still giving errors:
Failed to read state of 'frients_plug1' after reconnect (Read 0x0015bc002f013410/2 genOnOff(["onOff"], {"sendWhen":"immediate","timeout":10000,"disableResponse":false,"disableRecovery":false,"disableDefaultResponse":tue,"direction":0,"srcEndpoint":null,"reservedBits":0,"manufacturerCode":null,"transactionSequenceNumber":null,"writeUndiv":false}) failed (no response received (48)))
This is equivalent for all plugs.
What happened?
After waking up from a night's sleep, I noticed in HA that some devices were unavailable. After some more looking around, I found all my Z2M devices were unavailable.
Heading over to my Z2M docker container running on an Unraid host, I saw none of my devices were listed anywhere, and "Devices", "Dashboard" and "Map" were all blank, although my settings were still the same. I do suspect that there was a power outage while I was sleeping, as all the lights had turned on and after booting my PC, there were some indications that it had lost power. After power cycling a device I had nearby and setting Z2M to allow joining, the device appeared as previously configured without needing to change anything manually.
Now I have power cycled/reset all my devices, and they are all fully functional as they were before. What is weird is that I have, on several occasions, unintentionally, removed power from my running Unraid host without any issues at all. This leads me to my questions:
I am lucky to only have about 30 devices on my network, and resetting them doesn't take that long, although it is still boring and tedious. I could only imagine how it would be for someone with 100s of devices on their network.
What did you expect to happen?
Not losing all my devices after a power outage.
How to reproduce it (minimal and precise)
No idea. As mentioned my host has lost power several times earlier without issues.
Zigbee2MQTT version
1.23.0-dev commit: afe94a7
Adapter firmware version
0x26720700
Adapter
ConBee2
Debug log
07MAR22 11:20:18.txt 08MAR22 08:30:21.txt
The first log is dated 07MAR22 11:20:18 and I assume the last line is right before the power is lost. (I can't find any useful information in this) The second log is dated 08MAR22 08:30:21 and I assume it would be from when the power came back. The shutdown at the end of this log is me trying to restart it.