Open eriklindqvist opened 4 years ago
[ozw.mqtt.publisher] [debug]: Publishing Event valueAdded: 562952802893846
I'm gathering that the hundreds of these messages, the number is changing?
If so, yes, I have a idea. We are not yielding to allow the actual network processing to happen. If you can confirm the above, the fix should be simple.
Yes, exactly. The numbers are changing for every row.
Also, it's not only valueAdded
, it can be all sorts of different z-wave stuff, such as nodeGroupChanged
, valueRefreshed
etc.
Same issue here. It also happens sometimes on startup of the qt-ozw-contianer. Tried to adjust the keepalive_timeout in mosquitto.conf, but mosquitto did not like that, and would not start.
It seems like when the mqtt-disconnect happens, the ozwdaemon is stuck using 100% CPU. I thought it would exit (and restart the ccontainer?)
I am also experiencing this issue, occurring when the ozw container is starting up. Resulting in a client timeout and disconnection.
In the previous zw1.4, ozwlog shows I have a very chatty network so assuming this is causing too many messages, and ozw is delayed in sending a keepalive to mqtt.
As with OP, the only way for the ozwd to remain connected to mqtt (not getting timeout) is to trash the ozwcache file.
Any progress on this one? I have a large network as well and can't get past the initialization step without this timeout and eventually shutdown of ozwdaemon.
1602753420: Client qt-openzwave-1 has exceeded timeout, disconnecting.
@sirfooey I can't find any ozwcache file, where is it located?
@jlengq Check your OZW container volume location, it should be right there.
Thanks, I found it! Does unfortunately not solve my problem ,
I have 100+ nodes and the interview process launched at startup seems to choke the MQTT network somehow, resulting in the timeout. Deleting the ozwcache only seems to start the whole process over again?
@jlengq yes, trashing the file starts the whole discovery process again (no loss of any node data); but sometimes that's what is needed to get it fully operational. Some people have reported better stability with Build 150 (docker pull openzwave/ozwdaemon:allinone-build-150), might be worth trying that as well.
I've now moved from qt-openzwave to zwave2mqtt. I noticed that during some operations while the controller is waiting for replies and they take a long time to come (and timeout, usually), z2m also shows as "disconnected", but after a while it reconnects. This is probably what is making z2m more reliable: it reconnects to mqtt.
Has anyone tried increasing the MQTT client timeout in ozwd to see if that works around the problem? Maybe increasing the timeout would allow ozwd to finish whatever it's doing before disconnecting (assuming it's not in an infinite loop).
I think it should be as simple as adding a call to this->m_client->setKeepAlive(360);
in the code below. That would set the timeout to 3 mins instead of the default 1 minute. Adjust as necessary or use preferably set via an environment variable.
Not sure if there's any downside to increasing the time besides not reacting as quickly for real timeouts.
So far with limited testing, I can confirm with @kpine's suggestion, ozw is able to start up with a pre-existing ozwcache file, whereas in the past, I would 100% of the time get timeout disconnects until I deleted the ozwcache file.
Just added a PR for that. I still think there are better fixes to be made, but at least it will make the daemon start up.
I too have encountered this issue. I have rebuild my entire HA setup around using Docker such that I can test if turning off logging and what not improves the situation (which it did, but it still fails to talk to MQTT sometimes).
I also took a look at how the ping/pong works and from what I can tell it's all built into qtmqtt using a timer, the only reason I can imagine that the timer wouldn't fire is if the event loop was blocked.
I am having the same issue, 100+ nodes ozw daemon won't stay up.
I have just added additional devices to my zwave network. I am now having the same issue. ozw daemon goes offline during startup.
So I spent an ungodly amount of time figuring out how to get a local debug build of ozwdaemon running on my MBP. I can't exactly reproduce the issue on my MBP (presumably because its too fast to trigger the issue), but what I do see is that the MQTT timer for pings is being invoked as expected (I also set up a 1s timer which triggers precisely on time). I also confirmed that blocking the event loop for a period of time definitely causes the MQTT timer not to be invoked. After digging into the code a bit, it looks like OZWNotification schedules it's events to be processed by the main thread using Qt::QueuedConnection, I have a suspicion that what's going on is that Open-Zwave is generating events so quickly that the Pi cannot keep up on the main thread. This causes the queue of queued signals to become saturated with queued events from OZW, which leads to the timer not being able to get in to fire. Something else that makes this seem likely is that I did some profiling and a huge amount of time is spent serializing events for MQTT, which is likely what is making each event take so long to be processed on the main thread. Assuming this is an accurate assessment, I see a couple of paths forward.
What do you think @Fishwaldo ?
I totally agree that #185 is just a workaround, and for me, yout option 3 seems like the best one. Unfortunately I don't have enough experience with C++ and QT to help.
Regarding option 2, there has been some discussion in another issue (could not find it here and now), and a problem is apparently that it is hard to keep track of the states (and what messages in either direction that might be lost) if you just do a MQTT-reconnect without restarting OZW at the same time.
I really hope fishwaldo is well, and will be back from his involuntary break soon...
So I spent an ungodly amount of time figuring out how to get a local debug build of ozwdaemon running on my MBP. I can't exactly reproduce the issue on my MBP
This is where I am at right now :)
docker buildx build --platform linux/arm -f Docker/Dockerfile -t qt-ozw-allinone-timeout .
WARN[0000] invalid non-bool value for BUILDX_NO_DEFAULT_LOAD:
WARN[0000] No output specified for docker-container driver. Build result will only remain in the build cache. To push result image into registry use --push or to load image into docker use --load
[+] Building 854.4s (16/44)
quickly that the Pi cannot keep up on the main thread. This causes the queue of queued signals to become saturated with queued events from OZW, which leads to the timer not being able to get in to fire.
So the classic CS problem of thread starvation?
Man I am rusty with C++, but having non-working lights is making me want to pick it back up.
Until there's an actual fix for this issue, is there any way to take CPU resources from the core-openzwave docker container to artificially slow it down so that it doesn't overwhelm mqtt? Perhaps Portainer has something useful. I'll check this idea out tomorrow, because otherwise I'm just dead in the water.
what I did on my end as a workaround is to install mqtt locally on my pi and then bridged it to my main mqtt server. It seems to fix the issue at least for me. I also tried recompiling ozwd to extend the timeout but that didn't quite solve the issue completely.
try this docker compose
version: '3'
services:
mqtt:
image: eclipse-mosquitto
container_name: "mqtt-bridge"
volumes:
- ./mqtt:/mosquitto
- ./mqtt/data:/mosquitto/data
- ./mqtt/log:/mosquitto/log
ports:
- "1883:1883"
- "9001:9001"
restart: always
ozwd:
image: openzwave/ozwdaemon:latest
container_name: "ozwd"
depends_on:
- "mqtt"
security_opt:
- seccomp:unconfined
devices:
- "/dev/serial/by-id/usb-xxx"
volumes:
- ./ozw:/opt/ozw/config
ports:
- "1983:1983"
- "5901:5901"
- "7800:7800"
environment:
MQTT_SERVER: "pi.local.net"
MQTT_USERNAME: "[redacted]"
MQTT_PASSWORD: "[redacted]"
USB_PATH: "/dev/serial/by-id/usb-xxx"
OZW_INSTANCE: "1"
OZW_NETWORK_KEY: "[redacted]"
restart: always
Here is mqtt config
persistence true
persistence_location /mosquitto/data/
log_dest file /mosquitto/log/mosquitto.log
password_file /mosquitto/config/passwd
allow_anonymous false
# External MQTT Broker
connection zpie01
address hassio.local.net
topic OpenZWave/1/# both # <-- 1 is the id of one of many of my instances update as needed
remote_username [redacted]
remote_password [redacted]
Depending on the MQTT server you have, the performance of your device and the size of the network, you could improve things with those kinds of changes. With my network of 58 nodes, even with a local MQTT and Pi4, it couldn't complete quickly enough.
I'll try to remember to push my docker image with Olen's workaround (which is perfectly reasonable to use in "production") tomorrow morning.
Cheers, Brett
On Tue., Dec. 1, 2020, 8:13 p.m. m3ki, notifications@github.com wrote:
what I did on my end as a workaround is to install mqtt locally on my pi and then bridged it to my main mqtt server. It seems to fix the issue at least for me. I also tried recompiling ozwd to extend the timeout but that didn't quite solve the issue completely.
try this docker compose
version: '3'services: mqtt: image: eclipse-mosquitto container_name: "mqtt-bridge" volumes:
- ./mqtt:/mosquitto
- ./mqtt/data:/mosquitto/data
- ./mqtt/log:/mosquitto/log ports:
- "1883:1883"
- "9001:9001" restart: always ozwd: image: openzwave/ozwdaemon:latest container_name: "ozwd" depends_on:
- "mqtt" security_opt:
- seccomp:unconfined devices:
- "/dev/serial/by-id/usb-xxx" volumes:
- ./ozw:/opt/ozw/config ports:
- "1983:1983"
- "5901:5901"
- "7800:7800" environment: MQTT_SERVER: "pi.local.net" MQTT_USERNAME: "[redacted]" MQTT_PASSWORD: "[redacted]" USB_PATH: "/dev/serial/by-id/usb-xxx" OZW_INSTANCE: "1" OZW_NETWORK_KEY: "[redacted]" restart: always
Here is mqtt config
persistence truepersistence_location /mosquitto/data/ log_dest file /mosquitto/log/mosquitto.log password_file /mosquitto/config/passwdallow_anonymous false
External MQTT Brokerconnection zpie01address hassio.m3ki.nettopic OpenZWave/1/# both # <-- 1 is the id of one of many of my instances update as neededremote_username [redacted]remote_password [redacted]
— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/OpenZWave/qt-openzwave/issues/140#issuecomment-736977908, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAML466463XNO4Q4OWZ3YMDSSW5FRANCNFSM4PLIQ25Q .
It looks like adding " --cpu-shares 512 " might be a good start. Would any of you folks here know how I can get that into the command line that HA is using to start up the addon?
Depending on the MQTT server you have, the performance of your device and the size of the network, you could improve things with those kinds of changes. With my network of 58 nodes, even with a local MQTT and Pi4, it couldn't complete quickly enough. I'll try to remember to push my docker image with Olen's workaround (which is perfectly reasonable to use in "production") tomorrow morning. Cheers, Brett … On Tue., Dec. 1, 2020, 8:13 p.m. m3ki, @.***> wrote: what I did on my end as a workaround is to install mqtt locally on my pi and then bridged it to my main mqtt server. It seems to fix the issue at least for me. I also tried recompiling ozwd to extend the timeout but that didn't quite solve the issue completely. try this docker compose version: '3'services: mqtt: image: eclipse-mosquitto container_name: "mqtt-bridge" volumes: - ./mqtt:/mosquitto - ./mqtt/data:/mosquitto/data - ./mqtt/log:/mosquitto/log ports: - "1883:1883" - "9001:9001" restart: always ozwd: image: openzwave/ozwdaemon:latest container_name: "ozwd" depends_on: - "mqtt" security_opt: - seccomp:unconfined devices: - "/dev/serial/by-id/usb-xxx" volumes: - ./ozw:/opt/ozw/config ports: - "1983:1983" - "5901:5901" - "7800:7800" environment: MQTT_SERVER: "pi.local.net" MQTT_USERNAME: "[redacted]" MQTT_PASSWORD: "[redacted]" USB_PATH: "/dev/serial/by-id/usb-xxx" OZW_INSTANCE: "1" OZW_NETWORK_KEY: "[redacted]" restart: always Here is mqtt config persistence truepersistence_location /mosquitto/data/ log_dest file /mosquitto/log/mosquitto.log password_file /mosquitto/config/passwdallow_anonymous false # External MQTT Brokerconnection zpie01address hassio.m3ki.nettopic OpenZWave/1/# both # <-- 1 is the id of one of many of my instances update as neededremote_username [redacted]remote_password [redacted] — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#140 (comment)>, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAML466463XNO4Q4OWZ3YMDSSW5FRANCNFSM4PLIQ25Q .
For me I have 4 pis with 50-100 nodes or so and nothing would work. Even on the test ozwd instance where there were no nodes (what gives !?) OZWD would just disconnect with or without the keepalive modification in the code, iff the mqtt server was external to hassio. eclipse/mosquitto on docker on a beefy VM.
I have a feeling now that there is something going on with a network connection between the pi and the mqtt server. Things worked better if pi would connect to internal mqtt addon on hass server.
If you need to make sure startup completes you can wipeout ozwd cache on your pi that might help, just don't reset your zwave stick and all nodes will come back.
For me my network has been rock solid as of this morning with a setup I mention in my previous comment. PI (running ozwd and mqtt container) --->bridged to hassio mqtt.
By network connection, i meant there is something going on with how ozwd handles traffic and/or congestion. Even with the keepalive increased network wouldn't be stable. At least in my case.
I'm running a supervised install on Ubuntu 20.4 on an RPi4, with the Openzwave addon and the MQTT addon. I've been working to get everything back to normal after switching from the old all-in-one Zwave integration, and I thought I had finally got things right when this issue cropped up.
I'm running a supervised install on Ubuntu 20.4 on an RPi4, with the Openzwave addon and the MQTT addon. I've been working to get everything back to normal after switching from the old all-in-one Zwave integration, and I thought I had finally got things right when this issue cropped up.
A friend did this build yesterday when we were trying to troubleshoot this issue you can try his docker https://hub.docker.com/r/firstof9/qt-ozwdaemon it has an increased timeout.
Keep in mind it kinda worked for me but I would still experience intermittent disconnects every hour or so or when network got busy.
I am now back to the original docker though, with mqtt running on same host and bridging mqtt to mqtt with hassio on a separate VM. So far so good.
You can also wipeout ozwd cache file of your ozwdaemon and see if you can get your network back up.
Adding several more devices to my network caused this problem for me last weekend. I'm around 60 devices now with more to add. I'm running using VirtualBox on an Intel NUC so it doesn't appear easy to make any temporary changes to work around this issue. Last night I switched to Zwave2MQTT and have all my devices connected this morning. I'll have to spend some time renaming devices/entities but at least this will get me going again.
I really need a fix for this. The system fails too frequently as I am trying to add some door sensors and deadbolts. Deleting the cache file will let it restart, but many nodes lose their names, and some disappear. And it takes forever, anyway.
I tried Zwave2MQTT last week, and I switched when I ran into some issue (can't remember what it was now, lol). I may switch back and try it again. I think one issue was that it was clear from what I read that it was an orphaned project, and that the OpenZwave add-on with the OpenZwave integration was the road forward.
Is there any way to get the developers' attention quickly, or should I assume that this state of affairs will persist for a while?
Hey @psgcooldog, I am not a developer of qt-openzwave, but I am a C++ developer. I spent a bunch of time looking into this issue and I am reasonably confident that the fixed timeout should be sufficient for most networks. I have deployed that fix to my own network and have not seen any issues since. If your running into issues still, even after building and deploying the pull requests fix, it's likely there is another issue at play. Cheers, Brett
Hey @brett19 would this solution be better than simply setting an arbitrary timeout? this way anyone can adjust a timeout and tweak it as needed?
QString mqtt_keep_alive = qgetenv("MQTT_KEEP_ALIVE");
if (!mqtt_keep_alive.isEmpty()) {
this->m_client->setKeepAlive(mqtt_keep_alive);
}
I am having a hard time setting up a crosscompilation buildchain. How did you get it to work?
Sorry for the delay, forgot to push this the other day. Here are some armhf (32-bit ARM) images including fix-185. https://hub.docker.com/r/brett19/ozwdaemon
I decided to punt, and converted everything over to ZWave2MQTT. It was quite the time-consuming process, but this particular problem is no longer an issue for me.
@psgcooldog I also jumped ship for z2m, but ozw is far superior when it comes to handling scene events from switches. It actually comes into HA as a scene event and not a regular state change. And z2m sends 4 state changes per button press, so you need to go deep into the event to figure out if it really is a new scene event.
tl;dr: would prefer to use ozw but had to switch to z2m because ozw can't handle more than ~20-25 devices before it breaks down...
@karl-gustav I was just trying to figure scenes out with z2m. I have Inovelli red dimmers that allow multi-tapping to create scenes. Had it working fine with ozw, but no luck with z2m. I'll probably give up for now and make the switch back to ozw when this bug is fixed. At least all my regular light automations are working again.
Hey Everyone, can you confirm that you still have issues with the image I posted above containing fix 185.
If you can upload logs, that would help track down your specific issue beyond what we've already discovered.
Cheers, Brett
tl;dr: would prefer to use ozw but had to switch to z2m because ozw can't handle more than ~20-25 devices before it breaks down...
FWIW, I run ozw with 61 devices, and it has been running solid for 7 weeks. But I have only added a few new devices during that time, not removed any, and not done any network refreshes or other tricks. Restarting the container, on the other hand, is usually causing trouble. But as soon as it starts up, it seems pretty stable.
@brett19 I'm not that familiar with Docker. Is it possible to use Portainer to replace my current ozwdaemon with the one you created?
@brett19 I'm not that familiar with Docker. Is it possible to use Portainer to replace my current ozwdaemon with the one you created?
You would need to shut down the container and spin up a new one with the same configuration but different image (as far as I know). I personally use docker-compose to make it easier to do that.
Maybe I’ll have some time over the holidays to figure that out. For now z2m is running well. The only thing I can’t figure out is scenes from my dimmers. Is @Fishwaldo the only person that can release a new version of the addon? Given that the HA roadmap seemed to be heading down the ozw path, it seems pretty risky if there is only one person that can release bug fixes/workarounds.
A friend compiled "this" fix and added my fix to add MQTT_KEEP_ALIVE environment variable to change timeout as needed docker here https://hub.docker.com/r/firstof9/qt-ozwdaemon
Before this fix my setup would still restart if I did "Refresh node"
MQTT_KEEP_ALIVE: "360"
My config is here (keep in mind I am using a local mqtt on the pi that bridges to a main mqtt sever to make sure ozw doesn't restart if HASS instance is restarted
version: '3'
services:
mqtt:
image: eclipse-mosquitto
container_name: "mqtt-bridge"
volumes:
- ./mqtt:/mosquitto
- ./mqtt/data:/mosquitto/data
- ./mqtt/log:/mosquitto/log
ports:
- "1883:1883"
- "9001:9001"
restart: always
ozwd:
#image: openzwave/ozwdaemon:allinone-latest
image: firstof9/qt-ozwdaemon:latest
container_name: "ozwd"
depends_on:
- "mqtt"
security_opt:
- seccomp:unconfined
devices:
- "/dev/serial/by-id/usb-0658_0200-if00"
volumes:
- ./ozw:/opt/ozw/config
ports:
- "1983:1983"
- "5901:5901"
- "7800:7800"
environment:
MQTT_SERVER: "localhost.mydomain.net"
MQTT_USERNAME: "[redacted]"
MQTT_PASSWORD: "[redacted]"
MQTT_KEEP_ALIVE: "360" <------ add keep alive like so
USB_PATH: "/dev/serial/by-id/usb-0658_0200-if00"
OZW_INSTANCE: "3"
OZW_NETWORK_KEY: "[redacted]"
restart: always
Hi, I have the same issues. Are there any solutions in sight? Or at least a temporary workaround for people like me, who run the official image on a raspberry pi?
I've tried switching to zwave2mqtt but I couldn't figure out how to add the devices and entities to HomeAssistant. Auto discover didn't work either. And I really don't want to have to do everything manually.
Is it correct that everyone who has a sufficiently large network using the OpenZWave Plugin is experiencing this issue?
Oh and also I'm quite new to all of this. So I'm JUST learing how to access certain files in the docker containers, getting OS SSH Access etc. This is how I've managed to delete the ozwcache file at least, so I don't have to install the whole thing from scratch. But everytime I "reset", start up the OpenZWave Plugin and let my network run for a bit it seems to randomly miss some entities.
Is there a way to restore the ozwcache from an old file without running into the timeout issue, maybe?
I'm desperate at this point.. so I'm thankful for any help :)
@genome-prime I was unable to figure out how to get the workaround installed on my setup. I ended up switching to z2m and was able to get auto discover to work. It did require that I re-enter all the entity names, which was a bit time consuming. So far it has been very stable. I have about 80 zwave devices. I had trouble getting the central scene figured out with z2m. I'm now looking at Node-Red with an MQTT node to grab the central scene info from there.
I'm giving up in frustration... I gave z2m another shot. This time my devices were auto detected and showed up in Home Assistant.
Unfortunately
I know there are customizations but I couldn't figure out where the config files are being stored and how to apply them and honestly I don't wanna have to customize anything, since OZW was detecting everything fine on its own (except my Fibaro Button which I can live without for now)
I know this might not be the right or best place for it but I need to get this off my chest: It would be much more fun if there were official, step-by-step, up-to-date Tutorials out there for setting up the different ways of a core feature (Z-Wave) of HomeAssistant.
I guess I should at least say some positive things too:
zwave2mqtt:
OpenZWave:
Hey Everyone, can you confirm that you still have issues with the image I posted above containing fix 185.
If you can upload logs, that would help track down your specific issue beyond what we've already discovered.
Cheers, Brett
Your docker image seemed to fix the issue for me!
I have a new and fresh installation of Home Assistant on Raspberry Pi per 25th December 2020 and have the same problem: "Client qt-openzwave-1 has exceeded timeout, disconnecting."
Should this bug even show up on a totally clean install?
Seems like it is the Aeotec Z-Wave Gen5 stick that doesn't work on Raspberry Pi4. At least 3 (of 4) hardware versions of the stick: https://community.home-assistant.io/t/sticky-aeotec-z-stick-gen5-raspberry-pi4/218405/25
Seems like it is the Aeotec Z-Wave Gen5 stick that doesn't work on Raspberry Pi4. At least 3 (of 4) hardware versions of the stick: https://community.home-assistant.io/t/sticky-aeotec-z-stick-gen5-raspberry-pi4/218405/25
Did you try this solution?
No, but I will get a cheap unpowered USB 2.0 hub. That will do the trick. Also it makes the stick further away from the Raspberry which is supposed to reduce interference.
@m3ki - thank you. That docker image has seem to have done the trick for getting my network back online. It was super frustrating - I could see through the admin gui that everything was up and running, could see it posting into MQTT, but for some reason HA kept saying that everything was "unavailable"...
I am running docker images ozwdaemon:latest and eclipse-mosquitto:latest as of yesterday on a Raspberry Pi together with Home Assistant 0.113.1. I have about 90 devices and over 500 entities.
As long as I don't really do anything, it's running pretty stable. However, using OZWAdmin-0.1.74, if I push the z-wave network a bit too far, such as healing and/or refreshing too many nodes at the same time, basically clogging the network with messages, the ozwdaemon mqtt client doesn't seem to be able to keep up. All entities in Home Assistant becomes unavailable, and I see the following in the mosquitto logs:
So I restart the ozwdaemon docker instance, and I see the folling in the Mosquitto logs:
Then it works for just about under two minutes before it gets disconnected again:
From what I understand (please, correct me if I'm wrong) that "k60"-part in the Mosquitto logs means "keepalive = 60", i.e. the MQTT client tells the broker when connecting that it will stay in touch with a ping message at least once every minute, and if that doesn't happen, the client will be disconnected.
I increased logging in mosquitto (by setting "log_type all" in mosquitto.conf) and also started ozwdaemon with
-e QT_LOGGING_RULES="*.debug=false;ozw.mqtt.publisher.debug=true"
and I can see in the mosquitto logs
while ozwdaemon continues to print out hundreds of rows such as
for several minutes until it realizes that the connection is gone:
The only way I can get it to stay up is to remove/rename the ozwcache_0xf7b52c8f.xml file and restart, but it doesn't feel like a good solution.
Any ideas on what's going on?