dresden-elektronik / deconz-rest-plugin

deCONZ REST-API plugin to control ZigBee devices
BSD 3-Clause "New" or "Revised" License

deCONZ lost connection to RaspBee - heat problems? #147

Closed ebaauw closed 7 years ago

ebaauw commented 7 years ago

Today, deCONZ lost connection to the RaspBee. The deCONZ GUI was on the initial device selection screen, and wouldn't connect to the RaspBee. I could quit deCONZ through the menu. After resetting the RaspBee through GCFFlasher_internal -r, deCONZ started normally.

Attached the relevant part (I hope) of the deCONZ log. log.zip

I wasn't doing anything on the ZigBee network, but I was compiling the REST API plugin at the time. Could it be the RaspBee overheated? I had a couple of hangs which I suspect to be heat-related. The Raspberry would lose network connectivity until power cycled, while running make -j3 (which would drive the CPU utilisation close to 100%). I've been using plain make for a couple of days and haven't seen any issues since. I run the Raspberry Pi headless, only connected to wired ethernet, and use VNC and SSH to connect to it.

ebaauw commented 7 years ago

Just to be sure, I set up some CPU temperature monitoring on both my Raspberry Pi machines. On the production Pi (with the RaspBee), the CPU temperature is 63.4°C when just running deCONZ and homebridge. deCONZ consumes ~20% CPU. homebridge takes ~2% CPU. On the test Pi (no add-on board), currently just running Raspbian, the CPU temperature is 54.8°C. This is, of course, if you believe the onboard sensor (reported through vcgencmd measure_temp).

Some googling tells me that, indeed, the Raspberry Pi throttles the CPU speed (measured through vcgencmd measure_clock arm) when the CPU temperature rises above ~80°C. I also found references saying that the Pi would shut down if it's still too hot - that would explain my apparent hangs of the past weeks.
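For anyone wanting to repeat this kind of monitoring, here is a minimal Python sketch that polls the two vcgencmd readings mentioned above. It assumes the usual output formats (something like temp=63.4'C and frequency(45)=872000000); the helper names and the 80°C constant (taken from this comment, not from official documentation) are illustrative only.

```python
import re
import subprocess
import time

# Approximate throttling threshold quoted in this thread (an assumption).
THROTTLE_TEMP_C = 80.0


def parse_temp(output: str) -> float:
    """Parse measure_temp output such as "temp=63.4'C" into degrees Celsius."""
    match = re.search(r"temp=([0-9.]+)", output)
    if match is None:
        raise ValueError(f"unexpected measure_temp output: {output!r}")
    return float(match.group(1))


def parse_clock_mhz(output: str) -> float:
    """Parse measure_clock output such as "frequency(45)=872000000" into MHz."""
    match = re.search(r"=(\d+)", output)
    if match is None:
        raise ValueError(f"unexpected measure_clock output: {output!r}")
    return int(match.group(1)) / 1_000_000


def vcgencmd(*args: str) -> str:
    """Run vcgencmd with the given arguments and return its trimmed output."""
    result = subprocess.run(
        ["vcgencmd", *args], capture_output=True, text=True, check=True
    )
    return result.stdout.strip()


def log_samples(count: int, interval_s: float = 5.0) -> None:
    """Print temperature and ARM clock `count` times, `interval_s` apart."""
    for _ in range(count):
        temp = parse_temp(vcgencmd("measure_temp"))
        clock = parse_clock_mhz(vcgencmd("measure_clock", "arm"))
        note = " (at or above throttle threshold)" if temp >= THROTTLE_TEMP_C else ""
        print(f"{temp:.1f} C  {clock:.0f} MHz{note}")
        time.sleep(interval_s)
```

Running log_samples(60) in a second SSH session while a build is in progress would give one reading every five seconds for five minutes.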

Let's do some simple stress tests:

It looks like make -j2 is the optimum choice. I don't suppose I could cross-compile the REST API plugin on macOS?

On my Raspberry Pi 3 B, the RaspBee sits half above the CPU, making me think a hot CPU could impact the RaspBee. I use cases with some ventilation holes, but I don't suspect there's much airflow. I ordered some (low-height, copper) heat sinks, just in case, but I'm not sure that will help the RaspBee. Will post back after I've installed and tested them.

EDIT
I've moved my development off the production Pi to the test Pi (for now, without a RaspBee). During a make -j3 the CPU temperature rises to 83.3°C. The CPU is throttled to 872MHz. The build takes 4:38 minutes.

manup commented 7 years ago

Very interesting tests, I'll try make -j2 and will update README.md.

It looks like make -j2 is the optimum choice. I don't suppose I could cross-compile the REST API plugin on macOS?

It might be possible but I don't know how, maybe QEMU can be used here. Would be very nice to get working cross compilation. But it also might cause tricky problems, we had issues running RPi 2/3 compiled binaries on RPi 1 while the other way around worked well. That's the reason we compile stable releases just on RPi 1, which takes ages :)

On the production Pi (with the RaspBee), the CPU temperature is 63.4°C when just running deCONZ and homebridge. deCONZ consumes ~20% CPU.

When you minimize the deCONZ window, CPU consumption should drop to 4–6%. Something in the GUI is quite heavy; it's a known issue, but I'm afraid it will take some time to fix.

As an alternative, deCONZ can be started completely headless via systemd, or by using the same deCONZ command-line arguments the systemd script uses.

ebaauw commented 7 years ago

Installed a copper heatsink on the Raspberry's CPU, and, from this set, the copper plate on the RAM and the small aluminium heatsink on the USB/Ethernet controller. The CPU sink doesn't fully cover the CPU. The green aluminium sink from the other set is a bit larger (and higher), but I haven't tried that one.

The production Pi's CPU is now a bit cooler when just running deCONZ (full screen) and homebridge: 60.1°C vs 63.4°C without the heat sinks. This drops to 59.1°C with the deCONZ window minimised. The CPU on the test Pi, now also with a RaspBee installed but not running deCONZ, is also at 59.1°C (vs 54.8°C the other day).

The proof of the pudding: a clean make -j3 on the test Pi now takes 4:55 minutes. The CPU temperature rises to 83.3°C and the CPU is throttled to 868MHz. Slightly worse than without RaspBee and heat sinks, but better than with RaspBee and no heat sinks.

Of course, these aren't controlled tests in a laboratory environment, but I think it's safe to conclude that the RaspBee shield does impact the Raspberry Pi's airflow, causing a higher CPU temperature. The heat sinks reduce the temperature, but not enough to fully negate the effect of the RaspBee.

ebaauw commented 7 years ago

This morning, all lights in the deCONZ GUI were showing red, and the RaspBee doesn't seem to do any ZigBee communication at all. deCONZ and the REST API plugin are fully functional - my wakeup schedule fired and set the CLIP status. Also, the ZigBee network is fully functional - the lights react to the dimmer switches alright.

Disconnected / reconnected the RaspBee to the ZigBee network - RaspBee node shows, status "light" blinking blue, but no nodes are found. Exit and restart deCONZ: no change. Exit deCONZ, reset the RaspBee through sudo GCFFlasher_internal -r, start deCONZ: no change. Power down the Raspberry Pi and restart it: no change.

Checked the network settings in the deCONZ GUI: that's a different channel, and PANID (and network key, I suppose). Did the RaspBee lose the settings in its non-volatile memory? deCONZ v2.04.70, RaspBee firmware 26160500.

I don't think this is heat-related. The CPU temperature remained below 60°C throughout the night.

EDIT Entered the previous values in the network settings (am I glad I took a note of these!) and deCONZ is finding my devices.

manup commented 7 years ago

Checked the network settings in the deCONZ GUI: that's a different channel, and PANID (and network key, I suppose). Did the RaspBee lose the settings in its non-volatile memory? deCONZ v2.04.70, RaspBee firmware 26160500.

Oh damn, it seems so :/ This exact problem (losing the channel) has recently been put on the high-priority list; we have seen the issue before. The bug is not restricted to the latest firmware, it's been there for a while and we are on it.

If you write back the actual channel and panid and then disconnect and rejoin it should work again. In our experience network key was not destroyed so it might still work.

ebaauw commented 7 years ago

If you write back the actual channel and panid and then disconnect and rejoin it should work again. In our experience network key was not destroyed so it might still work.

I have, and indeed it works again (after taking a ridiculously long time to re-discover the entire network). I think it did lose the network key as well, but I'm not sure (unlike the panid and channel, I wouldn't recognise it). I bluntly wrote it back anyways.

imammedo commented 6 years ago

If you write back the actual channel and panid and then disconnect and rejoin it should work again. In our experience network key was not destroyed so it might still work.

It looks like this thing (network loss) happened to me as well (I'm using a USB stick attached to a VM running Fedora).

I suppose that I can 're-discover' the right channel by brute force, but is there a way to find out what panid it used to have? PS: I still haven't reset all the lights, so they are still on the old network.

manup commented 6 years ago

You can do a Touchlink scan in the webapp under settings, it should show the devices + channel and panid.

imammedo commented 6 years ago

Do you mean the "scan for devices" button in the "Reset Devices via Touchlink" section? /running 2.04.84/ Wouldn't it actually reset a device? /Well, I guess I can sacrifice one sensor for the experiment/

manup commented 6 years ago

Yes, "scan for devices". The scan alone only lists lights on other channels.

imammedo commented 6 years ago

Pressing the button didn't find anything and the search finished. I had almost given up on it and was looking for some clues in the debug output, but then I looked at the settings page again and it showed a Tradfri dimmer that I had put next to the USB stick (it was displayed as a light) with "Channel" and "Network Id". Once I changed deCONZ to these values, it joined the network successfully and all devices were rediscovered fine.

Thanks for prompt help and saving me from the dark night :)

ebaauw commented 6 years ago

Pressing the button didn't find anything and the search finished.

With their latest firmware, you need to power-cycle Hue lamps to make them discoverable for like 30 minutes.

imammedo commented 6 years ago

Maybe the following could improve deCONZ troubleshooting/usability:

  1. storing the network configuration in the deCONZ config
  2. reporting an error on startup if the deCONZ config and the non-volatile memory don't match
  3. exiting with an error code so that systemd could notify the user about a broken network
  4. adding an option to force-push the deCONZ network config to the device
Well, it turned out sort of like a feature req.
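To make the suggestions above more concrete, here is a rough Python sketch of points 1 to 3: keep a copy of the network settings in a file, compare it with what the device reports, and return a non-zero exit code on a mismatch. This is not how deCONZ works internally; NetworkConfig, the JSON file layout and all function names are hypothetical, and reading from or writing to the RaspBee/ConBee (point 4) is left out because it depends on deCONZ itself.

```python
from __future__ import annotations

import json
import sys
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass(frozen=True)
class NetworkConfig:
    """Hypothetical container for the settings discussed in this thread."""
    channel: int
    panid: str
    network_key: str


def save_config(path: str, config: NetworkConfig) -> None:
    """Point 1: keep a copy of the network settings outside the device."""
    Path(path).write_text(json.dumps(asdict(config), indent=2))


def load_config(path: str) -> NetworkConfig:
    """Read back a copy previously written by save_config."""
    return NetworkConfig(**json.loads(Path(path).read_text()))


def diff_config(stored: NetworkConfig, device: NetworkConfig) -> list[str]:
    """Return the names of the fields that differ, in declaration order."""
    stored_d, device_d = asdict(stored), asdict(device)
    return [name for name in stored_d if stored_d[name] != device_d[name]]


def check_on_startup(stored: NetworkConfig, device: NetworkConfig) -> int:
    """Points 2 and 3: report a mismatch and return a process exit code."""
    mismatched = diff_config(stored, device)
    if mismatched:
        print(
            "network configuration mismatch: " + ", ".join(mismatched),
            file=sys.stderr,
        )
        return 1
    return 0
```

A wrapper could pass the result to sys.exit() so that a systemd unit sees the failure; how the device-side values would be obtained is the open question.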

ebaauw commented 6 years ago

Got a new Raspberry Pi 3 B+, running at 1.4 GHz. My setup has changed since the tests last August, so the results aren't comparable. I now keep the sources on my MacBook, and mount them over CIFS on the Pi. Also, I have a different case, without airflow holes (the new PoE header doesn’t fit the other case). I did attach the same 3 heat sinks and installed the RaspBee. Finally, I’ve overclocked the SD card access.

I’ve yet to move production to the 3 B+, but running VNC on my test Pi is much smoother, especially zooming in the deCONZ GUI.