dresden-elektronik / deconz-rest-plugin

deCONZ REST-API plugin to control ZigBee devices
BSD 3-Clause "New" or "Revised" License
1.9k stars 502 forks source link

Mesh problems with IKEA devices #195

Closed donnib closed 5 years ago

donnib commented 7 years ago

I see problems with IKEA meshing (i pressume that's the problem) together with other devices from other makers for example i see following cases quite often :

  1. I can't control my osram lightpole lights even though they are very close to my IKEA lights which they mesh thru. If i click on it then use 0 they usually come back to life.
  2. My Hue Dim switch lights up red when i click on the buttons, clicking 0 in deconz doesn't help, it usually comes back by itself after quite some time.

@manup do you have a suggestion how i could troubleshoot this further to find the root cause to this ?

ebaauw commented 7 years ago

Cool. What gamut does it support? Does it do attribute reporting on Current X and Current Y? What's the model?

How do you derive xy from ct? That's been a long outstanding question on the Hue developer's forum where Philips support is very quiet. Others have given some formulas, but I haven't tested these.

mariusmotea commented 7 years ago

I made wifi esp8266 based lights firmware for my project that accept all xy,ct and hue parameters. All conversion formulas are in code, but from ct to xy directly i don't have, you will need to convert first to rgb

Marius.

manup commented 7 years ago

Cool. What gamut does it support?

Sadly it doesn't tell, the primary values of the color cluster return unsupported, but it sure is wider than the E27 hue color bulb since it goes deeper into green and blue.

Does it do attribute reporting on Current X and Current Y?

I guess yes, the IKEA gateway uses reporting for almost everything, need to write a bit more code to set it up.

What's the model?

TRADFRI bulb E27 CWS opal 600lm

And it has ct min/max = 250/454.

How do you derive xy from ct?

Color temperature is the black body curve in CIE xy. I don't have a formula though.

image

mariusmotea commented 7 years ago

Convert ct to rgb:

void convert_ct() {
  int hectemp = 10000 / ct;
  int r, g, b;
  if (hectemp <= 66) {
    r = 255;
    g = 99.4708025861 * log(hectemp) - 161.1195681661;
    b = hectemp <= 19 ? 0 : (138.5177312231 * log(hectemp - 10) - 305.0447927307);
  } else {
    r = 329.698727446 * pow(hectemp - 60, -0.1332047592);
    g = 288.1221695283 * pow(hectemp - 60, -0.0755148492);
    b = 255;
  }
  r = r > 255 ? 255 : r;
  g = g > 255 ? 255 : g;
  b = b > 255 ? 255 : b;
  rgb[0] = r * (bri / 255.0f); rgb[1] = g * (bri / 255.0f); rgb[2] = b * (bri / 255.0f);
}

then from rgb to xy:

Get the RGB values from your color object and convert them to be between 0 and 1. So the RGB color (255, 0, 100) becomes (1.0, 0.0, 0.39)
Apply a gamma correction to the RGB values, which makes the color more vivid and more the like the color displayed on the screen of your device. This gamma correction is also applied to the screen of your computer or phone, thus we need this to create a similar color on the light as on screen. This is done by the following formulas:

float red = (red > 0.04045f) ? pow((red + 0.055f) / (1.0f + 0.055f), 2.4f) : (red / 12.92f);
float green = (green > 0.04045f) ? pow((green + 0.055f) / (1.0f + 0.055f), 2.4f) : (green / 12.92f);
float blue = (blue > 0.04045f) ? pow((blue + 0.055f) / (1.0f + 0.055f), 2.4f) : (blue / 12.92f); 
Convert the RGB values to XYZ using the Wide RGB D65 conversion formula The formulas used:

float X = red * 0.664511f + green * 0.154324f + blue * 0.162028f;
float Y = red * 0.283881f + green * 0.668433f + blue * 0.047685f;
float Z = red * 0.000088f + green * 0.072310f + blue * 0.986039f;
Calculate the xy values from the XYZ values

float x = X / (X + Y + Z);
float y = Y / (X + Y + Z);

This is from hue api documentation link

ebaauw commented 7 years ago

Thanks, @mariusmotea. Where did you get the ct to rbg formula from? Do you also have the inverse (rgb to ct)?

mariusmotea commented 7 years ago

Sorry, i don't have the rgb to ct formula and I don't remember from where i get the one from ct to rgb, maybe google can help. I will try to help on this problem to develop an optimized ct to xy formula based on that curve.

ebaauw commented 7 years ago

Couldn't resist and got myself an IKEA color bulb. SW Build 1.3.002, Date Code 20170315. Firmware image is 0x2801, but no file included in the IKEA downloads.

It's a ZLL Color Light, so it doesn't support colour temperature. It reports Color capabilities (0x400A) as only CIE 1931 XY, so no hue/sat and no colorloop either. A bit odd, but probably still compliant with ZLL? Indeed, it reports Color temperature min (0x400B) and Color temperature max (0x400C), but that doesn't make sense. My Hue color lights don't report these attributes.

The REST API reports the light as follows:

{
  "etag": "99e349df49ba922027e4183bc522e8f5",
  "hascolor": true,
  "manufacturername": "IKEA of Sweden",
  "modelid": "TRADFRI bulb E27 CWS opal 600lm",
  "name": "Light 2",
  "state": {
    "alert": "none",
    "bri": 1,
    "colormode": "xy",
    "effect": "none",
    "hue": 0,
    "on": false,
    "reachable": true,
    "sat": 0,
    "xy": [
      9.19132e-05,
      9.19132e-05
    ]
  },
  "swversion": "1.3.002",
  "type": "Color light",
  "uniqueid": "00:0b:57:ff:fe:xx:xx:xx-01"
}

The REST API correctly doesn't expose state.ct (probably because it's a Color Light) but it should also suppress state.hue, state.sat, and state.effect (and config.ctmin and config.ctmax). Also note the funny formatting of state.xy.

The bulb doesn't seem to report the actual Current X and Current Y values like the Hue lights do - it echoes the values last set. It accepts the full range 1..65535. When setting Current Y to 0 (which is invalid), the light shows blue-ish and flickers. When changing brightness it flickers again. The deCONZ REST API has a bug: it should set Current X and Current Y to 1 when setting "xy": [0,0]. And there's still the issue that deCONZ maximises the values to 65279. Also it uses this number to convert the ZigBee uint16 value to a JSON float, instead of 65535, see #30, point 6.

manup commented 7 years ago

The REST API correctly doesn't expose state.ct (probably because it's a Color Light) but it should also suppress state.hue, state.sat, and state.effect (and config.ctmin and config.ctmax). Also note the funny formatting of state.xy.

The non existing values will be filtered out soon, I'm working on using the color capabilities attribute for that. I think the ctmin/ctmax might be ok since the lights does provide these values.

The deCONZ REST API has a bug: it should set Current X and Current Y to 1 when setting "xy": [0,0].

Why 1?

And there's still the issue that deCONZ maximises the values to 65279. Also it uses this number to convert the ZigBee uint16 value to a JSON float, instead of 65535, see #30, point 6.

There was a reason for that number, albeit I'd forgot which one, need to look in the spec and properly some ZLL firmware.

Forecast: There is an upcoming commit for the color cluster attribute reporting which will include xy and color mode.

manup commented 7 years ago

Here we go: http://www.dresden-elektronik.de/rpi/deconz/beta/deconz-2.04.80-qt5.deb

The version is experimental, after startup phase polling of IKEA and OSRAM will be dropped smoothly in favor for reporting. Further neighbor table requests will be slowed down after startup phase.

In contrast to that, lights which don't support attribute reporting will be polled much more frequently. There is still much room to optimize certain requests, which will be addressed in the next versions.

holli73 commented 7 years ago

@manup,

i did compile on x64 and after restart of deconz it did abort during lights logon

[913888.052943] deCONZ[850]: segfault at fa200420000 ip 00000fa200420000 sp 00007fff5663c178 error 14
[913920.796795] deCONZ[9900]: segfault at 18 ip 00007fc5025d14e3 sp 00007fffca2a5cd8 error 4 in libdeCONZ.so.1.2.0[7fc5025b6000+4b000]

next start is still running and all 61 devices are available.

best regards holli

manup commented 7 years ago

Thanks for feedback, might be related to the updated core. I'll release 2.04.80 for x64 with dev-packages later today.

ebaauw commented 7 years ago

The deCONZ REST API has a bug: it should set Current X and Current Y to 1 when setting "xy": [0,0].

Why 1?

I meant the CurrentX and CurrentY uint16 value of 1, not the JSON state.xy[0] or state.xy[1] value of 1.0. This 1 is the closest to 0 and should translate to 1/65535 or 0.000015.

According to the ZLL spec, for CurrentX and CurrentY:

A value of zero indicates invalid and the CurrentHue and CurrentSaturation attributes are used instead to determine the color.

The IKEA colour bulb doesn't support CurrentHue and CurrentSaturation and it acts weird when setting CurrentY to 0.

And there's still the issue that deCONZ maximises the values to 65279. Also it uses this number to convert the ZigBee uint16 value to a JSON float, instead of 65535, see #30, point 6.

There was a reason for that number, albeit I'd forgot which one, need to look in the spec and properly some ZLL firmware.

The ZCL spec specifies the range of CurrentX and CurrentY values to 0x0000 - 0xfeff or 65279. I can't find a reference to the maximum value in the ZLL spec. The IKEA bulb acts funny when setting CurrentY to 65536, but is happy with 65535.
So maximising the value to 0xfeff is probably correct, but the Hue bridge divides by 65535 (I think) to derive the JSON values. The maximum value would then translate to 0.996367, which is outside the CIE 1931 space anyways.
I'll do some testing and sniffing on the Hue bridge to confirm whether it actually uses 65535 or 65536, but it only shows four decimals, so I doubt it would make a difference.

ebaauw commented 7 years ago

The Hue bridge issues a Move to Color command when setting xy. It uses the xy values specified, multiplied by 65536, and adds the default transitiontime of 4. So 0 results in 0x0000, 1 results in 0xffff, 0.5 results in 0x8000. The Hue lights report the actual colour, so (0x2AC0, 0x0A3D) after moving to (0x0000, 0x0000) for a gamut B light (see https://developers.meethue.com/documentation/supported-lights). The Hue API translates this to [0.1670, 0.0400].

The Trådfri bulb echoes the colour moved to. Indeed, when setting {"xy":[0,0]} from the Hue bridge, the light goes into error as well. Same for setting {"xy":[1,0]}. Setting {"xy":[0,0.0001]}works (blue), as does setting {"xy":[1,0.0001]} (red).

The IKEA bulb acts funny when setting CurrentY to 65536, but is happy with 65535.

I stand corrected. Setting {"xy":[0,1]} works as well (green) - the light seems happy with a CurrentY value of 0xffff.

I remember getting a division by zero error in homebridge-hue when y = 0 while converting xy back to rgb. With Hue lights, this would never happen as I would map the values to the range supported by the light (as per Philips' guidelines), but it would happen with other lights where I don't know the gamut, and with groups, where I don't apply the mapping.

I suspect something similar is happening in the Trådfri firmware. IKEA probably didn't test this, since they only use a fixed set of colours from the remote.

manup commented 7 years ago

Very interesting results, maybe we should create/extend a Wiki page to document such issues for vendor, model id, firmware version. This can be forwarded to IKEA so they can fix it and provide OTA files.

The plugin could jump in between and check for bounds for certain devices via a static table with x-min, x-max, y-min and y-max, but this might be error prone and lead to unexpected results for group commands in mixed vendor setups.

The Hue bridge issues a Move to Color command when setting xy. It uses the xy values specified, multiplied by 65536.

Should be safe to the calculation in the same way.

ebaauw commented 7 years ago

The plugin could jump in between and check for bounds for certain devices via a static table with x-min, x-max, y-min and y-max,

Philips leaves that to the application, but I see value in providing this functionality in deCONZ. If you want to go there, you should actually check that the colour is within the triangle formed by the points for red (closest to (1, 0)), green (closest to (0, 1)), and blue (closest to (0, 0)), and, if not, adjust the colour to the nearest point on/in the triangle. See https://developers.meethue.com/documentation/color-conversions-rgb-xy or https://github.com/ebaauw/homebridge-hue/blob/196d0b71ceb3ea1a2e58fdb98e7777d81a1db09a/lib/HueLight.js#L28

To take this a step further, we could actually do the HSV to RGB to CIE 1931 xy conversion in deCONZ. I never understood the use of state.hue and state.sat as exposed by the Hue bridge. Alternatively, we could provide a state.rgb accepting the six digit hex string used for web colours.

but this might be error prone and lead to unexpected results for group commands in mixed vendor setups.

Yeah. I don't do any checking for group commands in homebridge-hue and rely on deCONZ (or Hue bridge) polling (or light attribute reporting) to report the actual values. Also, there seem to be more lights, like the IKEA, that don't report the actual values (see https://github.com/ebaauw/homebridge-hue/issues/171). The innr colour bulb does report the actual values, see https://github.com/ebaauw/homebridge-hue/issues/152. I don't know about the OSRAM bulb or gardenspot or the dresden elektronik ballasts (sorry). I still haven't included ph_light_values in homebridge-hue.

donnib commented 7 years ago

So here is an update on the problem that i opened here.

I am running 2.04.80 and the problems that i reported are not changed and may be even worse now. In this short sniff log i tried twice to turn on/off a IKEA light with mac address : "uniqueid": "00:0b:57:ff:fe:3a:36:3a-01" thru REST API and didn't work. The light didn't react but the REST API did. After upgrade i did sudo GCFFlasher_internal -r then i started deConz.

log_new.dcf.zip

I can see that there are few lines in deConz so my quess is that the ques filled up.

UPDATE: another issue i see is that the detection of nodes doesn't work as did before probably because of no polling. deConz finds much fewer nodes that it did in 2.04.79.

ebaauw commented 7 years ago

So here is an update on the problem that i opened here.

Sorry, @donnib, I got a bit off track.

Yes, I also see some meshing issues with v2.04.80 where switches don't seem to be reachable by the RaspBee, see #216. Also, I just experienced a hiccup, where it took like 5 to 10 seconds for the lights to react to the motion sensors (multiple rooms).

donnib commented 7 years ago

@ebaauw no problem at all, i think it's relevant. I am also interested in the investigations you are doing regarding the IKEA color bulb and will probably get some when/if we nail these problems.

manup commented 7 years ago

@donnib the log shows only 1 minute, is that correct? It looks like deCONZ does nothing at all, not a single request is in there?

Can you please provide an output with $ deCONZ --dbg-info=1 --dbg-zdp=1, if queues fill up it will be visible there.

donnib commented 7 years ago

@manup that is correct, i just started a sniffer log when i see problems and at the moment it's all the time. I power cycled the rPi this morning and when i came home everything came to a halt so i restarted deConz and here is the log, nothing reacts.

log.txt

manup commented 7 years ago

Thanks for the logs, they show that the queues are filled up.

device discover rotate, too busy (10 requests in que, 1)

Sadly I can't see what is in there, discovery and polling already take care to not fill up queues so it must be older code which is not guarded against this yet. I'll put up more debug output in the next version which shows in more detail whats going on and where the related tasks are added.

The queue problems can be debugged fixed but I'm currently not sure if their are IKEA specific problems as well, or something we didn't think of yet: WiFis on the same channel band, which can be checked with specific apps like WiFi analyzer — we have seen the most curious things in the past which may have a dramatic impact on RF.

So when the queues become stable where might be more to check.

donnib commented 7 years ago

@manup My WiFi is running 11+1 and 40+1 channels so there should not be interference, i live in a house so i can't imagine much interference from the neighbors.

manup commented 7 years ago

WiFi for 2.4 GHz has channels 1–14, the tricky part is that one WiFi channel does overlap 3 ZigBee channels as shown in the following image. Also take a look at the numbers while WiFi channel 11 is at the end ZigBee channel 11 is at the beginning.

So in your case of WiFi on channel 11 it's best to use ZigBee at channel 11, 15, 20 or 25 but not channels 21–24 (I figure that's already the case). This only takes care of WiFi, Bluetooth for instance does channel hopping and might cause problems too, albeit I have not seen this yet. RF can be tricky.

image

i live in a house so i can't imagine much interference from the neighbors.

Depends totally on what devices your neighbors are using.

We had this one case were a neighbor had house and garden full of Sonos WiFi streaming speakers with dynamic channels assignment basically jamming the whole spectrum over time :) Until this was solved the lower gain RF had near to no chance to come through.

WiFi/Bluetooth interference might not be an issue here, but it's always best to exclude it as reason for sure.

donnib commented 7 years ago

Yes so that's why i am running channel 25 and have been since i got the the RaspBee

donnib commented 7 years ago

@manup i upgraded to 2.04.81, i'll report status later today or tomorrow. After an upgrade before i start deConz i usually execute sudo GCFFlasher_internal -r and i just want to check if the output is correct ?

rmmod: ERROR: Module ftdi_sio is not currently loaded
rmmod: ERROR: Module usbserial is not currently loaded
GCFFlasher V2_10 (c) dresden elektronik ingenieurtechnik gmbh 2016/09/26
no device specified use RaspBee (/dev/ttyS0)
reset target SUCCESS
manup commented 7 years ago

@manup i upgraded to 2.04.81, i'll report status later today or tomorrow. After an upgrade before i start deConz i usually execute sudo GCFFlasher_internal -r and i just want to check if the output is correct ?

Works but it needs longer then to reestablish the routing tables, these aren't stored on the RaspBee yet. I normally to a -r only if RaspBee is completely stuck, which in my networks happens almost never.

If possible I would also recommend compile latest GIT version, there are two more commits after 2.04.81 which might be relevant in your network. Or wait till 2.04.82 comes out tomorrow.

For logs it would be helpful to expand output:

$ deCONZ --dbg-printf=1 --dbg-aps=1 --dbg-zdp=1

This is very noisy but should give more insight.

donnib commented 7 years ago

@manup i usually pull all changes from master so i think i am ok, i always get latest changes on master after i install new version of deConz. I am on 33b54cb and i did a build of it after installing deConz then launched deConz. I'll launch deConz with above command, can i just pipe output of deConz to a file like $ deCONZ --dbg-printf=1 --dbg-aps=1 --dbg-zdp=1 > log.txt ?

No problem i'll just upgrade again tomorrow, since nothing works for me atm i don't have many other options ;) so i am ok with that.

manup commented 7 years ago

Hehe ok, and yes pipe should work that way, haven't tried it though.

Meanwhile I have one IKEA bulb passed out too in my kitchen, it can be turned on off via power but isn't visible nor comes back to ZigBee network. It did run well over a month before that. Will do some sniffing before reset, hopefully I see something helpful.

donnib commented 7 years ago

@manup great or it's good if you can reproduce problems on your end so that you can troubleshoot it better. I'll get back with status when i know more.

manup commented 7 years ago

Meanwhile I have one IKEA bulb passed out too in my kitchen, it can be turned on off via power but isn't visible nor comes back to ZigBee network. It did run well over a month before that. Will do some sniffing before reset, hopefully I see something helpful.

Update With the sniffer I found the lost light. It's on the same channel but different ZigBee network, not a random one — the network is from my office and belongs to another gateway to which the light was paired to, two months ago.

Somehow the light turned back to the old configuration, so it seems that the lights reset to factory new function doesn't work cleanly.

donnib commented 7 years ago

@manup ok so that must be another issue which is not related to this right ?

So with 2.04.81 running since last night i can see this morning one of the IKEA lights went red. I tried 0 on the keyboard with the node selected but still red. I had the sniffer running but for some reason it seems stuck, with a messagebox empty and spinning in the corner and nothing comes in. Have you seen this ?

image 1

I see a bunch of this in the terminal from the BitCatcher :

Convertation error null on 647775 0x61 0x88 0x9A 0x5C 0x2B 0xF5 0x0F 0x38 0x4E 0x48 0x02 0xF5 0x0F 0x00 0x00 0x09 0x05 0x28 0x48 0xC8 0x1E 0x00 0x76 0x29 0x2A 0xFE 0xFF 0x57 0x0B 0x00 0x00 0x7E 0x2E 0x01 0xB7 0xE5 0xCF 0xC7 0xF2 0x84 0xDA 0x34 0x61 
07:13:46,949 [ERROR] - Convertation error null on 647775 0x61 0x88 0x9A 0x5C 0x2B 0xF5 0x0F 0x38 0x4E 0x48 0x02 0xF5 0x0F 0x00 0x00 0x09 0x05 0x28 0x48 0xC8 0x1E 0x00 0x76 0x29 0x2A 0xFE 0xFF 0x57 0x0B 0x00 0x00 0x7E 0x2E 0x01 0xB7 0xE5 0xCF 0xC7 0xF2 0x84 0xDA 0x34 0x61 
java.lang.NullPointerException
Convertation error null on 647777 0x61 0x88 0x55 0x5C 0x2B 0x00 0x00 0x30 0xAE 0x48 0x02 0x00 0x00 0x25 0x25 0x1D 0xEE 0x28 0x9D 0x18 0x09 0x00 0x5A 0x6B 0x44 0xFE 0xFF 0x57 0x0B 0x00 0x00 0x33 0xE5 0x91 0x0D 0x0A 0x80 0x82 0x8B 0xCB 0x97 0xE6 0xEC 0xFE 0xB2 0x3A 0x4B 0x93 0x12 0xC7 0xF9 0x54 0xFF 0xD3 0x70 
07:13:46,976 [ERROR] - Convertation error null on 647777 0x61 0x88 0x55 0x5C 0x2B 0x00 0x00 0x30 0xAE 0x48 0x02 0x00 0x00 0x25 0x25 0x1D 0xEE 0x28 0x9D 0x18 0x09 0x00 0x5A 0x6B 0x44 0xFE 0xFF 0x57 0x0B 0x00 0x00 0x33 0xE5 0x91 0x0D 0x0A 0x80 0x82 0x8B 0xCB 0x97 0xE6 0xEC 0xFE 0xB2 0x3A 0x4B 0x93 0x12 0xC7 0xF9 0x54 0xFF 0xD3 0x70 
java.lang.NullPointerException
Convertation error null on 647842 0x61 0x88 0x68 0x5C 0x2B 0x00 0x00 0x9A 0xEF 0x48 0x02 0x00 0x00 0x76 0xEC 0x1D 0xB2 0x28 0xF4 0x46 0x1F 0x00 0xC8 0xDB 0x37 0xFE 0xFF 0x2E 0x21 0x00 0x00 0x2A 0x9D 0x6C 0xCD 0x80 0x5F 0x17 0x97 0x8F 0x19 0x34 0xB3 0x51 0x54 0xFF 0x00 0x5C 0x35 0x83 
07:13:57,055 [ERROR] - Convertation error null on 647842 0x61 0x88 0x68 0x5C 0x2B 0x00 0x00 0x9A 0xEF 0x48 0x02 0x00 0x00 0x76 0xEC 0x1D 0xB2 0x28 0xF4 0x46 0x1F 0x00 0xC8 0xDB 0x37 0xFE 0xFF 0x2E 0x21 0x00 0x00 0x2A 0x9D 0x6C 0xCD 0x80 0x5F 0x17 0x97 0x8F 0x19 0x34 0xB3 0x51 0x54 0xFF 0x00 0x5C 0x35 0x83 
java.lang.NullPointerException
Convertation error null on 647846 0x61 0x00 0x48 0xF4 0x7B 0x5C 0xB0 0x00 0x00 0x48 0x02 0x76 0xEC 0x00 0x00 0x0A 0x16 0x28 0xBB 0xD4 0x90 0x00 0x84 0x02 0x01 0xFF 0xFF 0x2E 0x21 0x00 0x00 0x4D 0x35 0xA9 0x53 0x7A 0xEE 0x86 0xD2 0x2F 0x80 0x9D 0xCE 
07:13:57,131 [ERROR] - Convertation error null on 647846 0x61 0x00 0x48 0xF4 0x7B 0x5C 0xB0 0x00 0x00 0x48 0x02 0x76 0xEC 0x00 0x00 0x0A 0x16 0x28 0xBB 0xD4 0x90 0x00 0x84 0x02 0x01 0xFF 0xFF 0x2E 0x21 0x00 0x00 0x4D 0x35 0xA9 0x53 0x7A 0xEE 0x86 0xD2 0x2F 0x80 0x9D 0xCE 
java.lang.NullPointerException
Convertation error null on 647855 0x61 0x88 0x67 0x5C 0x2B 0xDF 0xF0 0xF0 0x60 0x48 0x02 0xDF 0xF0 0x00 0x00 0x09 0x17 0x28 0x94 0x46 0x17 0xFF 0x2E 0x21 0x00 0x8E 0xC2 0x2F 0xFE 0xFF 0x57 0x0B 0x00 0x28 0x93 0x46 0x17 0x00 0x32 0x3C 0x31 0xFE 0xFF 
07:13:57,238 [ERROR] - Convertation error null on 647855 0x61 0x88 0x67 0x5C 0x2B 0xDF 0xF0 0xF0 0x60 0x48 0x02 0xDF 0xF0 0x00 0x00 0x09 0x17 0x28 0x94 0x46 0x17 0xFF 0x2E 0x21 0x00 0x8E 0xC2 0x2F 0xFE 0xFF 0x57 0x0B 0x00 0x28 0x93 0x46 0x17 0x00 0x32 0x3C 0x31 0xFE 0xFF 
java.lang.NullPointerException
Convertation error null on 647859 0x02 0x00 0x4A 
07:13:57,267 [ERROR] - Convertation error null on 647859 0x02 0x00 0x4A 
java.lang.NullPointerException

I am not sure the log from BitCatcher is any use since i don't know exactly when it died but here it is anyway maybe we are lucky it caught the problem :

log_red_node.dcf.zip

Here is my network where you can see the red node. In total there are 57 nodes where one is red this morning.

image 2

Here is the node zoomed in so you can see the mac address :

image 3

And since i was running deConz with debug here is the log file from it :

deconz_log.tar.gz

But above red node is something that i have not seen some time maybe because the network has not been running. I am not sure this is related therefore here it is.

donnib commented 7 years ago

So i think after quite some testing i have some conclusions. version 2.04.82 works much better in my setup, i still see these occasionally (every day at some time) :

Also i think that the que can fill up if the user turns power off the light then if i send requests to the light while deConz still thinks the light is reachable.

donnib commented 7 years ago

@manup i am not sure you have seen my above post but here is another update :

The ques fill up very often when light do power cycle and the system doesn't ever come up again in a well running state, this happens to me very often since my wife uses the power switches during the day. Is there anything we can do to that ?

mariusmotea commented 7 years ago

I start to have more often issues with IKEA devices. Yesterday night i power off the raspbee module because both green and red leds remain stuck on and reboot of the raspberry did not solve the issue. In this morning i have big delays again. In my case for sure wifi interferences are not a problem because mine run on channel 3, my neighbors use ch 1, 4 and 6 and deconz is setup on channel 25. When i press switches buttons i have no events on websocket, when i turn on the lights via rest api sometimes is working with delay or don't turn the light on/off at all. Where i have multiple lights in one room in some cases one bulb don't change its state, white other do. Currently i decommission some classic wall switches and my wife start to be very upset because she cannot control the lights. After fresh start of deconz (sometimes raspbee require power off cycle) it run ok for some time so i don't believe the issue is with my ikea devices but more likely with deconz or raspbee firmware. I run the build .82 Marius.

donnib commented 7 years ago

@mariusmotea Yeah exact same problems i have. It's just hard to put my finger on when things start to derail but i have a feeling that power cycle does all sort of weird things which are not expected also the whole mesh network get's into problems when somebody turns off 10 nodes at the same time.

manup commented 7 years ago

I run the build .82

Please try with more recent version (currently 2.04.85) there were some improvements regard these issues.

There is also a new firmware in development which will supervise itself so it won't get stuck and reboot itself in worst case scenarios.

It's just hard to put my finger on when things start to derail but i have a feeling that power cycle does all sort of weird things which are not expected also the whole mesh network get's into problems when somebody turns off 10 nodes at the same time.

This can indeed be tricky but is expected behavior, in ZigBee it takes a while to heal the mesh: powering off many devices at one will cause many route errors which need to be resolved; new routes need to be established and many commands will run into a timeout ... all of that takes time especially when many nodes are powered off at once. Group commands should be fine though, since they don't require routing.

For this kind of setup, there many lights will be turned on/off, it's best to use smart switches, so the mesh stays intact all the time.

ebaauw commented 7 years ago

Group commands should be fine though, since they don't require routing.

On .84 I still have the occasional (once every two days or so) lockup, where the gateway doesn't seem to be able to send any ZigBee message. I don't power off any light. I almost exclusively use group commands (recalling scenes) from my gateway rules. These fail as well. The Hue dimmer switches continue to work interacting directly with groups. Even my Hue motion sensors still work - the CLIPGenericFlag on the gateways is updated and my music turns on (through HomeKit automation), but the lights don't. The Motion sensor blinks red, though, I assume because is doesn't receive an ack for the report attributes command. It would seem that the gateway is still able to receive ZigBee messages, but doesn't sent any. The only remedy is closing deCONZ, resetting the RaspBee (through sudo GCFFlasher_internal -r - @mariusmotea: no need to power off the Raspberry Pi), starting deCONZ.

manup commented 7 years ago

I have a presumption what is causing the blocked tx queue, which is currently under investigation.

The next version 2.04.86 will do a leave/rejoin automatically when the queue is locked for some time (~ 40 requests didn't get through) when --dbg-aps=1 is enabled the debug logs should show many APS-DATA.request id: X, status: BUSY (counter: Y) messages indicating a locked tx queue.

The leave/join should reset all device network queues and "workaround" the problem, you can try this manually when it happens the next time instead of sudo GCFFlasher_internal -r ... just a assumption couldn't test it myself yet.

However I guess the main reason for the locked queue is that it is full with pending requests to sleeping end-devices, these requests can live for hours before a timeout will be reported. I need to add more debugging options to verify the actual device queue state and destination addresses. If confirmed the situation can be improved a lot by just not keeping the request for that long or separating them into a extra larger queue for sleeping devices which won't lockup requests to mains powered devices.

mariusmotea commented 7 years ago

I upgrade to .84 but deconz don't ask to upgrade raspbee firmware. I will wait to see how it behave. My setup is made with 6 always on bulbs, 2 that can still be powered off from classic switch, 2 motion sensors, 9 remotes and one dimmer, all IKEA.

manup commented 7 years ago

Should be fine latest RaspBee firmware is 0x26190500 its was included a few releases ago.

My setup is made with 6 always on bulbs, 2 that can still be powered off from classic switch, 2 motion sensors, 9 remotes and one dimmer, all IKEA.

Perfect candidate for the above symptoms :) hope to get it resolved soon

donnib commented 7 years ago

The leave/join should reset all device network queues and "workaround" the problem, you can try this manually when it happens the next time instead of sudo GCFFlasher_internal -r ... just a assumption couldn't test it myself yet.

@manup i tried twice when i could not control the lights to Leave then Join and that did the trick. I am not sure this helps you in any way to remedy the real problem.

ebaauw commented 7 years ago

The leave/join should reset all device network queues and "workaround" the problem, you can try this manually when it happens the next time instead of sudo GCFFlasher_internal -r ... just a assumption couldn't test it myself yet.

Indeed, leave/join seems to do the trick. Now on .85, but still the occasional lockup.

The next version 2.04.86 will do a leave/rejoin automatically when the queue is locked for some time (~ 40 requests didn't get through) when --dbg-aps=1 is enabled the debug logs should show many APS-DATA.request id: X, status: BUSY (counter: Y) messages indicating a locked tx queue.

Looking forward to that version. In the meantime, is there a way to issue leave/join through the REST API or otherwise script it? I'll enable --dbg-aps=1 and see if that provides some additional insight.

manup commented 7 years ago

Looking forward to that version. In the meantime, is there a way to issue leave/join through the REST API or otherwise script it?

Albeit quite hacky, doing a touchlink scan via REST API will also leave/join the network, so it should cause the same effect. .86 will be available later today too.

ebaauw commented 7 years ago

Albeit quite hacky

;-)

doing a touchlink scan via REST API will also leave/join the network, so it should cause the same effect

Indeed, doing a ph_post /touchlink/scan does the trick.

The request returns an empty response body where normally a POST would return a success body (with the ID of the newly created resource). This caused my ph.sh script to report an error, but I changed that to check the HTTP/1.1 status in the response header instead of depending on a non-empty response body.

ebaauw commented 6 years ago

the debug logs should show many APS-DATA.request id: X, status: BUSY (counter: Y) messages indicating a locked tx queue

Indeed:

12:37:06:994 Poll APS request 225 to 0x00178801103C9D0A cluster: 0x0006
12:37:07:013 APS-DATA.request id: 225, addrmode: 0x03, addr: 0x00178801103c9d0a, profile: 0x0104, cluster: 0x0006, ep: 0x0B queue: 0 len: 5
12:37:07:049 APS-DATA.request id: 225, status: BUSY (counter: 2)
12:37:07:049 APS-DATA.request id: 225, set state: 0x05
12:37:07:049 emit artificial APSDE-DATA.confirm id: 225
12:37:07:049 Poll APS confirm 225 status: 0xAE
12:37:07:049 error APSDE-DATA.confirm: 0xAE on task
12:37:07:049 APS-DATA.request id: 225, failed-status: APP_BUSY (0x02) 
ebaauw commented 6 years ago

v2.04.86 seems to do the trick. Haven't seen any lockup yet. Log confirms it leaves/joins the network when stuck:

08:07:34:494 read attributes of 0x84182600000C7516 cluster: 0x0006: [ 08:07:34:494 0x0000 08:07:34:494 ]
08:07:34:494 add task 1308808 type 19 to 0x84182600000C7516 cluster 0x0006 req.id 196
08:07:34:494 Poll APS request 196 to 0x84182600000C7516 cluster: 0x0006
08:07:34:578 APS-DATA.request id: 196, addrmode: 0x03, addr: 0x84182600000c7516, profile: 0x0104, cluster: 0x0006, ep: 0x03 queue: 6 len: 5
08:07:34:613 APS-DATA.request id: 196, status: BUSY (counter: 51)
08:07:34:614 APS TX not working, force leave/join network to reset stack
manup commented 6 years ago

v2.04.86 seems to do the trick. Haven't seen any lockup yet. Log confirms it leaves/joins the network when stuck:

Glad to hear that, the cause of the problem should be solved in the next firmware versions too.

mariusmotea commented 6 years ago

Yesterday i update to .86 from .84. In this morning i left the house and come just 30 minutes ago. When i entered the motion sensor did not trigger the entrance light. I tried to turn on the lights in living room, from 2 bulbs only one was turned on with a 2 seconds delay. On bathroom none of the lights where turned on when i press the ON button on Tradfri remote. I check my HueEmulator logs and there where no websocket events from motion sensors or from bathroom remote. Turning the lights off and on partially fixed the issue, but still some lights did not turn on, or respond with big delay. For me .84 was a good improvement, i don't remember to have any issues, but with .86 i don't know if i was unlucky or is not that stable.

Update: not sure if this helps, with i need to pair my tradfri wireless dimmer again because it did not work after deconz restart.

manup commented 6 years ago

Update: not sure if this helps, with i need to pair my tradfri wireless dimmer again because it did not work after deconz restart.

Often it also helps to pull out the battery for a few seconds and put it in again. The sensor will then broadcast a announcement message after power-up.

ebaauw commented 6 years ago

v2.04.86 seems to do the trick. Haven't seen any lockup yet. Log confirms it leaves/joins the network when stuck:

Glad to hear that, the cause of the problem should be solved in the next firmware versions too.

Haven't seen any lockups since, now on v2.04.94, but still the occasional leave/join. After that, the name of some nodes in the GUI is lost: the RaspBee shows the NWK Address instead, and the ubisys dimmer shows XBee. Also, the first report from my Xioami door sensors is typically missed after the reconnect.