esphome / issues

Issue Tracker for ESPHome
https://esphome.io/
288 stars 34 forks source link

Ikea Vindriktning pm1006 sensor reboot issues #2715

Closed polleke69 closed 2 years ago

polleke69 commented 2 years ago

The problem

Hello, I am using 2 ikea vindriktning pm1006 sensors with esphome. The problem is that both units keep rebooting on a regular basis. In a period of about 8 days both units rebooted 20 (!) times.

Both are running the exact same esphome config. I tried disabling: logger, webserver, mDNS, AP, captive_portal but the issue remains.

Already hooked op 1 of the units on a serial line, but i didnt find any strange things in the logs...

Tried various different power adapters, no change.

At the moment 1 unit is running with the pm1006 sensor enabled, the other one without pm1006 sensor enabled. So far only the unit with the sensor enabled has rebooted 5 times...

To me this looks like a "bug" in the pm1006 sensor code...

Which version of ESPHome has the issue?

2021.x

What type of installation are you using?

pip

Which version of Home Assistant has the issue?

2021.x

What platform are you using?

ESP8266

Board

d1_mini

Component causing the issue

pm1006

Example YAML snippet

# Ikea Vindriktning PM2.5 sensor
uart:
  rx_pin: D2
  baud_rate: 9600

- platform: pm1006
    pm_2_5:
      name: "${devicename} PM25"
      accuracy_decimals: 2
      filters:
        - sliding_window_moving_average:
            window_size: 10
            send_every: 10
            send_first_at: 10

Anything in the logs that might be useful for us?

No response

Additional information

No response

probot-esphome[bot] commented 2 years ago

Hey there @habbie, mind taking a look at this issue as it has been labeled with an integration (pm1006) you are listed as a code owner for? Thanks! (message by CodeOwnersMention)

polleke69 commented 2 years ago

2021-11-18 10_39_14

mmakaay commented 2 years ago

I modded my Vindriktning, and I will monitor if I see the same behavior.

Habbie commented 2 years ago

Which version of ESPHome has the issue?

2021.x

Can you be more specific than this?

polleke69 commented 2 years ago

At the moment i am running 2021.10.3, but all previous versions had the same issue (i bought the vindriktnings in september...) If i remember correctly the 2020 esphome versions had the pm1006 integrated. So lets go from there...

mmakaay commented 2 years ago

If I'm correct, your devices reboot about every 10 hours. My device is running with an uptime sensor now, so I'll check tomorrow to see if there have been any reboots. In case there are, the most common issue would be a memory leak in the firmware.

polleke69 commented 2 years ago

Well, it seems to be more eratic:

2021-11-18 18_48_27-Window

2021-11-18 18_50_10-Window

mmakaay commented 2 years ago

I agree, that doesn't look regular enough for a memory leak issue.

polleke69 commented 2 years ago

If it helps, some logging from serial:

[01:24:58] [01:24:58] ets Jan 8 2013,rst cause:4, boot mode:(3,7) [01:24:58] [01:24:58]wdt reset [01:24:58]load 0x4010f000, len 3584, room 16 [01:24:58]tail 0 [01:24:58]chksum 0xb0 [01:24:58]csum 0xb0 [01:24:58]v2843a5ac [01:24:58]~ld [01:24:58][I][logger:174]: Log initialized

Habbie commented 2 years ago

This is a bit of a random guess, but all non-crashing configs I saw before did not have the filter. Can you try running one of them with pm1006 but without the filter?

mmakaay commented 2 years ago

I've got only one running here, and I added the filter to be sure to have the same setup as @polleke69. So far I haven't seen any crashes. 11 hours and counting.

Habbie commented 2 years ago

The crashing one - does it report reliably while it is running? Looking at the existing code (I have a refactor sitting in a branch that I should really finish) I wonder how well it would deal with long strings of garbage. It should do fine, but it's not the greatest code.

mmakaay commented 2 years ago

Oh, I missed your serial output. In there, I see rst cause:4. That would mean that the WDT (watch dog timer) did reset the device AFAIK. And that could be caused by some code locking up the device. I don't see what part of the code that would be, so I'll keep running my setup.

To be sure: can you try another USB power supply, to rule out that the device isn't getting enough power right now? Underpowered devices can run into all kinds of issues.

Habbie commented 2 years ago

I did not look up cause 4 but I also spotted wdt, which is why I wonder about garbage strings :)

mmakaay commented 2 years ago

That would have to be really large garbage strings to make those lock up your loop implementation. Since the check for available() bytes on the UART would have to be true until the WDT is triggered. From what I see on my device using the UART debugger, it's small chunks once in a while though.

mmakaay commented 2 years ago

The reader could focus a bit more on finding 0x16, 0x11, 0x0B in the serial data to find the start of the message and work from there (that's what's in your new branch?) That would re-align the reader with the output when the reader gets out of sync with the data from the sensor. But even when that goes wrong somehow because of garbage data, worst that could happen (if I read your code correctly, but I must admit it's beyond bed time for me) is missing data and log output filled with CRC mismatches.

Habbie commented 2 years ago

Same time here I suspect! The loop code -should- exit because check_byte_ triggers on the max message length as well, and I'd expect the watchdog to be happy then.

MallocArray commented 2 years ago

Do you happen to be updating any code using OTA? My device hangs when using UART and attempting to OTA update and then it will go into a reboot loop for quite awhile after that. If I flash it with a USB cable, it seems stable.

I don't think I'm seeing the same number of reboots that you are, but if you are looking for things to try, you might try removing ota:

polleke69 commented 2 years ago

A lot of suggestions ;-)

Tonight/this weekend i will give it a go without the filter and separately without ota: and keep u posted.

Normally i use ota: without any issues, but i also tried full erasing and flashing via serial with no difference in the reboot behaviour.

I never saw any issues with reporting of the units when pm1006 is active.

I already exchanged about 5 (different) power adapters, so i would say that rules out any power issues.

mmakaay commented 2 years ago

Removing ota: shouldn't be needed. THere can be ota issues, when software UART is used. My guess is that the interrupt handling for incoming data is interfering with the network stack, causing an onging OTA update to fail Normally, a few retries fixes this for me. That's why I created another ticket in which I propose to make sure that hardware UART is used for this device.

I'm afraid my device is still running, so I haven't been able to reproduce this yet.

polleke69 commented 2 years ago

Enabled pm1006 without filter on vindriktning1:

2021-11-20 10_01_38-Overview - Home Assistant - Brave

And rebooting again...

mmakaay commented 2 years ago

And guess what? Another day with no reboots here.

Can you maybe try soldering the Rx D2 on D7 instead and setting (next to the D2 to D7 pin change):

logger:
  baud_rate: 0

That should give you hardware uart instead of software uart. That gives us one more component we can rule out (or in) as the culprit.

polleke69 commented 2 years ago

It seems the one running WITHOUT the filter is "a bit" more stable

2021-11-21 22_34_14-Window

mmakaay commented 2 years ago

So both are unstable ;-)

polleke69 commented 2 years ago

Right!

Next thing to try will be the hardware uart change... (somewhere this week)

polleke69 commented 2 years ago

Currently running 1 vindriktning on D2 and 1 running on D7 (hardware uart) both with filter.

Fingers crossed...

logger: baud_rate: 0

polleke69 commented 2 years ago

almost 2 days and counting...

2021-11-24 08_04_33

mmakaay commented 2 years ago

And that is the D7 one I presume?

polleke69 commented 2 years ago

vindriktning1 = D7 vindriktning2 = D2

polleke69 commented 2 years ago

Now the question remaining is: why does the software uart causes these reboots?

mmakaay commented 2 years ago

Yes, especially because I have software UART myself and my two devices are running without hickups :-/ Could always be a hardware issue of course, so it would be useful to know if others are experiencing this behavior too.

polleke69 commented 2 years ago

what brand of esp's are u using?

mmakaay commented 2 years ago

Some D1 Mini clone from China, not branded.

polleke69 commented 2 years ago

Hmm, same here...

I think i'm just gonna change the other one as well and leave it at that...

2021-11-25 08_46_46

mmakaay commented 2 years ago

Cool that it is stable this way. Uncool that we don't know why you get reboot. The balance tips to "cool" 😎👍

polleke69 commented 2 years ago

Yeah, but for the uncool part it would be nice to know why indeed...

mmakaay commented 2 years ago

What would be helpful in this case, is to get some insight in what the device logs during the reboots. If you would like to check this out, you could do the following:

polleke69 commented 2 years ago

If u scroll back a couple of posts, i posted some logging there. (if u need more, lemme know)

mmakaay commented 2 years ago

Yeah, that showed the WDT reset, so some component locking up the loop. So that was a start. But when this happens, there should also be a stack trace right before that reboot message. When using the esphome command line tool for viewing the serial logging, it will decode the stack trace into code files + line numbers for you. When using an other tool, you'd get a list of hex addresses that would need manual decoding (for which you'd also nee the firmware.elf file that is created during the build). Optimal result would be to get a hint about where exactly in the code the issue occurs.

Another thing that might be interesting, is to enable the UART debugger on the software UART, so see if the data comes in as expected, or if there's garbage on the line there. You could extend the uart config for vindrinktning2 as follows for this:

# Ikea Vindriktning PM2.5 sensor
uart:
  rx_pin: D2
  baud_rate: 9600
  debug:
    direction: RX
    after:
      bytes: 60
    sequence:
      - lambda: UARTDebug::log_hex(direction, bytes, ':');
polleke69 commented 2 years ago

This was all...

[05:37:59][D][pm1006:091]: Got PM2.5 Concentration: 326 µg/m³ [05:37:59][D][uart_debug:109]: <<< 16:11:0B:00:00:01:46:00:00:07:3A:00:00:01:7B:02:00:00:44:84 [05:38:01][D][sensor:113]: 'vindriktning2 WiFi Signal': Sending state -65.00000 dBm with 0 decimals of accuracy [05:38:01][D][pm1006:091]: Got PM2.5 Concentration: 327 µg/m³ [05:38:01][D][uart_debug:109]: <<< 16:11:0B:00:00:01:47:00:00:07:3B:00:00:01:7C:02:00:00:44:81 [05:38:12] [05:38:12] ets Jan 8 2013,rst cause:4, boot mode:(3,6) [05:38:12] [05:38:12]wdt reset [05:38:12]load 0x4010f000, len 3584, room 16 [05:38:12]tail 0 [05:38:12]chksum 0xb0 [05:38:12]csum 0xb0 [05:38:12]v2843a5ac [05:38:12]~ld [05:38:12][I][logger:214]: Log initialized [05:38:12][C][ota:452]: There have been 0 suspected unsuccessful boot attempts. [05:38:12][I][app:029]: Running through setup()... [05:38:12][C][uart.arduino_esp8266:052]: Setting up UART bus... [05:38:12][D][text_sensor:067]: 'vindriktning2 ESPHome': Sending state '2021.11.1 Nov 25 2021, 17:15:43'

mmakaay commented 2 years ago

Wow, that is quite silent indeed. The UART data looks okay. These are logged after the accompanying "Got PM2.5..." messages, because the debugger only logs the data after seeing no new data come in for a little while.

The data come in about once per minute looking at that output, so the next dat would arrive around 05:39. Way before that, the device reboots. So that makes it even more confusing, in case it is related so software UART, since at the time of the reboot, there is most likely no serial data coming in.

You can change bytes: 60 to bytes: 1, which will cause the debugger to log every UART byte directly on a single log line. That way we would know for sure if there's really no data coming in, that is possibly tripping the device.

polleke69 commented 2 years ago

Not a lot of usefull info i guess...

[11:13:35][D][pm1006:091]: Got PM2.5 Concentration: 421 µg/m³ [11:13:37][D][uart_debug:109]: <<< 16 [11:13:37][D][uart_debug:109]: <<< 11 [11:13:37][D][uart_debug:109]: <<< 0B [11:13:37][D][uart_debug:109]: <<< 00 [11:13:37][D][uart_debug:109]: <<< 00 [11:13:37][D][uart_debug:109]: <<< 01 [11:13:37][D][uart_debug:109]: <<< A5 [11:13:37][D][uart_debug:109]: <<< 00 [11:13:37][D][uart_debug:109]: <<< 00 [11:13:37][D][uart_debug:109]: <<< 08 [11:13:37][D][uart_debug:109]: <<< 6B [11:13:37][D][uart_debug:109]: <<< 00 [11:13:37][D][uart_debug:109]: <<< 00 [11:13:37][D][uart_debug:109]: <<< 01 [11:13:37][D][uart_debug:109]: <<< D5 [11:13:37][D][uart_debug:109]: <<< 02 [11:13:37][D][uart_debug:109]: <<< 00 [11:13:37][D][uart_debug:109]: <<< 00 [11:13:37][D][uart_debug:109]: <<< 44 [11:13:37][D][uart_debug:109]: <<< 99 [11:13:37][D][pm1006:091]: Got PM2.5 Concentration: 421 µg/m³ [11:13:37][D][sensor:113]: 'vindriktning2 PM25': Sending state 418.89999 µg/m³ with 2 decimals of accuracy [11:13:58][D][uart_debug:109]: <<< 16 [11:13:58][D][uart_debug:109]: <<< 11 [11:13:58][D][uart_debug:109]: <<< 0B [11:13:58][D][uart_debug:109]: <<< 00 [11:13:58][D][uart_debug:109]: <<< 00 [11:13:58][D][uart_debug:109]: <<< 01 [11:13:58][D][uart_debug:109]: <<< AB [11:13:58][D][uart_debug:109]: <<< 00 [11:13:58][D][uart_debug:109]: <<< 00 [11:13:58][D][uart_debug:109]: <<< 08 [11:13:58][D][uart_debug:109]: <<< 7B [11:13:58][D][uart_debug:109]: <<< 00 [11:13:58][D][uart_debug:109]: <<< 00 [11:13:58][D][uart_debug:109]: <<< 01 [11:13:58][D][uart_debug:109]: <<< DB [11:13:58][D][uart_debug:109]: <<< 02 [11:13:58][D][uart_debug:109]: <<< 00 [11:13:58][D][uart_debug:109]: <<< 00 [11:13:58][D][uart_debug:109]: <<< 44 [11:13:58][D][uart_debug:109]: <<< 7D [11:13:58][D][pm1006:091]: Got PM2.5 Concentration: 427 µg/m³ [11:14:08] [11:14:08] ets Jan 8 2013,rst cause:4, boot mode:(3,6) [11:14:08] [11:14:08]wdt reset [11:14:08]load 0x4010f000, len 3584, room 16 [11:14:08]tail 0 [11:14:08]chksum 0xb0 [11:14:08]csum 0xb0 [11:14:08]v2843a5ac [11:14:08]~ld [11:14:08][I][logger:214]: Log initialized [11:14:08][C][ota:452]: There have been 0 suspected unsuccessful boot attempts. [11:14:08][I][app:029]: Running through setup()... [11:14:08][C][uart.arduino_esp8266:052]: Setting up UART bus... [11:14:08][D][text_sensor:067]: 'vindriktning2 ESPHome': Sending state '2021.11.1 Nov 27 2021, 09:10:48' [11:14:08][D][binary_sensor:034]: 'vindriktning2 status': Sending initial state OFF [11:14:08][C][wifi:037]: Setting up WiFi... [11:14:08][I][wifi:245]: WiFi Connecting to 'crap2'... [11:14:08][D][sensor:113]: 'vindriktning2 Uptime': Sending state 0.11800 s with 0 decimals of accuracy [11:14:08][D][uart_debug:109]: <<< 16 [11:14:08][D][uart_debug:109]: <<< 11 [11:14:08][D][uart_debug:109]: <<< 0B [11:14:08][D][uart_debug:109]: <<< 00 [11:14:08][D][uart_debug:109]: <<< 00 [11:14:08][D][uart_debug:109]: <<< 01 [11:14:08][D][uart_debug:109]: <<< AF [11:14:08][D][uart_debug:109]: <<< 00 [11:14:08][D][uart_debug:109]: <<< 00 [11:14:08][D][uart_debug:109]: <<< 08 [11:14:08][D][uart_debug:109]: <<< 8B [11:14:08][D][uart_debug:109]: <<< 00 [11:14:08][D][uart_debug:109]: <<< 00 [11:14:08][D][uart_debug:109]: <<< 01 [11:14:08][D][uart_debug:109]: <<< DF [11:14:08][D][uart_debug:109]: <<< 02 [11:14:08][D][uart_debug:109]: <<< 00 [11:14:08][D][uart_debug:109]: <<< 00 [11:14:08][D][uart_debug:109]: <<< 44 [11:14:08][D][uart_debug:109]: <<< 65 [11:14:08][D][pm1006:091]: Got PM2.5 Concentration: 431 µg/m³ [11:15:17][D][uart_debug:109]: <<< 00 [11:15:17][D][uart_debug:109]: <<< 00 [11:15:17][D][uart_debug:109]: <<< 44 [11:15:17][D][uart_debug:109]: <<< 57 [11:15:17][D][pm1006:091]: Got PM2.5 Concentration: 486 µg/m³

mmakaay commented 2 years ago

Well, useful in the sense that we now know that it's not incoming serial garbage data that is locking up the device. but you're right, this only thickens the mystery in a way ;)

webeling67 commented 2 years ago

Maybe it is not a software bug? The German magazine Make has connected a D1 mini with the IKEA board using two resistors (3k/1k) to take care of the voltage differences between the IKEA sensor voltage (4,5V) and the specified D1 Mini voltage of 3,3 V. The wiring in the HA community is directly connecting without a voltage adjustment. This would lead to higher than specified voltages at the serial input pin of the D1 and may cause the unstable UAT.

polleke69 commented 2 years ago

Got that, but how does that explain that connecting the sensor output to D7 (hardware uart) doesnt cause reboots?

As far as i know that would also operate at 3.3v then...

mmakaay commented 2 years ago

Also, the ESP8266 is supposedly 5V tolerant. Never found a 100% sure report on that, but it does seem true based on my experience. Also: my own Vindriktning is running on D2 with software UART, and that one doesn't show any issues at all. That doesn't support a general issue case either.

Conundrum I say.

MallocArray commented 2 years ago

Other than it rebooting sometimes, does it interrupt the reporting of data to HomeAssistant? I have the same configuration with a unbranded D1 mini with the same filter only I'm sending it every 7 on each parameter. I can't really say if it is rebooting as I haven't watched it that close, but I get consistent readings in HA so even if it was rebooting, it doesn't seem to be impacting my desired output.

polleke69 commented 2 years ago

I've got the filter at 30s so i don't miss a whole lot of data (probably nothing).

I'm just gonna connect to D7 and leave it at that...

mmakaay commented 2 years ago

Maybe that's the way to go. At least I am out of options to come up with tests to pinpoint the actual issue here. I'm willing to go with "something hardware yada yada", and be happy with the D7 solution for it. So feel free to close this issue when you reached the happy puppy stadium.