eelcohn / nRF905-API

API webinterface for the nRF905
MIT License

v1 crashes after a few hours #14

Open freakshock88 opened 3 years ago

freakshock88 commented 3 years ago

Since upgrading to v1, my NodeMCU ESP8266 crashes after running the API for a few hours. The blue LED also turns off.

I can remove power and restore it, then the NodeMCU boots and the API works again for a few hours.

eelcohn commented 3 years ago

I have now run into the problem myself with my setup here. I don't yet know exactly what causes it, but a few possibilities are:

What may help is finding out whether anything unusual shows up on the serial port before things crash. I have now connected the nRF905 via a USB cable to a spare Raspberry Pi I had lying around, and I have a serial monitor running on it.

DevSecNinja commented 3 years ago

I have the same issue and I used a serial monitor with HA to get the output of it. Just before it crashed, it returned:

fan_main_unit_id=48

image

If you want to create the serial monitor sensor as well, attach the USB cable to the Home Assistant host and use the following config (make sure to double check the serial port):

- platform: serial
  serial_port: /dev/serial/by-id/usb-Silicon_Labs_CP2102_USB_to_UART_Bridge_Controller_0001-if00-port0
  name: "Ventilation-RC Serial"
  baudrate: 115200

didiermiller commented 3 years ago

I have the same issue with a Wemos D1 Mini (ESP8266), it disconnected after 19 hours. I cannot use the USB cable trick described above though.

freakshock88 commented 3 years ago

Were you able to make any progress with the logging @eelcohn ?

chrisjstevo commented 3 years ago

I have the same problem. Dies after 12-15 hrs of use. Turning off and on the power doesn't help. I have to remove the USB connector and re-insert.

proditaki commented 3 years ago

I have the same issue with my ESP32 Dev board. Around 12 hours indeed.

update: Now it doesn't connect to the fan anymore after the second time it crashes.

Seems like it lost the fan settings, had to reinitialise the fan connection.

chrisjstevo commented 3 years ago

I'm happy to have a go at adding some serial port logging to see if it's something easy; the C++ is not very complicated. The issue that I have (as well as it bombing after 12 hours) is that the chip I bought is a PTR8000 (https://www.amazon.nl/gp/product/B08CCP13HB) and it only has a 2 meter range, which would mean I'd have to put the computer in the loft.

Can anyone recommend a chip that works over 20 meters or so that's available in NL? If so I'll buy that and debug what's going on.

abmaonline commented 3 years ago

@chrisjstevo I know eelcohn has ordered a new model that in theory actually supports the 800MHz band: https://gathering.tweakers.net/forum/list_message/64826542#64826542, but I haven't seen the results yet (most of the existing boards are built for 400MHz and 800MHz 'support' is accidental, hence the limited range).

My PTR8000 with Wemos D1 mini is running fine for more than 2 days now, but that is w/o any polling of state by Home Assistant. I only check the status manually with the api a few times a day. As soon as I turn on the REST sensor in Home Assistant (polling it every 30s I think), it stops responding after a few hours. Have to reconnect the usb connector to make it work again.

proditaki commented 3 years ago

I'm polling every minute in the script I made for Domoticz. I'll try changing it to every 15 minutes and see what that does.

Almost sounds like some kind of memory leak or something.

chrisjstevo commented 3 years ago

@abmaonline I have a smart plug that runs on a timer, I will try setting this up to restart the device every few hours and see if it increases the time till failure. If so that would point to a mem issue I agree.

didiermiller commented 3 years ago

@proditaki I have a similar issue with Home Assistant. After 12 hours the Wemos D1 mini crashes, and only a hard reset will solve it. I have changed polling interval from 30 seconds (default) to 300 seconds and I will see if that solves anything. Would be nice to get some debug logging going but in my case that is not possible for me...

Update: it crashed after 29 hours.

proditaki commented 3 years ago

It's already running for 20 hours now when polling every 15 minutes, so it seems that there's a connection to the bug there. Update: 33 hours. Update 2: multiple days now so less polling equals more uptime.

chrisjstevo commented 3 years ago

I'd agree with @proditaki, I've had the chip running for 3.5 days now. I reboot it every twelve hours using a smart plug and have seen no issues.

If this were a normal cpp app running on an OS I would look to a memory leak or heap fragmentation. Because this is running on the chip and I assume doesn't really have an OS, I'm not sure what services are offered at all by the underlying runtime. I suspect not many. I'll try and see how one would debug a memory leak on such a hardware setup.

A quick and dirty solution might be to call the reset on the chip after every n calls. Effectively doing what my plug is doing but in code.

proditaki commented 3 years ago

I might reset it every night from my Domoticz plugin for now as a failsafe. (I can share the plugin if anyone needs it for Domoticz.)

Is this API call enough for a reset? http://192.168.x.y/api/v1/systemconfig.json?reset=true

update: since I run my Domoticz on a Pi, I added this to cron.daily. Let's see if this helps to keep it running:

wget --http-user=admin --http-password=xxxxx http://192.168.x.x/api/v1/systemconfig.json?reset=true &>> /dev/null

update 2021-02-04: this has been working fine for almost a week now, so acceptable as a workaround for now.
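For setups without cron, the same nightly reset could be scripted. This is only a minimal Python sketch, assuming the reset URL from the wget command above and assuming the device accepts HTTP basic auth; the host and credentials are placeholders, and `build_reset_request` and `reset_device` are hypothetical helper names, not part of the API.

```python
import base64
import urllib.request


def build_reset_request(host, user, password):
    """Build (but do not send) the GET request that asks the device to reset."""
    url = f"http://{host}/api/v1/systemconfig.json?reset=true"
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request = urllib.request.Request(url)
    # Assumption: basic auth, the counterpart of wget's --http-user / --http-password
    request.add_header("Authorization", f"Basic {token}")
    return request


def reset_device(host, user, password, timeout=10):
    """Send the reset request and return the HTTP status code."""
    request = build_reset_request(host, user, password)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Calling `reset_device("192.168.x.x", "admin", "xxxxx")` from a scheduled job would then do roughly what the cron entry does; the timeout keeps the job from hanging if the web server has already stopped responding.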

xirixiz commented 3 years ago

Yes, same issue indeed. I solved it via hass adding a reset rest command, script and some automation to reset every hour: https://github.com/xirixiz/my-hass-config/blob/master/packages/ventilation.yaml

Next week I'll have some time to go through the API code. Not sure, but my guess is that this can be resolved to make the solution more robust.

proditaki commented 3 years ago

update: never mind, somehow the IP address changed, although I gave it a static lease... weird

~It died this morning, although I reset it every night, so something is still not right.~ ~It's responding to ping, but that's about it. The web server is not responding at all.~

~I unplugged the power and replugged it; still no response from the web server, though it responds to ping fine.~

~When I netcat to the web port... nothing~

pi@raspberrypi:~/domoticz/plugins/OTGW_Mqtt_Client $ nc 192.168.2.14 80
get

eelcohn commented 3 years ago

Thank you all for taking the effort to solve this problem 👍 Unfortunately I haven't found the time to work on a solution yet, but I agree with @proditaki that it's most likely a memory leak issue.

I'm wondering if there's a difference between systems with or without CO2 sensors. Since CO2 sensors send out data every minute or so, the nRF905API receives far more data on systems with CO2 sensors. If it's a buffer overflow issue related to received data, then theoretically the nRF905API should crash sooner on systems with CO2 sensors, and later on systems with just the RF remote controls.

xirixiz commented 3 years ago

@eelcohn my setup is without CO2-sensors 👍.

proditaki commented 3 years ago

Mine is without CO2 sensors as well. The reboot-every-night workaround is working fine for now. Except for when the IP address changed once :p

didiermiller commented 3 years ago

No CO2 sensor here as well. Reboot is working fine, took it from xirixiz' HA config :) So a workaround for now...

nicandris commented 3 years ago

I used an ESP32 NodeMCU but I don't poll for status at all. I didn't have any freezes AT ALL for weeks, no restarts needed/done or anything else. I also don't have a CO2 sensor.

freakshock88 commented 3 years ago

My setup is without CO2 sensors as well, but I do poll the status every minute.

DevSecNinja commented 3 years ago

Mine is with CO2 sensor too. It seems to crash in a few days, but have to keep an eye on the exact timelines to be sure.

chrisjstevo commented 3 years ago

I'm not using Home Assistant in my setup. I have a Python script which listens to some humidity sensors and sends a request to the module when humidity gets above a certain percentage. Before, I was polling the status from the chip each time the Python script updated the fan state. I turned this off two weeks ago and now just tell the chip the fan speed when it's needed, never asking it what the speed is.

It hasn't crashed once since.

It looks very much like the issue is with querying the device.

xirixiz commented 3 years ago

Any update?

pvossel commented 3 years ago

CO2 sensor here as well. Works for a couple of hours, after that it crashes. The reset command trick does not work. At the moment it stops working, the serial logger shows me that my ESP8266 falls back to:

"WiFi SoftAP. IP address: 192.168.4.1"

Although no SSID shows up in my WiFi list.

My ESP8266 is 3 meters away from my access point. My other ESP8266s (from other projects) don't experience disconnects.

Hopefully there will be a fix soon.

DevSecNinja commented 3 years ago

I attached my device to my server and I'm running the following to get the serial logs:

sudo cat /dev/ttyUSB0 | gawk '{ print strftime("[%Y-%m-%d %H:%M:%S]"), $0 }' | tee -a ttylog.txt

Try the following to get your TTY if it doesn't work:

dmesg | grep tty

It should show you something like this after a few seconds:

image
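On a host without gawk, the timestamping step of the pipeline above could be done in Python instead. A minimal sketch, assuming the same `/dev/ttyUSB0` device and `ttylog.txt` file names as the command above; `stamp` and `follow` are hypothetical helper names, and like `cat` this assumes the port is already configured for the right baud rate.

```python
from datetime import datetime


def stamp(line, now=None):
    """Prefix one serial line with a [YYYY-mm-dd HH:MM:SS] timestamp."""
    now = now or datetime.now()
    return f"[{now:%Y-%m-%d %H:%M:%S}] {line.rstrip()}"


def follow(device="/dev/ttyUSB0", logfile="ttylog.txt"):
    """Read lines from the serial device, echo them and append them to a log file."""
    # errors="replace" keeps the loop alive if the device emits non-UTF-8 bytes
    with open(device, "r", errors="replace") as port, open(logfile, "a") as log:
        for line in port:
            stamped = stamp(line)
            print(stamped)
            log.write(stamped + "\n")
            log.flush()  # make sure the last lines before a crash reach the file
```

Running `follow()` (with sufficient permissions to read the device) gives the same kind of timestamped log; flushing after every line matters here because the interesting output is whatever arrives just before the crash.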

DevSecNinja commented 3 years ago

I caught an exception! @eelcohn I can DM you the full stack trace if you want. I'm using API V1 with the CO2 sensor.

[2021-06-13 20:50:59] Starting transmission 4 3 2 1 - done.
[2021-06-13 20:50:59] http: /api/v1/send.json
[2021-06-13 20:50:59] Data received
[2021-06-13 20:50:59] Data received
[2021-06-13 20:50:59] Data received
[2021-06-13 20:50:59] Data received
[2021-06-13 20:50:59] Data received
[2021-06-13 20:50:59] Data received
[2021-06-13 20:50:59] Data received
[2021-06-13 20:50:59] Data received
[2021-06-13 20:50:59] Data received
[2021-06-13 20:50:59] Data received
[2021-06-13 20:50:59] Data received
[2021-06-13 20:50:59] Data received
[2021-06-13 20:50:59] Data received
[2021-06-13 20:50:59] Data received
[2021-06-13 20:50:59] Data received
[2021-06-13 20:50:59] Data received
[2021-06-13 20:50:59] Data received
[2021-06-13 20:50:59] Data received
[2021-06-13 20:50:59] Data received
[2021-06-13 20:50:59] Data received
[2021-06-13 20:50:59]
[2021-06-13 20:50:59] Exception (0):
[2021-06-13 20:50:59] epc1=0x40208e4c epc2=0x00000000 epc3=0x00000000 excvaddr=0x00000000 depc=0x00000000
[2021-06-13 20:50:59]
[2021-06-13 20:50:59] >>>stack>>>
[2021-06-13 20:50:59]
[2021-06-13 20:50:59] ctx: sys
[2021-06-13 20:50:59] sp: 3fffeb90 end: 3fffffb0 offset: 01a0

eelcohn commented 3 years ago

Nice work @DevSecNinja, thanks! Can you send the full stack trace via a Tweakers DM? (DMs aren't available in GitHub AFAIK)

pvossel commented 3 years ago

Any news regarding the caught exception?

DevSecNinja commented 3 years ago

I have sent the caught exception to @eelcohn. Let's hope the exception gives some good info. :)

MunkeyBalls commented 2 years ago

@eelcohn, any updates on a possible fix? Project is awesome btw, thanks for your work.

sebastius commented 2 years ago

Yeah, it's the receive buffer:

Data received - not stored due to buffer overflow!
Data received - not stored due to buffer overflow!
Guru Meditation Error: Core 0 panic'ed (Interrupt wdt timeout on CPU0).

Core 0 register dump:
PC : 0x4008f0c2 PS : 0x00060335 A0 : 0x8008ebdf A1 : 0x3ffbd010
A2 : 0x3ffbf1a8 A3 : 0x3ffbd02c A4 : 0x00060323 A5 : 0xb33fffff
A6 : 0x0000cdcd A7 : 0x0000abab A8 : 0x0000abab A9 : 0x0000abab
A10 : 0x00000003 A11 : 0x00060323 A12 : 0x00060320 A13 : 0x3ffcdcf0
A14 : 0x00011c5c A15 : 0x0000040b SAR : 0x0000001c EXCCAUSE: 0x00000005
EXCVADDR: 0x00000000 LBEG : 0x40084a45 LEND : 0x40084a4d LCOUNT : 0x00000027

Backtrace:0x4008f0bf:0x3ffbd0100x4008ebdc:0x3ffbd050 0x4008d2cb:0x3ffbd070 0x4008d37c:0x3ffbd0b0 0x400839c9:0x3ffbd0d0 0x40156d9f:0x3ffbd0f0 0x40134670:0x3ffbd110 0x40131709:0x3ffbd130 0x40156cbf:0x3ffbd150 0x40156cf9:0x3ffbd1a0 0x401588d5:0x3ffbd1c0 0x401557bc:0x3ffbd1e0

Core 1 register dump:
PC : 0x4008ef38 PS : 0x00060735 A0 : 0x8008e15a A1 : 0x3ffbeff0
A2 : 0x3ffb8a00 A3 : 0x3ffb8890 A4 : 0x00000004 A5 : 0xb33fffff
A6 : 0x00000001 A7 : 0x00000001 A8 : 0x3ffb8890 A9 : 0x00000018
A10 : 0x3ffb8890 A11 : 0x00000018 A12 : 0x3ffc53ec A13 : 0xb33fffff
A14 : 0x00000001 A15 : 0x00000001 SAR : 0x00000011 EXCCAUSE: 0x00000005
EXCVADDR: 0x00000000 LBEG : 0x40089da9 LEND : 0x40089db9 LCOUNT : 0xffffffff

Backtrace:0x4008ef35:0x3ffbeff00x4008e157:0x3ffbf010 0x4008d297:0x3ffbf030 0x400e849f:0x3ffbf070 0x400e6cdd:0x3ffbf090 0x400e71d7:0x3ffbf0b0 0x40081421:0x3ffbf150 0x400e5a95:0x3ffbf170 0x40084f65:0x3ffbf190 0x400e832f:0x3ffb27b0 0x400e6c61:0x3ffb27e0 0x400e6cf9:0x3ffb2800 0x400e81e2:0x3ffb2820
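To map a dump like this back to source lines, the backtrace addresses first have to be pulled out of the pasted text. A minimal Python sketch, assuming the usual ESP32 `PC:SP` pair layout; `parse_backtrace` is a hypothetical helper, and the regex is written to cope with the first two pairs running together as they do in the traces above. The PC values could then be fed to a tool such as addr2line from the Xtensa toolchain together with the firmware's ELF file, which is not part of this thread.

```python
import re

# Each frame is a program counter and a stack pointer, 8 hex digits each
PAIR = re.compile(r"0x([0-9a-fA-F]{8}):0x([0-9a-fA-F]{8})")


def parse_backtrace(line):
    """Return a list of (pc, sp) integer pairs found in a 'Backtrace:' line."""
    return [(int(pc, 16), int(sp, 16)) for pc, sp in PAIR.findall(line)]


# First three frames of the Core 0 backtrace, copied as pasted (note the fused pair)
line = "Backtrace:0x4008f0bf:0x3ffbd0100x4008ebdc:0x3ffbd050 0x4008d2cb:0x3ffbd070"
for pc, sp in parse_backtrace(line):
    print(f"pc=0x{pc:08x} sp=0x{sp:08x}")
```

Matching exactly eight hex digits per value is what lets the parser split `0x3ffbd0100x4008ebdc` correctly instead of swallowing the leading `0` of the next frame.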

DevSecNinja commented 2 years ago

Thanks! What makes you conclude it's the receive buffer? I'm not really familiar with C/C++.

renini commented 2 years ago

Would calling api/v1/receive.json periodically work as a work around? To clear the buffer?

proditaki commented 1 year ago

I don't know, it resets the index for the buffer if you look at the code. But the receive function is called all over the place so it should be reset quite frequently.

What I do is just reset the device every night with a cron job.

eelcohn commented 1 year ago

I think (hope) that I finally found why this problem is happening: the call to Serial.print() inside the interrupt routine. This triggers the watchdog, because calling Serial.print() just takes too long, and is not suitable for interrupts. I'm working on version 2.0.0 in which this issue should hopefully be fixed...
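The general remedy for this kind of problem is to keep the interrupt handler minimal and move slow work such as logging into the main loop. The following is only a toy Python model of that pattern, not the project's firmware and not the v2.0.0 code; `Receiver`, `on_interrupt` and `loop` are hypothetical names, and on the real device the handler would be a C++ ISR with its shared variables declared `volatile`.

```python
from collections import deque


class Receiver:
    """Toy model: the 'interrupt handler' only records events, the main loop logs them."""

    def __init__(self, capacity=4):
        self.buffer = deque()
        self.capacity = capacity
        self.dropped = 0  # counted in the handler, reported later

    def on_interrupt(self, frame):
        # Keep this short: no printing or other slow I/O in here
        if len(self.buffer) >= self.capacity:
            self.dropped += 1
        else:
            self.buffer.append(frame)

    def loop(self):
        # Runs outside interrupt context, so slow logging is acceptable here
        messages = []
        while self.buffer:
            self.buffer.popleft()
            messages.append("Data received")
        if self.dropped:
            messages.append(f"{self.dropped} frame(s) not stored due to buffer overflow")
            self.dropped = 0
        return messages


rx = Receiver(capacity=2)
for frame in (b"a", b"b", b"c"):
    rx.on_interrupt(frame)
for message in rx.loop():
    print(message)
```

The handler does nothing beyond a bounds check and an append or a counter increment, so its run time no longer depends on how fast the serial port drains; the overflow is still reported, just one loop iteration later.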

freakshock88 commented 1 year ago

Any update on this @eelcohn ? :)

xirixiz commented 1 year ago

Maybe this is more useful to you, @freakshock88:

https://gist.github.com/golles/ae32d9a7c14b63d9d68f6ff9a6fd4d6a

Well, this link in particular: https://github.com/Sanderhuisman/ESPHome-Zehnder-RF

freakshock88 commented 1 year ago

Thanks @xirixiz, I didn't know there was already ESPHome functionality for this, I'm going to switch :)

edwardmp commented 6 months ago

Hey @eelcohn,

Thanks for creating this first of all! But I'm still running into the issue that it crashes after a few hours. I'm reading here and on Tweakers you've been working on a v2, any chance you can share what you already have on that front?