Closed wyattwww closed 11 months ago
I have not run into this
Thanks, I will have a look. I think it's possible that the issue relates to an overflow (which I thought I had fixed in the past).
Can you tell me the exact time when this issue occurs (i.e. something more precise than approx. 2 hours)? Is it possible that it's more around 71 minutes 30 seconds?
I will try to reproduce this, but two hours is quite a long time :) I wonder if we could somehow time travel in code by changing things. I will have a look at that too :)
Well, the exercise has not yet reached its aim, but I already found another bug related to the wifi reconnect handling :) So this was very useful already.
So I disabled ENABLE_WEBSOCKET_MONITOR and there's no more crashing. With only ENABLE_BLE_SERVICE enabled, the device ran overnight emitting data without issue and the connection was stable.
The websocket crash is always just over 2 hours and a couple of minutes, even if there's no browser connected. I inspected the values and they are all within range. So I suspect there's a memory leak with websocket, or in the to_string conversion?
That is very interesting, as the error shows that the issue seemingly originates from the string builder part of notifyClients. I know the errors that come from the socket, and the stack trace there would include, for instance, a call to the textAll() method.
I was not able to reproduce this issue. I have run it for 3 hours and so far it worked fine. I attached the log output I have so far: output - Copy.json. I will keep running it overnight but in the meantime, can you please:
Thanks.
Yes, the board could be another vector... I'm using a Wemos D1 Mini ESP32. Attached test.array.h.
platformio.ini:
[env:esp32]
monitor_speed = 115200
platform = https://github.com/platformio/platform-espressif32.git#master
board = wemos_d1_mini32
board_upload.flash_size = 4MB
board_build.partitions = no_ota.csv
board_build.filesystem = littlefs
board_build.flash_mode = qio
framework = arduino
monitor_filters = esp32_exception_decoder
lib_deps =
    h2zero/NimBLE-Arduino
    thijse/ArduinoLog
    ottowinter/ESPAsyncWebServer-esphome
platform_packages =
    framework-arduinoespressif32 @ https://github.com/espressif/arduino-esp32#master
build_flags =
    -std=c++2a
    -std=gnu++2a
    -O2
build_unflags =
    -std=gnu++11
    -ggdb
    -Os
Thanks, can you send me the settings.h, please? platformio.ini is actually not relevant :)
I think I have one of these boards, so if I cannot reproduce this issue with the doit esp32 devkit v1 that I generally use for standard development, I will try the Wemos.
Just to confirm, your Wemos is with the ESP32-WROOM-32 chipset, right? Can you please check? It's important to determine whether it's dual core or not, as well as the clock speed.
thanks.
Ran it the whole night. No error. I will try changing things around, but I think this is an overflow issue that exists only in the test. Can you change the following line in the test.array.h file:
elapsedTime = accumulate(test.begin(), test.begin() + i, 0);
to this:
elapsedTime += accumulate(test.begin(), test.begin() + i, 0);
and test again? I think the issue is that, the way the test is set up, it is able to reach the max value of a 64-bit uint. I have never added any guards against that, as it should take years before that overflows :)
But I can see that the way test.array.h is implemented now (it has a bug), it is possible to accumulate 2000 minutes within 30 minutes of real time. I suspect that when the overflow happens the string builder cannot handle it. This is just a theory that I am testing with the original delta times, but since this takes a lot of time I would like to ask for your help.
The odd thing is that when the overflow happens, the string builder should in theory be able to handle it, as it is nothing special, it just needs to write a smaller number, so I may be wrong here.
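To make the difference between the two lines concrete, here is a minimal sketch of what each assignment computes over a small set of delta times. The surrounding loop, the element type, and the helper names (sumOfFirst, elapsedWithAssign, elapsedWithPlusAssign) are assumptions for illustration only; the actual loop in test.array.h may look different. One thing that is certain about std::accumulate is that it sums in the type of its init argument, so a plain `0` means an `int` accumulator; the sketch passes a 64-bit init value instead.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical helper: sum of the first `count` delta times.
// The init value is a 64-bit zero, so the accumulation happens in 64 bits
// (with a plain `0` std::accumulate would sum in `int`).
std::uint64_t sumOfFirst(const std::vector<std::uint32_t>& deltas, std::size_t count) {
    assert(count <= deltas.size());
    return std::accumulate(deltas.begin(),
                           deltas.begin() + static_cast<std::ptrdiff_t>(count),
                           std::uint64_t{0});
}

// Variant with `=`: the value is overwritten on each iteration,
// so it ends up as the sum of all deltas.
std::uint64_t elapsedWithAssign(const std::vector<std::uint32_t>& deltas) {
    std::uint64_t elapsedTime = 0;
    for (std::size_t i = 0; i <= deltas.size(); ++i) {
        elapsedTime = sumOfFirst(deltas, i);
    }
    return elapsedTime;
}

// Variant with `+=`: every partial sum is added on top of the previous value,
// so the result is a running total of the partial sums.
std::uint64_t elapsedWithPlusAssign(const std::vector<std::uint32_t>& deltas) {
    std::uint64_t elapsedTime = 0;
    for (std::size_t i = 0; i <= deltas.size(); ++i) {
        elapsedTime += sumOfFirst(deltas, i);
    }
    return elapsedTime;
}
```

For the deltas {10, 20, 30} the partial sums are 0, 10, 30 and 60, so the `=` variant ends at 60 while the `+=` variant ends at 100. Which of the two matches what the test intends depends on how the loop in test.array.h is written, which this sketch does not know.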
I am not able to reproduce this issue. I have tried it with the wemos32 mini D1. I never encountered this 2 hour mark issue. I have run it for many hours without an issue.
I can give it another go once you send me your settings.h. Maybe there is something that causes the issue, though I cannot think of anything that could create a consistent crash like this.
Ok thanks for trying. I will investigate further when I’m back online later. Thank you!
Here is my settings.h. It's from my fork with FTMS, so there may be a few new lines in there; please disregard and revert those.
I'll do more testing on another board, and from your repo too to see if I can isolate this.
Thanks!
The Wemos D1 Mini is WROOM-32 as well.
I made this change in test.array.h and it's been running for 12 hours now; I cannot reproduce it anymore. Not sure if that was the root cause, but I'll mark this closed for now.
elapsedTime += accumulate(test.begin(), test.begin() + i, 0);
Thanks!
Finally I was able to reproduce the error with your settings. The issue is not with the code per se (at least not explicitly). I encountered this in the past, though it had a slightly different stack trace, which is why I was not able to recognise it. The issue is that, for some reason, always around two hours the delta times take on such a set of values that the stroke detection (with your settings and the test.array.h delta times) loses track and never sees the end of the drive. This means that the driveHandleForces values just get accumulated indefinitely, which eventually results in the MCU running out of memory. This happens rather early because the driveHandleForces vector is passed by value (i.e. copied), so each call with the data gets very expensive over time; the vector is actually copied at least 3 times.
Based on my previous experience, if it gets over 700 elements things get unstable. The crash occurs because the number of elements in the vector gets too big and the ESP32 runs out of memory while building it. You can simulate the same (or a similar) error if you comment out the driveHandleForces.clear() calls in stroke.service.cpp.
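As a small illustration of why passing the vector by value gets expensive as it grows, the sketch below compares a by-value parameter with a const reference. The function names and the float element type are assumptions for illustration and are not taken from stroke.service.cpp.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical consumer that takes the samples by value: the caller's
// vector is copied into `forces` on every call, so it owns separate storage.
bool sharesStorageByValue(std::vector<float> forces, const float* callerData) {
    return forces.data() == callerData;
}

// Same check with a const reference: no copy is made, so the storage
// is the caller's own buffer.
bool sharesStorageByConstRef(const std::vector<float>& forces, const float* callerData) {
    return forces.data() == callerData;
}

// Rough payload size of `copies` simultaneous copies of the vector
// (ignores allocator overhead and spare capacity).
std::size_t payloadBytes(const std::vector<float>& forces, std::size_t copies) {
    return forces.size() * sizeof(float) * copies;
}
```

With 700 float samples and three live copies, payloadBytes reports 700 * sizeof(float) * 3 bytes of sample data alone, and that figure keeps growing for as long as the vector is never cleared.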
There are several solutions:
1. The += cumulative assignment solves the issue of not detecting the drive end and hence the memory leak.
2. Add a check on the driveHandleForces variable so that when it gets too big it resets itself.
3. Pass the vector by reference instead of by value.
I am not a fan of passing by reference, even if it's a const so modification is disallowed. Not to mention that this approach would not actually solve the issue, as theoretically this vector can still get too big. Also, I think there should be no real-life scenario where there are more than a couple of hundred handle force data points. So I would go for 1 and 2.
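A check of the kind described in the second option could look roughly like the sketch below. The class name, the method names, and the cap of 700 (borrowed from the instability threshold mentioned earlier in the thread) are assumptions for illustration, not the fix that was actually pushed.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical wrapper around the handle force samples of one drive.
class HandleForceBuffer {
public:
    // `maxCount` is an assumed cap on the number of stored samples.
    explicit HandleForceBuffer(std::size_t maxCount = 700) : maxCount_(maxCount) {
        assert(maxCount_ > 0);
    }

    // Adds a sample. If the cap has been reached (e.g. because the end of
    // the drive was never detected), the buffer is reset first.
    // Returns true when such a reset happened.
    bool add(float force) {
        bool wasReset = false;
        if (forces_.size() >= maxCount_) {
            forces_.clear();
            wasReset = true;
        }
        forces_.push_back(force);
        return wasReset;
    }

    std::size_t size() const { return forces_.size(); }

private:
    std::size_t maxCount_;
    std::vector<float> forces_;
};
```

The guard only bounds memory use; it does not repair a missed drive end, so it complements the `+=` fix in the test data rather than replacing it.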
Glad you were able to reproduce it too, and thanks for the recommendations. I can confirm #1 works. I will implement a check for #2 and re-test some more.
Separately, may I get your email to connect? I have a couple other non-technical questions about the monitor.
Thanks! Wyatt
I will push the fixes shortly. I recommend you rebase or merge the changes in your fork onto the updated master so everything is aligned and up to date (for instance, I noticed that there is no DEVICE_NAME setting in your settings.h).
On your non-technical questions about the monitor: can you start a new thread in the discussion section instead? It would help keep the conversation more traceable and potentially provide information for others too :).
Hi,
I'm running into a crash when the device has been processing for approximately 2 hours. Reproduction steps:
Have you run into this before?
Thanks, Wyatt