Crash at 2-hour mark - Githubissues

wyattwww commented 11 months ago

Hi,

I'm running into a crash when the device has been processing for approximately 2 hours. Reproduction steps:

In main.cpp, enable simulateRotation()
In test.array.h, increase the number of simulations by setting i2 check to large number, e.g. i2 == 500.
Build and upload to the device and connect via WebGUI or watch the terminal output in platform.
Just after 2 hours, there is a panic_abort. The callstack is this:

Backtrace: 0x40083d45:0x3ffd41c0 0x400957e1:0x3ffd41e0 0x4009b519:0x3ffd4200 0x4017e5d7:0x3ffd4280 0x4017e61e:0x3ffd42a0 0x4017e57f:0x3ffd42c0 0x4017e9ae:0x3ffd42e0 0x400d380f:0x3ffd4300 0x400d490d:0x3ffd4320 0x400d49c1:0x3ffd4360 0x400d5ede:0x3ffd4380 0x400d67f1:0x3ffd4400 0x400d3533:0x3ffd44a0 0x400ff01d:0x3ffd4520

  #0  0x40083d45:0x3ffd41c0 in panic_abort at /Users/ficeto/Desktop/ESP32/ESP32S2/esp-idf-public/components/esp_system/panic.c:402
  #1  0x400957e1:0x3ffd41e0 in esp_system_abort at /Users/ficeto/Desktop/ESP32/ESP32S2/esp-idf-public/components/esp_system/esp_system.c:128
  #2  0x4009b519:0x3ffd4200 in abort at /Users/ficeto/Desktop/ESP32/ESP32S2/esp-idf-public/components/newlib/abort.c:46
  #3  0x4017e5d7:0x3ffd4280 in __cxxabiv1::__terminate(void (*)()) at /builds/idf/crosstool-NG/.build/HOST-x86_64-apple-darwin12/xtensa-esp32-elf/src/gcc/libstdc++-v3/libsupc++/eh_terminate.cc:47
  #4  0x4017e61e:0x3ffd42a0 in std::terminate() at /builds/idf/crosstool-NG/.build/HOST-x86_64-apple-darwin12/xtensa-esp32-elf/src/gcc/libstdc++-v3/libsupc++/eh_terminate.cc:57
  #5  0x4017e57f:0x3ffd42c0 in __cxa_throw at /builds/idf/crosstool-NG/.build/HOST-x86_64-apple-darwin12/xtensa-esp32-elf/src/gcc/libstdc++-v3/libsupc++/eh_throw.cc:95
  #6  0x4017e9ae:0x3ffd42e0 in operator new(unsigned int) at /builds/idf/crosstool-NG/.build/HOST-x86_64-apple-darwin12/xtensa-esp32-elf/src/gcc/libstdc++-v3/libsupc++/new_op.cc:54
  #7  0x400d380f:0x3ffd4300 in _ZNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEE9_M_createERjj$isra$28 at bluetooth.service.cpp:?
  #8  0x400d490d:0x3ffd4320 in std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_mutate(unsigned int, unsigned int, char const*, unsigned int) at ??:?
  #9  0x400d49c1:0x3ffd4360 in std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_append(char const*, unsigned int) at ??:?
  #10 0x400d5ede:0x3ffd4380 in NetworkService::notifyClients(RowingDataModels::RowingMetrics, unsigned char, BleServiceFlag, ArduinoLogLevel) at ??:?
  #11 0x400d67f1:0x3ffd4400 in PeripheralsController::updateData(RowingDataModels::RowingMetrics) at ??:?
  #12 0x400d3533:0x3ffd44a0 in loop() at ??:?
  #13 0x400ff01d:0x3ffd4520 in loopTask(void*) at ??:?

Have you run into this before?

Thanks, Wyatt

Abasz commented 11 months ago

I have not run into this

Thanks, I will have a look. I think its possible that the issue relates to an overflow (which I have thought I fixed in the past).

Can you tell me what is the exact time when this issue occurs (i.e. have something more precise than approx 2 hours). Is it possible that its more around 71 minutes 30 seconds?

I will try to reproduce this but two hours is quite a long time :) I wonder if we could time travel in code some how by chaning thing. I will have a look at that too :)

Abasz commented 11 months ago

Well the excersize has not yet reached its aim, but I already fund another bug related to the wifi reconnect handling :) So this was very useful already.

wyattwww commented 11 months ago

So I disabled ENABLE_WEBSOCKET_MONITOR and there's no more crashing, with ENABLE_BLE_SERVICE only the device ran overnight emitting data without issue and the connection was stable.

The websocket crash is always just over 2 hours and a couple of minutes, even if there's no browser connected. I inspected the values and they are all within range. So I suspect there's a memory leak with websocket, or in the to_string conversion?

Abasz commented 11 months ago

That is very interesting as the error shows that the issue seemingly originates from the string builder part of notifyClients. I know errors which comes from the socket and the stack trace there would include for instance a call to the textAll() method.

Abasz commented 11 months ago

I was not able to reproduce this issue. I have run it for 3 hours and so far it worked fine. I attached the log output I have so far: output - Copy.json. I will keep running it overnight but in the meantime, can you please:

send me your settings.h file and test.array.h file please?
Which development board are you using?
Are you using latest master branch?

Thanks.

wyattwww commented 11 months ago

Yes, the board could be another vector... I'm using a Wemos D1 Mini ESP32. Attached test.array.h.

platformio.ini:

[env:esp32]
monitor_speed = 115200
platform = https://github.com/platformio/platform-espressif32.git#master
board = wemos_d1_mini32
board_upload.flash_size = 4MB
board_build.partitions = no_ota.csv
board_build.filesystem = littlefs
board_build.flash_mode = qio
framework = arduino
monitor_filters = esp32_exception_decoder
lib_deps = 
    h2zero/NimBLE-Arduino
    thijse/ArduinoLog
    ottowinter/ESPAsyncWebServer-esphome
platform_packages = 
    framework-arduinoespressif32 @ https://github.com/espressif/arduino-esp32#master
build_flags = 
    -std=c++2a
    -std=gnu++2a
    -O2
build_unflags = 
    -std=gnu++11
    -ggdb
    -Os

test.array.h.zip

Abasz commented 11 months ago

THanks, can you send me the settings.h please. platformio.ini is actually not relevant :)

Abasz commented 11 months ago

I have one piece of this board I think, so if I cannot reproduce this issue with the doit esp32 devkit v1 that I generally use for standard development I will try the Wemos.

Just to confirm, your wemos is with the ESP32-WROOM-32 chipset right? Can you please check? Its important to determine whether its dual core or not as well as the clock speed.

thanks.

Abasz commented 11 months ago

Run the thing whole night. No error. I will try with changing things arround. But I think this is an overflow issue that exists only in the test. Can you change the following line in the test.array.h file:

        elapsedTime = accumulate(test.begin(), test.begin() + i, 0);

to this:

        elapsedTime += accumulate(test.begin(), test.begin() + i, 0);

and test again? I think the issue is that the way the test is setup it is able to reach the max value of a 64bit uint. I have never made any guards against that as it should be years before that overflows :)

But I can see the way the test.array.h is implemented now (which it has a bug), it is possible to accumulate 2000 minutes within 30 minutes in real time. I suspect that when the overflow happens the string builder cannot handle it. Now this is just a theory that I am testing with the original delta times but since this takes a lot of time I would like to ask your help.

The odd thing is that when the overflow happens, the string builder should be able to handle in theory, as that is nothing special just need to write a smaller number, so I can be wrong here.

Abasz commented 11 months ago

I am not able to reproduce this issue. I have tried it with the wemos32 mini D1. I never encountered this 2 hour mark issue. I have run it for many hours without an issue.

I can give it another go once you sent me your settings.h. May be there is something that cuases the issue, though I cannot think of anything that could create a consistent crash like this.

wyattwww commented 11 months ago

Ok thanks for trying. I will investigate further when I’m back online later. Thank you!

wyattwww commented 11 months ago

Here is my settings.h. It's from my fork with FTMS so there maybe a few new lines in there, please disregard and revert those.

I'll do more testing on another board, and from your repo too to see if I can isolate this.

Thanks!

settings.h.zip

wyattwww commented 11 months ago

The Wemos D1 Mini is WROOM-32 as well.

wyattwww commented 11 months ago

I made this change in test.array.h and it's been running for 12 hours now, cannot reproduce anymore. Not sure if that was the root cause but I'll mark close for now.

elapsedTime += accumulate(test.begin(), test.begin() + i, 0);

Thanks!

Abasz commented 11 months ago

Finally I was able to reproduce the error with your settings. The issue is not with the code per se (at least not explicitly). I encountered this in the past, though it had a slightly different stack trace and that is why I was not able to recognise it). The issue is that for some reason always around two hours the delta times have such set of value that the stroke detection with your settings and test.array.h delta times looses track and never sees the end of the drive. This means that the driveHandleForce values just get accumulated infinitely which results in the MCU running out of memory eventually. This happens rather early as the driveHandleForce vector is passed by value (i.e. copy), so the call to the data gets very expensive over time, the MCU actually there is a copy this at least 3 times.

Based on my previous experience if it gets over 700 things get unstable. Crash occurs because the number of elements in the vector gets too big and the ESP32 runs out memory while building it. You can simulate the same/or similar error if you comment out the driveHandleForces.clear() calls in the stroke.service.cpp

There are several solution:

There is one bug in the test.array.h file, namely that elapsed time is not accumulated but rather reassigned before starting the new set. So changing the assignment to a += cummulative assignment solves the issue of not detecting drive end hence, the memory leak
Add a safeguard for the driveHandelForces variable that when it gets too big it resets it self
Pass data by reference (const of course)
Do all or some of the above

I am not a fan of passing by reference even if its a const so modification is disallowed. Not to mention that this approach would not actually solve the issue, as theoretically this vector can still get too big. Also I think there should be no real life scenario where there are more than a couple of hundreds of handle force datapoints. So I would go for 1 and 2.

wyattwww commented 11 months ago

Glad you were able to reproduce it too and thanks for the recommendations. I can confirm # 1 works, I will implement a check for # 2 and re-test some more.

Separately, may I get your email to connect? I have a couple other non-technical questions about the monitor.

Thanks! Wyatt

Abasz commented 11 months ago

I will push the fixes shortly. I recommend you rebase or merge your changes in your fork on the updated master so everything is aligned and up to date (for instance I noticed that there is no DEVICE_NAME setting in your settings.h.

On your non-technical questions about the monitor. Can you start a new thread in the discussion section instead? It would help keep the conversation better traceable as well as potentially provide information for others too :).

Abasz / ESPRowingMonitor

Crash at 2-hour mark #6