anecdata closed this 10 months ago
@anecdata Can you do it with a debug build and reply with the backtrace?
I haven't been set up to build for almost a year, but I can work on that when I get back from my upcoming trip.
I tried to reproduce this with the test program above, with the URLs you gave, and also with some of my own. I ran trials for several hours.
I couldn't reproduce, but I have pretty reliable wifi. I also shortened the fetching interval a lot, to see if that made any difference, and tried just http and just https. So maybe it is due to some external failure?
I have pretty robust commercial-grade wifi. Signal strengths are good and disconnections are very rare. I've been battling this for months on a range of projects; the most common remaining theme is httpserver + a requests client. It's very reproducible for me, though the time between safe modes varies quite a bit. It's hard to rule out some external factor; I just can't think of what would account for this.
I also split up the test program above: 4 QT Py ESP32-S2, one running exactly as above, one running the HTTP URL only, one running the HTTPS URL only, and one doing both each loop. Oddly, the one doing only HTTPS has not crashed at all (almost two days), but all of the others have entered safe mode a dozen or more times since I started them up - the count is misleading since it may have died many hours before by the time I find it. No wi-fi disconnections in all that time.
I'm actually using an edited adafruit_requests.py that has print statements so I see when internal exceptions are caught and retried (otherwise invisible to user code). I don't know if this is unusual compared to others, but Requests almost always fails on the first internal try at this line: https://github.com/adafruit/Adafruit_CircuitPython_Requests/blob/203d0b1489cb90a39f8a780570287f1a5bd610a0/adafruit_requests.py#L679 but then almost always succeeds on the automatic internal retry. Other internal exceptions occasionally occur, but in virtually all cases, the Request is successful before internal retries are exhausted, so it doesn't seem like this is a (major) problem. At least I wouldn't expect it to trigger safe mode.
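The catch-and-retry behavior being described, where internal exceptions are caught, optionally printed, and the request retried, amounts to something like the following. This is a hypothetical simplified wrapper for illustration, not adafruit_requests' actual code:

```python
# Simplified sketch of the internal catch/print/retry pattern described
# above (illustrative only; adafruit_requests implements its own retry
# logic inline rather than via a wrapper like this).
def retry_with_logging(func, attempts=2, exceptions=(OSError,)):
    """Call func(), printing any caught exception and retrying.

    Raises the last exception if all attempts fail.
    """
    last_exc = None
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except exceptions as exc:
            last_exc = exc
            print("attempt", attempt, "failed:", repr(exc), "- retrying")
    raise last_exc
```

The observation in the thread is that the first attempt almost always fails and the retry almost always succeeds, so user code never sees the failure, which is why the extra print statements were needed to make it visible.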
I don't have a build setup currently. Scott offered to make a debug build. I'm here through tomorrow and could try to get the debug trace. Then I'm gone until late next week. After that, I can devote some time to getting building going again.
@anecdata thanks for the detailed reply. Here is a QT Py ESP32-S2 debug build (ulab is turned off) from the current tip of main, 1ab1cd3. qtpy_esp32s2_debug.zip [EDIT: no DEBUG UART pins, oops]
Prep DevKit N4R2 debug build: esptool erase flash, storage.erase_filesystem(), esptool firmware.bin, install code and support files, microcontroller.reset().
(some time later...)
Having some trouble with TLS requests. This S2 build works:
Adafruit CircuitPython 8.0.0-beta.6-45-g1ab1cd3ba on 2023-01-19; ESP32-S2-DevKitC-1-N4R2 with ESP32S2
but not this debug build:
Adafruit CircuitPython 8.0.0-beta.6-45-g1ab1cd3ba-dirty on 2023-01-18; ESP32-S2-DevKitC-1-N4R2 with ESP32S2
I'll let it keep running with HTTP site only and see what happens.
Not sure why one build would work, but not another.
I am also seeing https failing on the DEBUG build, on a Metro ESP32-S2. I tried another https host, same thing. I think this is an important clue, indicating maybe there is some uninitialized variable, or we're ignoring error returns at a crucial point, or something.
I would like to speed up reproducing the crash. Your test program has a 5-minute cycle. I tried faster and that speed. Do you think trying faster might prevent the crash for some reason (no socket timeout, etc.)?
I'm not sure, I wondered that last night and sped mine up for a time, but didn't notice anything. Very unscientific.
BTW, the debug build is still running on the DevKit from last night, exceptions on the HTTPS url but success on the HTTP adafruit url. Odd, since the original runs HTTPS-only fine but safe modes on HTTP-only. Could even be a data size / alignment issue somewhere?
I'm going to load up one of my earlier, more complex projects and see if I can cause safe mode on the debug build.
Addendum: the earlier code has not crashed yet either on the debug build. Any of these project iterations would fail relatively soon on any release build. Are there still optimization differences in the debug build?
The problem on the DEBUG build is that sock.connect((connect_host, port)) in _get_socket() in adafruit_requests fails for the https request. The same connect() succeeds on the non-DEBUG build. I will pursue that further for at least a while.
Got a halt rather than safe mode; it looks like the address is corrupted, though, and decode_backtrace can't decode it:
Guru Meditation Error: Core 0 panic'ed (Unhandled debug exception).
Debug exception reason: BREAK instr
Core 0 register dump:
PC : 0x400283c0 PS : 0x00060a36 A0 : 0x3ffd98a4 A1 : 0x3ffd98ac
A2 : 0x3f4c0058 A3 : 0x00000001 A4 : 0x80036348 A5 : 0x3ffe7b40
A6 : 0x00060023 A7 : 0x00000003 A8 : 0x00060023 A9 : 0x3ffe7bc0
A10 : 0x3ffd1b64 A11 : 0xffffffff A12 : 0x00000000 A13 : 0x00000000
A14 : 0x3ffe6c8c A15 : 0x3ffe6c40 SAR : 0x00000014 EXCCAUSE: 0x00000001
EXCVADDR: 0xfffffff0 LBEG : 0x00000000 LEND : 0x00000000 LCOUNT : 0x4002ef88
Backtrace: 0x400283bd:0x3ffd98ac |<-CORRUPTED
ELF file SHA256: 53665fd10c530e5a
CPU halted.
I'll keep trying to get something decode-able.
Addendum:
Different halt, also corrupted and not decode-able:
E (14026) esp-x509-crt-bundle: PK verify failed with error FFFFBD70
E (14026) esp-x509-crt-bundle: Failed to verify certificate
E (14036) esp-tls-mbedtls: mbedtls_ssl_handshake returned -0x3000
E (14036) esp-tls: Failed to open new connection
Guru Meditation Error: Core 0 panic'ed (Unhandled debug exception).
Debug exception reason: BREAK instr
Core 0 register dump:
PC : 0x400283c0 PS : 0x00060036 A0 : 0x00060023 A1 : 0x3ffe7bc0
A2 : 0x3ffd1b64 A3 : 0xffffffff A4 : 0x00000000 A5 : 0x00000000
A6 : 0x3ffe6c8c A7 : 0x3ffe6c40 A8 : 0x80036348 A9 : 0x3ffe7bb0
A10 : 0x00060a23 A11 : 0x00000000 A12 : 0x00060a20 A13 : 0x00000001
A14 : 0x3ffd9bb8 A15 : 0x00000001 SAR : 0x0000001e EXCCAUSE: 0x00000001
EXCVADDR: 0xfffffff1 LBEG : 0x00060a20 LEND : 0x00000001 LCOUNT : 0x4002ef88
Backtrace: 0x400283bd:0x3ffe7bc0 0x00060020:0x3ffe55f0 |<-CORRUPTED
ELF file SHA256: 53665fd10c530e5a
CPU halted.
Addendum 2:
I keep getting halts, but the Backtrace is always corrupted and not decode-able. I'll try some different code.
When DEBUG=1, there is something wrong with the certificate verification (same errors as in the previous post):
E (13018) esp-x509-crt-bundle: PK verify failed with error FFFFBD70
E (13018) esp-x509-crt-bundle: Failed to verify certificate
E (13018) esp-tls-mbedtls: mbedtls_ssl_handshake returned -0x3000
E (13028) esp-tls: Failed to open new connection
@anecdata, when you saw the cert errors shown in your post, was DEBUG=1?
I think I have narrowed it down to some difference between the sdkconfig-debug.defaults and sdkconfig-opt.defaults files. It does not seem to have to do with the optimization level.
Yes, those messages only come up in the debug build (the one you sent).
This could be a completely different issue than the safe mode then.
Yes, I think so. I'm now trying to turn on the log messages without turning on the other DEBUG stuff.
FFFFBD70 is -17040, possibly a memory allocation failure, but I'll be danged if I can actually figure out mbedtls error numbers. https://github.com/espressif/arduino-esp32/issues/2286
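A mbedtls error like this can be decoded by reinterpreting the hex dump as a signed 32-bit integer and then splitting it with mbedtls's combined-error convention (mask 0xFF80 for the high-level module part, 0x007F for the low-level part). The constant names in the comments below are my reading of the mbedtls headers, not something the logs confirm:

```python
# Decode the FFFFBD70 value above as a combined mbedtls error code.
def split_mbedtls_error(code):
    """Split a negative combined mbedtls error into (high, low) parts."""
    magnitude = -code
    high = -(magnitude & 0xFF80)  # high-level module error (RSA, X509, ...)
    low = -(magnitude & 0x007F)   # low-level error (MPI, AES, ...)
    return high, low

raw = 0xFFFFBD70
code = raw - (1 << 32)            # reinterpret as signed 32-bit: -17040
high, low = split_mbedtls_error(code)
print(hex(code), hex(high), hex(low))
# If I'm reading the headers right, -0x4280 is MBEDTLS_ERR_RSA_PUBLIC_FAILED
# and -0x0010 is MBEDTLS_ERR_MPI_ALLOC_FAILED ("memory allocation failed"),
# which would be consistent with the allocation-failure guess above.
```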
-0x3000 may be a key exchange error? https://github.com/espressif/mbedtls/blob/89da85968417ce4a163b7275cb22f159995c271d/include/mbedtls/error.h
Finally got an uncorrupted backtrace (DEBUG=1 build that errors on HTTPS):
E (223807636) esp-x509-crt-bundle: PK verify failed with error FFFFBD70
E (223807636) esp-x509-crt-bundle: Failed to verify certificate
E (223807636) esp-tls-mbedtls: mbedtls_ssl_handshake returned -0x3000
E (223807646) esp-tls: Failed to open new connection
Guru Meditation Error: Core 0 panic'ed (StoreProhibited). Exception was unhandled.
Core 0 register dump:
PC : 0x40038b0e PS : 0x00060433 A0 : 0x800396f0 A1 : 0x3ffdcff0
A2 : 0x3ffdf724 A3 : 0x3fff47e4 A4 : 0x80036348 A5 : 0x3ffe55b0
A6 : 0x00060023 A7 : 0x00000003 A8 : 0x00000008 A9 : 0x00000000
A10 : 0x3ffdf744 A11 : 0x00000010 A12 : 0x3ffe7bb0 A13 : 0x80036348
A14 : 0x00060823 A15 : 0x80036348 SAR : 0x0000001c EXCCAUSE: 0x0000001d
EXCVADDR: 0x0006082b LBEG : 0x3ffe7bb0 LEND : 0x80036348 LCOUNT : 0x4002cc18
Backtrace: 0x40038b0b:0x3ffdcff0 0x400396ed:0x3ffdd010 0x4002cfdd:0x3ffdd030 0x4002cff1:0x3ffdd060 0x4002d01e:0x3ffdd080 0x4003ad31:0x3ffdd0a0 0x400f4e18:0x3ffdd0c0 0x400f4e96:0x3ffdd0e0 0x400f4ef9:0x3ffdd100 0x400f3d90:0x3ffdd120 0x400f3a54:0x3ffdd140 0x400b7600:0x3ffdd180
ELF file SHA256: 53665fd10c530e5a
CPU halted.
? Backtrace: 0x40038b0b:0x3ffdcff0 0x400396ed:0x3ffdd010 0x4002cfdd:0x3ffdd030 0x4002cff1:0x3ffdd060 0x4002d01e:0x3ffdd080 0x4003ad31:0x3ffdd0a0 0x400f4e18:0x3ffdd0c0 0x400f4e96:0x3ffdd0e0 0x400f4ef9:0x3ffdd100 0x400f3d90:0x3ffdd120 0x400f3a54:0x3ffdd140 0x400b7600:0x3ffdd180
got ['0x40038b0b', '0x400396ed', '0x4002cfdd', '0x4002cff1', '0x4002d01e', '0x4003ad31', '0x400f4e18', '0x400f4e96', '0x400f4ef9', '0x400f3d90', '0x400f3a54', '0x400b7600']
/home/halbert/repos/circuitpython/ports/espressif/esp-idf/components/heap/heap_tlsf.c:212
/home/halbert/repos/circuitpython/ports/espressif/esp-idf/components/heap/multi_heap.c:197
/home/halbert/repos/circuitpython/ports/espressif/esp-idf/components/heap/heap_caps.c:154
/home/halbert/repos/circuitpython/ports/espressif/esp-idf/components/heap/heap_caps.c:174
/home/halbert/repos/circuitpython/ports/espressif/esp-idf/components/heap/heap_caps.c:199
/home/halbert/repos/circuitpython/ports/espressif/esp-idf/components/newlib/heap.c:24
/home/halbert/repos/circuitpython/ports/espressif/esp-idf/components/lwip/lwip/src/core/mem.c:237
/home/halbert/repos/circuitpython/ports/espressif/esp-idf/components/lwip/lwip/src/core/memp.c:254
/home/halbert/repos/circuitpython/ports/espressif/esp-idf/components/lwip/lwip/src/core/memp.c:350 (discriminator 2)
/home/halbert/repos/circuitpython/ports/espressif/esp-idf/components/lwip/lwip/src/api/tcpip.c:321 (discriminator 2)
/home/halbert/repos/circuitpython/ports/espressif/esp-idf/components/lwip/lwip/src/api/sockets.c:2905
/home/halbert/repos/circuitpython/ports/espressif/esp-idf/components/lwip/lwip/src/include/lwip/sockets.h:636
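The decode step above boils down to pulling the PC address out of each PC:SP pair in the Backtrace line and feeding those addresses to addr2line against the matching ELF. A rough sketch, where the addr2line tool name and ELF path are assumptions to adjust for your build:

```python
# Sketch of a backtrace decode helper. Only extract_addrs() is exercised
# here; decode_backtrace() needs the xtensa toolchain and ELF on hand.
import re
import subprocess

def extract_addrs(line):
    """Return the PC (first) address of each PC:SP pair in a backtrace line."""
    return re.findall(r"(0x[0-9a-fA-F]{8}):0x[0-9a-fA-F]{8}", line)

def decode_backtrace(line, elf="build/firmware.elf",
                     addr2line="xtensa-esp32s2-elf-addr2line"):
    """Resolve backtrace PCs to source locations via addr2line."""
    addrs = extract_addrs(line)
    print("got", addrs)
    result = subprocess.run([addr2line, "-e", elf, *addrs],
                            capture_output=True, text=True)
    return result.stdout

sample = ("Backtrace: 0x40038b0b:0x3ffdcff0 0x400396ed:0x3ffdd010 "
          "0x4002cfdd:0x3ffdd030")
print(extract_addrs(sample))
```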
206 /* Remove a free block from the free list.*/
207 static inline __attribute__((__always_inline__)) void remove_free_block(control_t* control, block_header_t* block, int fl, int sl)
208 {
209 block_header_t* prev = block->prev_free;
210 block_header_t* next = block->next_free;
211 tlsf_assert(prev && "prev_free field can not be null");
212 tlsf_assert(next && "next_free field can not be null");
The esp-idf allocation heap is corrupted, such as by writing before or after an allocated object, writing to a memory address that was previously deallocated, etc. Unfortunately, this doesn't point us any closer to where the incorrect memory write occurred.
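As a toy illustration of the invariant that assertion guards, here is a Python model (not the real C allocator) of a doubly linked free list: a stray out-of-bounds write that clobbers a block's free-list links makes the next unlink trip the debug assert.

```python
# Toy model of the free-list invariant checked at heap_tlsf.c:212.
class Block:
    def __init__(self, name):
        self.name = name
        self.prev_free = None
        self.next_free = None

def remove_free_block(block):
    """Unlink block from the free list, mirroring heap_tlsf.c's checks."""
    prev, nxt = block.prev_free, block.next_free
    assert prev is not None, "prev_free field can not be null"
    assert nxt is not None, "next_free field can not be null"
    prev.next_free = nxt
    nxt.prev_free = prev

# Build a 3-block free list: a <-> b <-> c (crude self-link sentinels).
a, b, c = Block("a"), Block("b"), Block("c")
a.next_free, b.prev_free = b, a
b.next_free, c.prev_free = c, b
a.prev_free, c.next_free = a, c

remove_free_block(b)   # fine: the list is intact
c.prev_free = None     # simulate a stray write hitting c's header
# remove_free_block(c) would now trip the assertion.
```

In a release build there is no assert, so the corrupted pointer would simply be dereferenced, which fits the StoreProhibited crashes and corrupted backtraces seen here.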
This backtrace may reflect a DEBUG=1 error only, as I was unable to get this to work at all with DEBUG=1. Or perhaps it's an earlier indicator of what the other error is.
I have a project on the Unexpected Maker Feather S2 that enters safe mode with "Crash into the HardFault_Handler." or "Internal watchdog timer expired." This project doesn't have issues under 7.3.3.
Unlike anecdata's project, mine uses adafruit_mqtt, not the adafruit_httpserver or adafruit_requests libraries. Mine also seems to run for a couple of days before crashing. I can open a different ticket if people think my issue is unrelated.
I'm seeing the same bug on an ESP32-S2 Reverse TFT. It's too new to have a 7.3.3 build, so I was unable to test whether it worked without crashing on 7.3.3.
I'm using adafruit_requests to make one GET every 30 seconds. I've tried upping that to 60 seconds, but it still crashes, usually within about 20 minutes, sometimes after as long as 2 hours. Like anecdata, I get one of the same two errors when entering safe mode:
Usually:
You are in safe mode because: CircuitPython core code crashed hard. Whoops! Crash into the HardFault_Handler.
Occasionally:
You are in safe mode because: Internal watchdog timer expired.
If there's anything else I can share to help, please let me know.
I just wanted to note that I have a case where I have two identical boards set up identically at the same time: erase flash first, then install the .bin. One unit has way more safe modes during run than the other. It always has, but still happening when set up identically makes me wonder if there's something marginal in the flash or hardware.
@anecdata have you checked the 3.3V line to see if the voltages differ substantially?
And are they talking to the same host? Is one always started after another? Maybe one is stalling on a network operation?
If the voltages are good, maybe try lowering psram speed and cpu freq temporarily to see if this goes away?
Same host. Both at 3.28V. Reloads, resets, and safe modes occur at seemingly random times. code.py reloads for various exceptions and for a certain number of consecutive failures in wifi or network operations, and resets if there have been a certain number of consecutive reloads. Safe mode triggers on this particular project due to an intermittent bug in wifi monitor init that I haven't had time to dig into deeper (need to file an issue). safemode.py resets it.
I can play with build params at some point, but I'm inclined to just swap out the "flaky" one for now to reduce downtime. I thought it was worth sharing that some safe modes may not be due to software.
@anecdata When you last looked at this, it seemed like it was board sample dependent, if I understand correctly. Should we leave this open?
I meant only to note that in one instance, I had different apparent rates of safe mode between two otherwise identical boards.
The issue is still open, almost any non-trivial wifi / requests project I put together eventually goes into safe mode with recent 8.x.x (newer esp-idf is in there too, of course). I have a number of similar long-running 7.3.3 devices that never go into safe mode (though project code has evolved since then). It's difficult to characterize the issue since the safe modes are intermittent. I have since seen safe mode on requests-only projects with 8.1.0-beta.2 with no major complicating factors like http server, async, etc.
Further complicating things is the change in behavior with a DEBUG build, so we don't seem to be able to get an accurate debug backtrace.
It's also possible that the root cause for this is the same as other 8.x.x safe mode issues that have been filed.
Please re-test after the idf 5 upgrade.
Closing. I think enough code has changed that fresh testing and new minimal example will be needed if this is still occurring.
CircuitPython version
Code/REPL
Behavior
8.0.0-beta code on ESP32-S2 enters safe mode under certain circumstances I can't identify exactly.
Same projects on ESP32-S2 7.3.3 never enter safe mode. Same projects on Pico W never enter safe mode.
Characteristics of projects entering safe mode:
boot.py doesn't seem to matter
Behavior: after some indeterminate duration (typically some hours)...
Usually: CircuitPython core code crashed hard. Whoops! Crash into the HardFault_Handler.
Occasionally: Internal watchdog timer expired.
I just started up 4 with the above code a half hour ago, fresh from a reset, and three have crashed: two hardfault and one internal watchdog. It always enters safe mode after the "Listening on http://..." print statement, most likely in server.poll(), and if so then most likely in _sock.accept().
Description
No response
Additional information
No response