Some ESP32-S2 8.0.0-beta code enters safe mode. 7.3.3 does not. Pico W does not.

anecdata commented 1 year ago

CircuitPython version

Adafruit CircuitPython 8.0.0-beta.6-40-gcc6dbb373 on 2023-01-16; Adafruit QT Py ESP32S2 with ESP32S2
# and earlier betas... not sure when it started, but at least back to October or so

Code/REPL

import time
import random
import traceback
import board
import digitalio
import neopixel
import wifi
import socketpool
import ssl
from adafruit_httpserver.server import HTTPServer
from adafruit_httpserver.request import HTTPRequest
from adafruit_httpserver.response import HTTPResponse
from adafruit_httpserver.methods import HTTPMethod
from adafruit_httpserver.mime_type import MIMEType
import adafruit_requests

from secrets import secrets

PORT = 8080
URLS = ("http://wifitest.adafruit.com/testwifi/index.html",
        "https://httpbin.org/get", )
ONE_SECOND_NS   =   1_000_000_000
FIVE_MINUTES_NS = 300_000_000_000

def get_led():
    led = None
    if hasattr(board, "LED"):
        led = digitalio.DigitalInOut(board.LED)
    else:
        print(f"No LED")
    if led: led.switch_to_output(False)
    return led

def get_pix():
    pix = []
    num_pix = 1
    if hasattr(board, "NEOPIXEL"):
        import neopixel
        pix = neopixel.NeoPixel(board.NEOPIXEL, num_pix, brightness=0.25, pixel_order=neopixel.GRB)
    elif hasattr(board, "APA102_SCK"):  # UMFeatherS2; FunHouse & Itsy Bitsy M4 also have DOTSTAR_CLOCK
        import adafruit_dotstar
        pix = adafruit_dotstar.DotStar(board.APA102_SCK, board.APA102_MOSI, num_pix, brightness=0.25)
    else:
        print(f"No NeoPixel. No Dotstar")
    if pix: pix[0] = 0
    return pix

def connect():
    if pix: pix[0] = (0, 255, 255)  # green-blue when connecting
    print(f"{time.monotonic_ns()} Connected? ", end="")
    consecutive_connect_errors = 0
    while not wifi.radio.ipv4_address:  # check for 0.0.0.0 ...
        try:
            print(f"Connecting... ", end="")
            wifi.radio.connect(secrets["ssid"], secrets["password"])
        except ConnectionError as e:
            consecutive_connect_errors += 1
            print(f"ConnectionError: {e} consecutive={consecutive_connect_errors}", end=" ")
            time.sleep(1)
            if consecutive_connect_errors > 8:
                print(f"Too many consecutive connect failures. Reloading...")
                supervisor.reload()
    time.sleep(0.0100)  # Pico W wifi.radio.ipv4_address can lag wifi.radio.connect by tens of ms
    print(f"{wifi.radio.ipv4_address} ", end="")
    if hasattr(wifi.radio, "ap_info.rssi") and wifi.radio.ap_info:
        time.sleep(1)  # ap_info takes a moment to be valid
        print(f"RSSI={wifi.radio.ap_info.rssi} ", end="")
    print()
    if pix: pix[0] = 0

def poll():
    if not wifi.radio.ipv4_address:
        connect()

    if pix: pix[0] = (255, 0, 255)  # violet for polling
    try:
        server.poll()
    except Exception as ex:
        traceback.print_exception(ex, ex, ex.__traceback__)
    if pix: pix[0] = 0

def get_url(url):
    if not wifi.radio.ipv4_address:
        connect()

    if pix: pix[0] = (0, 0, 255)  # blue for requests
    try:
        with requests.request("GET", url, timeout=5,) as r:
            print(f"\n{time.monotonic_ns()} {r.status_code} {r.reason.decode('utf-8', 'ignore')} {r.headers}")
    except Exception as e:
        traceback.print_exception(e, e, e.__traceback__)
    if pix: pix[0] = 0

led = get_led()
pix = get_pix()

connect()
pool = socketpool.SocketPool(wifi.radio)
server = HTTPServer(pool)
server.start(str(wifi.radio.ipv4_address), port=PORT)
requests = adafruit_requests.Session(pool, ssl.create_default_context())

@server.route("/")
def base(request: HTTPRequest):
    with HTTPResponse(request, content_type=MIMEType.TYPE_HTML) as response:
        print(f"\n{time.monotonic_ns()} Serving route...")
        response.send_file("index.html")

one_second_start = time.monotonic_ns() - ONE_SECOND_NS
five_minute_start = time.monotonic_ns() - FIVE_MINUTES_NS
while True:

    if (time.monotonic_ns() - one_second_start) >= ONE_SECOND_NS:
        if led: led.value = not led.value
        print(f"{time.monotonic_ns()} Listening on http://{wifi.radio.ipv4_address}:{PORT}", end="\r")
        one_second_start = time.monotonic_ns()

    try:
        poll()
    except OSError as e:
        traceback.print_exception(e, e, e.__traceback__)

    if (time.monotonic_ns() - five_minute_start) >= FIVE_MINUTES_NS:
        print(f"\n{time.monotonic_ns()} Requesting url...")
        url = random.choice(URLS)
        get_url(url)
        five_minute_start = time.monotonic_ns()

Behavior

8.0.0-beta code on ESP2-S2 enters safe mode under certain circumstances I can't identify exactly.

Same projects on ESP32-S2 7.3.3 never enter safe mode. Same projects on Pico W never enter safe mode.

Characteristics of projects entering safe mode:

always involves httpserver (possibly httpserver combined with requests)
several relatively-complex requests-only (no httpserver) projects never enter safe mode
doesn't seem to matter if httpserver ever serves any routes
web workflow on or off doesn't seem to matter (lately I've been testing with web workflow off)
async or not doesn't seem to matter, though for a long time I though it might since one async webclient-only (no httpserver) project also goes into safe mode (lately I've been testing projects reverted away from async)
presence or absence of boot.py doesn't seem to matter
library versions don't seem to matter - this has been occurring for months (latest tests using 20230110)
various versions and rewrites to try to get a minimal example all go into safe mode - finally got a shareable code example (above) that's still somewhat large and probably has features it doesn't need, but when I've made it too simple it doesn't enter safe mode (I'll keep trying to simplify the example)

Behavior: after some indeterminate duration (typically some hours)...

Usually:

You are in safe mode because: CircuitPython core code crashed hard. Whoops! Crash into the HardFault_Handler.

Occasionally:

You are in safe mode because: Internal watchdog timer expired.

I just started up 4 with the above code a half hour ago, fresh from a reset, and three have crashed: two hardfault and one internal watchdog. It always enters safe mode after the "Listening on http://..." print statement, most likely in server.poll(), and if so then most likely in _sock.accept().

Description

No response

Additional information

No response

tannewt commented 1 year ago

@anecdata Can you do it with a debug build and reply with the backtrace?

anecdata commented 1 year ago

I haven't been set up to build for almost a year, but I can work on that when I get back from my upcoming trip.

dhalbert commented 1 year ago

I tried to reproduce this with the test program above, with the URL's you gave, and also with some of my own. I ran trials for several hours.

I couldn't reproduce, but I have pretty reliable wifi. I also shortened the fetching interval a lot, to see if that made any difference, and tried just http and just https. So maybe it is due to some external failure?

anecdata commented 1 year ago

I have pretty robust commercial-grade wifi. Signal strengths are good and disconnections are very rare. I've been battling this for months on a range of projects, most common remaining theme is httpserver + requests client. It's very reproducible for me, though time between safe modes varies quite a bit. It's hard to rule out some external factor, I just can't think of what would account for this.

I also split up the test program above: 4 QT Py ESP32-S2, one running exactly as above, one running the HTTP URL only, one running the HTTPS URL only, and one doing both each loop. Oddly, the one doing only HTTPS has not crashed at all (almost two days), but all of the others have entered safe mode a dozen or more times since I started them up - the count is misleading since it may have died many hours before by the time I find it. No wi-fi disconnections in all that time.

I'm actually using an edited adafruit_requests.py that has print statements so I see when internal exceptions are caught and retried (otherwise invisible to user code). I don't know if this is unusual compared to others, but Requests almost always fails on the first internal try at this line: https://github.com/adafruit/Adafruit_CircuitPython_Requests/blob/203d0b1489cb90a39f8a780570287f1a5bd610a0/adafruit_requests.py#L679 but then almost always succeeds on the automatic internal retry. Other internal exceptions occasionally occur, but in virtually all cases, the Request is successful before internal retries are exhausted, so it doesn't seem like this is a (major) problem. At least I wouldn't expect it to trigger safe mode.

I don't have a build setup currently. Scott offered to make a debug build. I'm here through tomorrow and could try to get the debug trace. Then I'm gone until late next week. After that, I can devote some time to getting building going again.

dhalbert commented 1 year ago

@anecdata thanks for the detailed reply. Here is a QT Py ESP32-S2 debug build (ulab is turned off) from the current tip of main, 1ab1cd3. qtpy_esp32s2_debug.zip [EDIT: no DEBUG UART pins, oops]

anecdata commented 1 year ago

Prep DevKit N4R2 debug build: esptool erase flash, storage.erase_filesystem(), esptool firmware.bin, install code and support files, microcontroller.reset().

(some time later...) Having some trouble with TLS requests. This S3 build works: Adafruit CircuitPython 8.0.0-beta.6-45-g1ab1cd3ba on 2023-01-19; ESP32-S2-DevKitC-1-N4R2 with ESP32S2 but not this debug build: Adafruit CircuitPython 8.0.0-beta.6-45-g1ab1cd3ba-dirty on 2023-01-18; ESP32-S2-DevKitC-1-N4R2 with ESP32S2 I'll let it keep running with HTTP site only and see what happens. Not sure why one build would work, but not another.

dhalbert commented 1 year ago

I also am seeing https failing on the DEBUG build, on a Metro ESP32-S2. I tried another https host, same thing. I think this is an important clue, indicating maybe there is some uninitialized variable, or we're ignoring error returns at a crucial point, or something.
I would like to speed up trying to get the crash happen. Your test program has a 5 minute cycle. I tried faster and that speed. Do you think trying faster might prevent the crash for some reason (no socket timeout, etc.)?

anecdata commented 1 year ago

I'm not sure, I wondered that last night and sped mine up for a time, but didn't notice anything. Very unscientific.

BTW, the debug build is still running on the DevKit from last night, exceptions on the HTTPS url but success on the HTTP adafruit url. Odd, since the original runs HTTPS-only fine but safe modes on HTTP-only. Could even be a data size / alignment issue somewhere?

I'm going to load up one of my earlier more complex projects and see if I can cause safe mode on the debug build

Addendum: the earlier code has not crashed yet either on the debug build. Any of these project iterations would fail relatively soon on any release build. Are there still optimization differences in the debug build.

dhalbert commented 1 year ago

Problem on the DEBUG build is that sock.connect((connect_host, port)) in _get_socket() in adafruit_requests for the https request. The same connect() succeeds on non-DEBUG build. I will pursue that further for at least a while.

anecdata commented 1 year ago

Got a halt rather than safe mode, looks like the address is corrupted though, decode_backtrace can't decode it:

Guru Meditation Error: Core  0 panic'ed (Unhandled debug exception). 
Debug exception reason: BREAK instr 
Core  0 register dump:
PC      : 0x400283c0  PS      : 0x00060a36  A0      : 0x3ffd98a4  A1      : 0x3ffd98ac  
A2      : 0x3f4c0058  A3      : 0x00000001  A4      : 0x80036348  A5      : 0x3ffe7b40  
A6      : 0x00060023  A7      : 0x00000003  A8      : 0x00060023  A9      : 0x3ffe7bc0  
A10     : 0x3ffd1b64  A11     : 0xffffffff  A12     : 0x00000000  A13     : 0x00000000  
A14     : 0x3ffe6c8c  A15     : 0x3ffe6c40  SAR     : 0x00000014  EXCCAUSE: 0x00000001  
EXCVADDR: 0xfffffff0  LBEG    : 0x00000000  LEND    : 0x00000000  LCOUNT  : 0x4002ef88  

Backtrace: 0x400283bd:0x3ffd98ac |<-CORRUPTED
ELF file SHA256: 53665fd10c530e5a

CPU halted.

I'll keep trying to get something decode-able.

Addendum:

Different halt, also corrupted and not decode-able:

E (14026) esp-x509-crt-bundle: PK verify failed with error FFFFBD70
E (14026) esp-x509-crt-bundle: Failed to verify certificate
E (14036) esp-tls-mbedtls: mbedtls_ssl_handshake returned -0x3000
E (14036) esp-tls: Failed to open new connection
Guru Meditation Error: Core  0 panic'ed (Unhandled debug exception). 
Debug exception reason: BREAK instr 
Core  0 register dump:
PC      : 0x400283c0  PS      : 0x00060036  A0      : 0x00060023  A1      : 0x3ffe7bc0  
A2      : 0x3ffd1b64  A3      : 0xffffffff  A4      : 0x00000000  A5      : 0x00000000  
A6      : 0x3ffe6c8c  A7      : 0x3ffe6c40  A8      : 0x80036348  A9      : 0x3ffe7bb0  
A10     : 0x00060a23  A11     : 0x00000000  A12     : 0x00060a20  A13     : 0x00000001  
A14     : 0x3ffd9bb8  A15     : 0x00000001  SAR     : 0x0000001e  EXCCAUSE: 0x00000001  
EXCVADDR: 0xfffffff1  LBEG    : 0x00060a20  LEND    : 0x00000001  LCOUNT  : 0x4002ef88  

Backtrace: 0x400283bd:0x3ffe7bc0 0x00060020:0x3ffe55f0 |<-CORRUPTED

ELF file SHA256: 53665fd10c530e5a

CPU halted.

Addendum 2:

I keep getting halts, but the Backtrace is always corrupted and not decode-able. I'll try some different code.

dhalbert commented 1 year ago

When DEBUG=1, there is something wrong with the certificate verification (same errors as in the previous post):

E (13018) esp-x509-crt-bundle: PK verify failed with error FFFFBD70
E (13018) esp-x509-crt-bundle: Failed to verify certificate
E (13018) esp-tls-mbedtls: mbedtls_ssl_handshake returned -0x3000
E (13028) esp-tls: Failed to open new connection

@anecdata, when you saw the cert errors shown in your post, was DEBUG=1?

I think I have narrowed it down to some difference between the sdkconfig-debug.defaults and sdkconfig-opt.defaults. It does not seem to have to do with the optimization level.

anecdata commented 1 year ago

Yes, those messages only come up in the debug build (the one you sent).

This could be a completely different issue than the safe mode then.

dhalbert commented 1 year ago

This could be a completely different issue than the safe mode then.

Yes, I think so. I'm now trying to turn on the log messages without turning on the other DEBUG stuff.

jepler commented 1 year ago

FFFFBD70 is -17040 is possibly a memory allocation failure but I'll be danged if I can actually figure out mbedtls error numbers. https://github.com/espressif/arduino-esp32/issues/2286

anecdata commented 1 year ago

-0x3000 may be a key exchange error? https://github.com/espressif/mbedtls/blob/89da85968417ce4a163b7275cb22f159995c271d/include/mbedtls/error.h

anecdata commented 1 year ago

Finally got an uncorrupted backtrace (DEBUG=1 build that errors on HTTPS):

E (223807636) esp-x509-crt-bundle: PK verify failed with error FFFFBD70
E (223807636) esp-x509-crt-bundle: Failed to verify certificate
E (223807636) esp-tls-mbedtls: mbedtls_ssl_handshake returned -0x3000
E (223807646) esp-tls: Failed to open new connection
Guru Meditation Error: Core  0 panic'ed (StoreProhibited). Exception was unhandled.

Core  0 register dump:
PC      : 0x40038b0e  PS      : 0x00060433  A0      : 0x800396f0  A1      : 0x3ffdcff0  
A2      : 0x3ffdf724  A3      : 0x3fff47e4  A4      : 0x80036348  A5      : 0x3ffe55b0  
A6      : 0x00060023  A7      : 0x00000003  A8      : 0x00000008  A9      : 0x00000000  
A10     : 0x3ffdf744  A11     : 0x00000010  A12     : 0x3ffe7bb0  A13     : 0x80036348  
A14     : 0x00060823  A15     : 0x80036348  SAR     : 0x0000001c  EXCCAUSE: 0x0000001d  
EXCVADDR: 0x0006082b  LBEG    : 0x3ffe7bb0  LEND    : 0x80036348  LCOUNT  : 0x4002cc18  

Backtrace: 0x40038b0b:0x3ffdcff0 0x400396ed:0x3ffdd010 0x4002cfdd:0x3ffdd030 0x4002cff1:0x3ffdd060 0x4002d01e:0x3ffdd080 0x4003ad31:0x3ffdd0a0 0x400f4e18:0x3ffdd0c0 0x400f4e96:0x3ffdd0e0 0x400f4ef9:0x3ffdd100 0x400f3d90:0x3ffdd120 0x400f3a54:0x3ffdd140 0x400b7600:0x3ffdd180

ELF file SHA256: 53665fd10c530e5a

CPU halted.

? Backtrace: 0x40038b0b:0x3ffdcff0 0x400396ed:0x3ffdd010 0x4002cfdd:0x3ffdd030 0x4002cff1:0x3ffdd060 0x4002d01e:0x3ffdd080 0x4003ad31:0x3ffdd0a0 0x400f4e18:0x3ffdd0c0 0x400f4e96:0x3ffdd0e0 0x400f4ef9:0x3ffdd100 0x400f3d90:0x3ffdd120 0x400f3a54:0x3ffdd140 0x400b7600:0x3ffdd180
got ['0x40038b0b', '0x400396ed', '0x4002cfdd', '0x4002cff1', '0x4002d01e', '0x4003ad31', '0x400f4e18', '0x400f4e96', '0x400f4ef9', '0x400f3d90', '0x400f3a54', '0x400b7600']
/home/halbert/repos/circuitpython/ports/espressif/esp-idf/components/heap/heap_tlsf.c:212
/home/halbert/repos/circuitpython/ports/espressif/esp-idf/components/heap/multi_heap.c:197
/home/halbert/repos/circuitpython/ports/espressif/esp-idf/components/heap/heap_caps.c:154
/home/halbert/repos/circuitpython/ports/espressif/esp-idf/components/heap/heap_caps.c:174
/home/halbert/repos/circuitpython/ports/espressif/esp-idf/components/heap/heap_caps.c:199
/home/halbert/repos/circuitpython/ports/espressif/esp-idf/components/newlib/heap.c:24
/home/halbert/repos/circuitpython/ports/espressif/esp-idf/components/lwip/lwip/src/core/mem.c:237
/home/halbert/repos/circuitpython/ports/espressif/esp-idf/components/lwip/lwip/src/core/memp.c:254
/home/halbert/repos/circuitpython/ports/espressif/esp-idf/components/lwip/lwip/src/core/memp.c:350 (discriminator 2)
/home/halbert/repos/circuitpython/ports/espressif/esp-idf/components/lwip/lwip/src/api/tcpip.c:321 (discriminator 2)
/home/halbert/repos/circuitpython/ports/espressif/esp-idf/components/lwip/lwip/src/api/sockets.c:2905
/home/halbert/repos/circuitpython/ports/espressif/esp-idf/components/lwip/lwip/src/include/lwip/sockets.h:636

jepler commented 1 year ago

206 /* Remove a free block from the free list.*/
207 static inline __attribute__((__always_inline__)) void remove_free_block(cont    rol_t* control, block_header_t* block, int fl, int sl)
208 {
209         block_header_t* prev = block->prev_free;
210         block_header_t* next = block->next_free;
211         tlsf_assert(prev && "prev_free field can not be null");
212         tlsf_assert(next && "next_free field can not be null");

The esp-idf allocation heap is corrupted, such as by writing before/after an allocated object, writing to memory address that's been deallocated previously, etc. Unfortunately, this doesn't point us any closer to where the incorrect memory write occurred.

dhalbert commented 1 year ago

This backtrace may reflect a DEBUG=1 error only, as I was unable to get this to work at all with DEBUG=1. Or perhaps it's an earlier indicator of what the other error is.

dgriswo commented 1 year ago

I have a project on the Unexpected Maker Feather S2 that enters safe mode with "Crash into the HardFault_Handler." or "Internal watchdog timer expired." This project doesn't have issues under 7.3.3.

Unlike anecdata's project, mine uses adafruit_mqtt not the adafruit_httpserver or adafruit_request libraries. Mine also seems to run for a couple days before crashing. I can open a different ticket, if people think my issue is unrelated.

prcutler commented 1 year ago

I'm seeing the same bug on a ESP32-S2 Reverse TFT. It's too new to have a 7.3.3 build, so I was unable to test if it worked without crashing on 7.3.3.

I'm using adafruit_requests to make one GET every 30 seconds. I've tried upping that to 60 seconds, but it still crashes, usually crashes within about 20 minutes, but sometimes as long as 2 hours. Like anecdate, I get one of the same two errors when entering safe mode:

Usually:

You are in safe mode because:
CircuitPython core code crashed hard. Whoops!
Crash into the HardFault_Handler.

Occasionally:

You are in safe mode because:
Internal watchdog timer expired.

If there's anything else I can share to help, please let me know.

anecdata commented 1 year ago

I just wanted to note that I have a case where I have two identical boards set up identically at the same time: erase flash first, then install .bin. One unit has way more safe modes during run than the other. It always has, but still happening when set up identically makes me wonder if there's something marginal in the flash or hardware.

dhalbert commented 1 year ago

@anecdata have you checked the 3.3V line to see if the voltages differ substantially?

And are they talking to the same host? Is one always started after another? Maybe one is stalling on a network operation?

bill88t commented 1 year ago

If the voltages are good, maybe try lowering psram speed and cpu freq temporarily to see if this goes away?

anecdata commented 1 year ago

Same host. Both at 3.28v. Reloads, resets, and safe modes occur at seemingly random times . code.py reloads for various exceptions and for a certain number of consecutive failures in wifi or network operations, and resets if there have been a certain number of consecutive reloads. Safe mode triggers on this particular project due to an intermittent bug in wifi monitor init that I haven't had time to dig in to deeper (need to file an issue). safemode.py resets it.

I can play with build params at some point, but I'm inclined to just swap out the "flaky" one for now to reduce downtime. I thought it was worth sharing that some safe modes may not be due to software.

dhalbert commented 1 year ago

@anecdata When you last looked at this, it seemed like it was board sample dependent, if I understand correctly. Should we leave this open?

anecdata commented 1 year ago

I meant only to note that in one instance, I had different apparent rates of safe mode between two otherwise identical boards.

The issue is still open, almost any non-trivial wifi / requests project I put together eventually goes into safe mode with recent 8.x.x (newer esp-idf is in there too, of course). I have a number of similar long-running 7.3.3 devices that never go into safe mode (though project code has evolved since then). It's difficult to characterize the issue since the safe modes are intermittent. I have since seen safe mode on requests-only projects with 8.1.0-beta.2 with no major complicating factors like http server, async, etc.

Further complicating things is the change in behavior with a DEBUG build, so we don't seem to be able to get an accurate debug backtrace.

It's also possible that the root cause for this is the same as other 8.x.x safe mode issues that have been filed.

jepler commented 1 year ago

Please re-test after idf 5 upgrade

anecdata commented 10 months ago

Closing. I think enough code has changed that fresh testing and new minimal example will be needed if this is still occurring.

adafruit / circuitpython