Closed: jepler closed this issue 8 months ago.
The demo is intended to be a 'webcam' that refreshes every 5 seconds.
I don't know if it's because of some wifi / camera interaction, or because of the repeated requests, or what.
I have the httpserver streaming mjpeg from the camera on esp32s2 working, and didn't see any hardfaults. The only difference is that I never close the socket, since it's mjpeg.
Oh, and no PWM for the master clock; I use an external crystal.
@jepler is it easy for you to retest this? I don't have a Kaluga, though I have an Adafruit OV5640 breakout.
I think I've seen this on a Feather S2 TFT as well during testing of the HTTPServer library upgrades. In my case there is no camera in use. I had neopixels and in one case a 14x4 segment featherwing.
I think I've seen it occur even when no other hardware is in the mix beyond just the httpserver, but I'm not 100% certain and much more of my testing did involve at least interacting with neopixels.
I believe the hard fault occurred after the server was left running for a few hours.
I will try to set up a minimal reproducer and catch some logs on the DBG pin.
I've managed to reproduce what I think may be the same error as this with a minimal HTTPServer test and capture the backtrace from it.
Here is the decoded backtrace:
```
❯ python tools/decode_backtrace.py adafruit_feather_esp32s2_tft
adafruit_feather_esp32s2_tft
? 0x4009a033:0x3ffdca00 0x4009d3bc:0x3ffdca20 0x3ffdca4d:0x3ffdca50 |<-CORRUPTED
0x4009a033: mp_obj_is_subclass_fast at /home/timc/repos/circuitpython/circuitpython/ports/espressif/../../py/objtype.c:1435
0x4009d3bc: mp_execute_bytecode at /home/timc/repos/circuitpython/circuitpython/ports/espressif/../../py/vm.c:1379
0x3ffdca4d: ?? ??:0
0x00000000: ?? ??:0
```
Looks like something caused corruption.
My test was run on a debug build from the main branch:
Adafruit CircuitPython 8.1.0-beta.2-31-g827eaeb1f-dirty on 2023-05-15; Adafruit Feather ESP32-S2 TFT with ESP32S2
Running the 'auto' simpletest from the httpserver library, although specifically the version from the open PR. https://github.com/michalpokusa/Adafruit_CircuitPython_HTTPServer/blob/4.0.0-examples-refactor-authentication-mimetypes/examples/httpserver_simpletest_auto.py I have been using this version on my device for testing and forgot to swap back to the released one before I started the test running.
I'm pretty sure that I have seen the same hardfault with the released versions though. I can also swap to that and start it back up and wait for another one to capture if that will be helpful.
Here's the full capture of info from the DBG pin at the time of the hard fault:
```
Guru Meditation Error: Core 0 panic'ed (LoadProhibited). Exception was unhandled.
Core 0 register dump:
PC : 0x4009a036 PS : 0x00060930 A0 : 0x8009d3bf A1 : 0x3ffdca00
A2 : 0x00000000 A3 : 0x3f00c8ac A4 : 0x3ff7ea30 A5 : 0x3ffdbd18
A6 : 0x3ffe56b0 A7 : 0x3ffe5610 A8 : 0x00000000 A9 : 0x3ffe5490
A10 : 0x00000000 A11 : 0x000028f2 A12 : 0x3fd8c640 A13 : 0x00000004
A14 : 0x00000000 A15 : 0x00000009 SAR : 0x00000011 EXCCAUSE: 0x0000001c
EXCVADDR: 0x00000000 LBEG : 0x3fd8c640 LEND : 0x00000004 LCOUNT : 0x4002cdc4
Backtrace: 0x4009a033:0x3ffdca00 0x4009d3bc:0x3ffdca20 0x3ffdca4d:0x3ffdca50 |<-CORRUPTED
```
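For reference: the `EXCCAUSE` value `0x1c` in these dumps is the Xtensa LoadProhibited cause, and `EXCVADDR: 0x00000000` means the faulting load dereferenced a NULL pointer. A tiny, hypothetical decoder for the handful of causes that come up most often in ESP32-S2 panics (names as the ESP-IDF panic handler prints them):

```python
# Hypothetical helper: map the EXCCAUSE register from an ESP32-S2 (Xtensa)
# register dump to the ESP-IDF panic-handler name. Only a few common causes
# are listed here.
XTENSA_EXCCAUSE = {
    0x00: "IllegalInstruction",
    0x14: "InstrFetchProhibited",  # tried to execute unmapped/protected code (e.g. 0x0)
    0x1C: "LoadProhibited",        # load from a protected/unmapped address
    0x1D: "StoreProhibited",       # store to a protected/unmapped address
}

def describe(exccause: int, excvaddr: int) -> str:
    """Render one line of a register dump in human-readable form."""
    name = XTENSA_EXCCAUSE.get(exccause, f"unknown (0x{exccause:02x})")
    return f"{name} at EXCVADDR 0x{excvaddr:08x}"

print(describe(0x1C, 0x00000000))  # → LoadProhibited at EXCVADDR 0x00000000
```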
I had the server script running and a client making GET requests to it every 10 seconds. There were 720 successfully completed GET requests before it hard faulted, so 7200 seconds, 120 minutes, or 2 hrs.
I believe that I have seen the hardfault even when there aren't requests being served. I can run another trial without the GET request client running if that could be helpful.
Possibly related to https://github.com/adafruit/circuitpython/issues/7459. I had corrupted backtraces too. A rare one was uncorrupted, but DEBUG apparently masked the original root cause.
@FoamyGuy if you speed up the GET requests does it fail faster? Sounds reproducible but it would be nice to get it to happen faster. Are you trying it with a local server which won't complain if you speed things up?
@dhalbert speeding up the requests does not seem to appreciably speed up the hard fault occurrence. After your comment I started this running again with only a 1 second interval (10x faster than before). I'm not sure if it's actually keeping to exactly 1 request per second, but it's up to 4718 completed requests already and no hard fault. I'll try to let it keep running until it does, to see if there is anything different that can be gleaned from its backtrace.
I left it running last night. It completed 8781 requests (at ~1 per sec) and then hard faulted sometime before the next one could be completed. Speeding up the requests does not seem to influence it to hard fault faster unfortunately.
The output on DBG pin and decoded backtrace are the same as before.
The decoded backtrace:
```
❯ python tools/decode_backtrace.py adafruit_feather_esp32s2_tft
adafruit_feather_esp32s2_tft
? 0x4009a033:0x3ffdca00 0x4009d3bc:0x3ffdca20 0x3ffdca4d:0x3ffdca50 |<-CORRUPTED
0x4009a033: mp_obj_is_subclass_fast at /home/timc/repos/circuitpython/circuitpython/ports/espressif/../../py/objtype.c:1435
0x4009d3bc: mp_execute_bytecode at /home/timc/repos/circuitpython/circuitpython/ports/espressif/../../py/vm.c:1379
0x3ffdca4d: ?? ??:0
0x00000000: ?? ??:0
```
The full output from DBG pin:
```
Guru Meditation Error: Core 0 panic'ed (LoadProhibited). Exception was unhandled.
Core 0 register dump:
PC : 0x4009a036 PS : 0x00060630 A0 : 0x8009d3bf A1 : 0x3ffdca00
A2 : 0x00000000 A3 : 0x3f00c8ac A4 : 0x3ff7ea30 A5 : 0x3ffdbd18
A6 : 0x3ffe56b0 A7 : 0x3ffe5610 A8 : 0x00000000 A9 : 0x3ffe5490
A10 : 0x00000000 A11 : 0x00018b39 A12 : 0x3fd8c5c0 A13 : 0x00000002
A14 : 0x00000000 A15 : 0x00000002 SAR : 0x00000014 EXCCAUSE: 0x0000001c
EXCVADDR: 0x00000000 LBEG : 0x3fd8c5c0 LEND : 0x00000002 LCOUNT : 0x4002cdc4
Backtrace: 0x4009a033:0x3ffdca00 0x4009d3bc:0x3ffdca20 0x3ffdca4d:0x3ffdca50 |<-CORRUPTED
ELF file SHA256: 64fd8eb9386d6178
CPU halted.
0x00000000: ?? ??:0
```
The fact it tried to run 0x0 is really sus. Like, how did this even happen? I would have to guess it moved to an incorrect mem address and treated it as a pointer, and it happened to be 0x0.
It is unclear to me whether this is due to updating the ESP-IDF or due to some CircuitPython core change from 7.x.
I was thinking of trying to pick up the latest ESP-IDF v4.4 changes, because as usual there are wifi fixes (some of which are in non-open-source code). But it's somewhat painful to do that. And for CircuitPython 9.0.0, we are going to try to switch to ESP-IDF V5 anyway.
We could try instrumenting the socket code and the Python more thoroughly to narrow down where the problem is. It's too bad it takes so long to reproduce. I thought increasing the rate would help, but it's mysterious that's not true.
You could try increasing the size of the response to something much larger (e.g. 1k or 10k characters).
I'm a bit confused: are you making HTTP or HTTPS requests on the server?
For these tests the server is running on the microcontroller and the client is a CPython script running on my PC, using the requests module to send the requests. All requests are HTTP only.
Yep, I can start it up with a larger response body.
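The client script itself isn't shown in this thread, but the polling loop can be sketched roughly like this (all names here are hypothetical; the fetch callable is injected so the loop can be exercised without a live device):

```python
import time
from typing import Callable, Optional

def poll(fetch: Callable[[], int], interval: float = 1.0,
         max_requests: Optional[int] = None) -> int:
    """Call fetch() repeatedly, logging one timestamped line per request.

    fetch should perform a single GET against the device and return the
    response size in bytes. Returns the number of completed requests.
    """
    count = 0
    while max_requests is None or count < max_requests:
        size = fetch()
        count += 1
        print(f"{time.strftime('%H:%M:%S')} request {count} ok ({size} bytes)")
        time.sleep(interval)
    return count
```

On the PC, `fetch` could be something like `lambda: len(requests.get("http://<device-ip>/", timeout=10).content)` with the requests library installed; a hard fault on the device then shows up as the loop raising a connection error.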
Do you folks have an intuition that #7459 and #7582 are the same issue?
> Do you folks have an intuition that #7459 and #7582 are the same issue?
It does seem possible to me that they are the same or related root cause.
I could be trying to read too much into it, but I noticed that some of the register dump values shown in the debug logs in that issue are similar to (but not exactly the same as) some of the ones in my dumps. That could be an unimportant detail that I homed in on, though.
It does sound like a similar time frame for reproduction and symptoms observed.
I did re-run the test with a larger body size in the return from the server. It ran for a while with a ~10k hardcoded response without hard faulting. The client did eventually have an exception raised which stopped it from running, but the microcontroller itself kept on running in this instance.
After that stopped itself I changed the client loop code slightly to print out timestamps so it's easier to verify how long it ran. I also modified the server code to randomly generate the size of the responses by duplicating a specific part of a string a random number of times and including the result in the response. I'm not sure why I thought to try this but thought maybe different responses each time could help speed up reproduction.
That version has now hard faulted for the first time. But unfortunately it still took a rather long time: it ran for 10025 successful requests at a target rate of 1 per second, from 10:20:47 to 13:49:38. The responses were between ~16k and ~26k characters.
The information printed to the DBG pin is exactly the same as the prior hard faults I observed (same log as posted above).
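The random-size-response tweak described above isn't shown, but it can be sketched like this (the filler text and the repeat bounds are hypothetical, chosen so the sizes land in the ~16k-26k range mentioned):

```python
import random

# Hypothetical reconstruction of the server-side change: repeat a fixed chunk
# of text a random number of times so every response body has a different size.
CHUNK = "Hello from the ESP32-S2 test server! "  # 37 chars, hypothetical filler

def random_body(min_repeats: int = 450, max_repeats: int = 700) -> str:
    """Return CHUNK repeated a random number of times (~16k-26k chars here)."""
    return CHUNK * random.randint(min_repeats, max_repeats)

body = random_body()
print(len(body))  # somewhere between 16650 and 25900
```

In the actual CircuitPython server script, `random_body()` would be called inside the route handler so each GET gets a differently sized response.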
It's hard to say. Could be the same root cause. Likely to be some IDF change? I haven't seen a safemode in over a week with several eligible devices running projects, but other times they have been more frequent. I do reload and reset liberally in code.py when various exceptions happen, so that could be preventing some safemodes. I've also backed away from some of the more complex projects that were causing the most trouble. I'll try to set up a few dedicated devices with test code and DEBUG to try to accelerate data collection on the issue.
re-test after esp-idf update in 9
Please re-test this with a 9.0 beta release. If it is still an issue, then file a new issue referencing this one.