Closed: jepler closed this issue 8 months ago.
The demo is intended to be a 'webcam' that refreshes every 5 seconds.
I don't know if it's because of some wifi / camera interaction, or because of the repeated requests, or what.
I have the httpserver streaming mjpeg from the camera on esp32s2 working, and didn't see any hardfaults. The only difference is that I never close the socket, since it's mjpeg.
Oh, and no PWM for the master clock; I use an external crystal.
@jepler is it easy for you to retest this? I don't have a Kaluga, though I have an Adafruit OV5640 breakout.
I think I've seen this on a Feather S2 TFT as well during testing of the HTTPServer library upgrades. In my case there is no camera in use. I had neopixels and in one case a 14x4 segment featherwing.
I think I've seen it occur even when no other hardware is in the mix beyond just the httpserver, but I'm not 100% certain and much more of my testing did involve at least interacting with neopixels.
I believe the hard fault occurred after the server was left running for a few hours.
I will try to set up a minimal reproducer and catch some logs on the DBG pin.
I've managed to reproduce what I think may be the same error as this with a minimal HTTPServer test and capture the backtrace from it.
Here is the decoded backtrace:
```
❯ python tools/decode_backtrace.py adafruit_feather_esp32s2_tft
adafruit_feather_esp32s2_tft
? 0x4009a033:0x3ffdca00 0x4009d3bc:0x3ffdca20 0x3ffdca4d:0x3ffdca50 |<-CORRUPTED
0x4009a033: mp_obj_is_subclass_fast at /home/timc/repos/circuitpython/circuitpython/ports/espressif/../../py/objtype.c:1435
0x4009d3bc: mp_execute_bytecode at /home/timc/repos/circuitpython/circuitpython/ports/espressif/../../py/vm.c:1379
0x3ffdca4d: ?? ??:0
0x00000000: ?? ??:0
```
Looks like something caused corruption.
My test was run on a debug build from the main branch:
Adafruit CircuitPython 8.1.0-beta.2-31-g827eaeb1f-dirty on 2023-05-15; Adafruit Feather ESP32-S2 TFT with ESP32S2
Running the 'auto' simpletest from the httpserver library, although specifically the version from the open PR. https://github.com/michalpokusa/Adafruit_CircuitPython_HTTPServer/blob/4.0.0-examples-refactor-authentication-mimetypes/examples/httpserver_simpletest_auto.py I have been using this version on my device for testing and forgot to swap back to the released one before I started the test running.
I'm pretty sure that I have seen the same hardfault with the released versions though. I can also swap to that and start it back up and wait for another one to capture if that will be helpful.
Here's the full capture of info from the DBG pin at the time of the hard fault:
```
Guru Meditation Error: Core 0 panic'ed (LoadProhibited). Exception was unhandled.
Core 0 register dump:
PC : 0x4009a036 PS : 0x00060930 A0 : 0x8009d3bf A1 : 0x3ffdca00
A2 : 0x00000000 A3 : 0x3f00c8ac A4 : 0x3ff7ea30 A5 : 0x3ffdbd18
A6 : 0x3ffe56b0 A7 : 0x3ffe5610 A8 : 0x00000000 A9 : 0x3ffe5490
A10 : 0x00000000 A11 : 0x000028f2 A12 : 0x3fd8c640 A13 : 0x00000004
A14 : 0x00000000 A15 : 0x00000009 SAR : 0x00000011 EXCCAUSE: 0x0000001c
EXCVADDR: 0x00000000 LBEG : 0x3fd8c640 LEND : 0x00000004 LCOUNT : 0x4002cdc4
Backtrace: 0x4009a033:0x3ffdca00 0x4009d3bc:0x3ffdca20 0x3ffdca4d:0x3ffdca50 |<-CORRUPTED
```
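For reference: the `EXCCAUSE` value `0x1c` in these dumps is the Xtensa LoadProhibited cause, and `EXCVADDR: 0x00000000` means the faulting load dereferenced a NULL pointer. A tiny, hypothetical decoder for the handful of causes that come up most often in ESP32-S2 panics (names as the ESP-IDF panic handler prints them):

```python
# Hypothetical helper: map the EXCCAUSE register from an ESP32-S2 (Xtensa)
# register dump to the ESP-IDF panic-handler name. Only a few common causes
# are listed here.
XTENSA_EXCCAUSE = {
    0x00: "IllegalInstruction",
    0x14: "InstrFetchProhibited",  # tried to execute unmapped/protected code (e.g. 0x0)
    0x1C: "LoadProhibited",        # load from a protected/unmapped address
    0x1D: "StoreProhibited",       # store to a protected/unmapped address
}

def describe(exccause: int, excvaddr: int) -> str:
    """Render one line of a register dump in human-readable form."""
    name = XTENSA_EXCCAUSE.get(exccause, f"unknown (0x{exccause:02x})")
    return f"{name} at EXCVADDR 0x{excvaddr:08x}"

print(describe(0x1C, 0x00000000))  # → LoadProhibited at EXCVADDR 0x00000000
```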
I had the server script running and a client making GET requests to it every 10 seconds. There were 720 successfully completed GET requests before it hard faulted, so 7200 seconds, 120 minutes, or 2 hrs.
I believe that I have seen the hardfault even when there aren't requests being served. I can run another trial without the GET request client running if that could be helpful.
Possibly related to https://github.com/adafruit/circuitpython/issues/7459. I had corrupted backtraces too. A rare one was uncorrupted, but DEBUG apparently masked the original root cause.
@FoamyGuy if you speed up the GET requests does it fail faster? Sounds reproducible but it would be nice to get it to happen faster. Are you trying it with a local server which won't complain if you speed things up?
@dhalbert speeding up the requests does not seem to appreciably speed up the hard fault occurrence. After your comment I started this running again with only a 1 second interval (10x faster than before). I'm not sure if it's actually keeping to exactly 1 request per second, but it's up to 4718 completed requests already and no hard fault. I'll try to let it keep running until it does, to see if there is anything different that can be gleaned from its backtrace.
I left it running last night. It completed 8781 requests (at ~1 per sec) and then hard faulted sometime before the next one could be completed. Speeding up the requests does not seem to influence it to hard fault faster unfortunately.
The output on DBG pin and decoded backtrace are the same as before.
The decoded backtrace:
```
❯ python tools/decode_backtrace.py adafruit_feather_esp32s2_tft
adafruit_feather_esp32s2_tft
? 0x4009a033:0x3ffdca00 0x4009d3bc:0x3ffdca20 0x3ffdca4d:0x3ffdca50 |<-CORRUPTED
0x4009a033: mp_obj_is_subclass_fast at /home/timc/repos/circuitpython/circuitpython/ports/espressif/../../py/objtype.c:1435
0x4009d3bc: mp_execute_bytecode at /home/timc/repos/circuitpython/circuitpython/ports/espressif/../../py/vm.c:1379
0x3ffdca4d: ?? ??:0
0x00000000: ?? ??:0
```
The full output from DBG pin:
```
Guru Meditation Error: Core 0 panic'ed (LoadProhibited). Exception was unhandled.
Core 0 register dump:
PC : 0x4009a036 PS : 0x00060630 A0 : 0x8009d3bf A1 : 0x3ffdca00
A2 : 0x00000000 A3 : 0x3f00c8ac A4 : 0x3ff7ea30 A5 : 0x3ffdbd18
A6 : 0x3ffe56b0 A7 : 0x3ffe5610 A8 : 0x00000000 A9 : 0x3ffe5490
A10 : 0x00000000 A11 : 0x00018b39 A12 : 0x3fd8c5c0 A13 : 0x00000002
A14 : 0x00000000 A15 : 0x00000002 SAR : 0x00000014 EXCCAUSE: 0x0000001c
EXCVADDR: 0x00000000 LBEG : 0x3fd8c5c0 LEND : 0x00000002 LCOUNT : 0x4002cdc4
Backtrace: 0x4009a033:0x3ffdca00 0x4009d3bc:0x3ffdca20 0x3ffdca4d:0x3ffdca50 |<-CORRUPTED
ELF file SHA256: 64fd8eb9386d6178
CPU halted.
0x00000000: ?? ??:0
```
The fact it tried to run 0x0 is really sus. Like, how did this even happen? I would have to guess it moved to an incorrect mem address and treated it as a pointer, and it happened to be 0x0.
It is unclear to me whether this is due to updating the ESP-IDF or due to some CircuitPython core change from 7.x.
I was thinking of trying to pick up the latest ESP-IDF v4.4 changes, because as usual there are wifi fixes (some of which are in non-open-source code). But it's somewhat painful to do that. And for CircuitPython 9.0.0, we are going to try to switch to ESP-IDF V5 anyway.
We could try instrumenting the socket code and the Python more thoroughly to narrow down where the problem is. It's too bad it takes so long to reproduce. I thought increasing the rate would help, but it's mysterious that's not true.
You could try increasing the size of the response to something much larger (e.g. 1k or 10k characters).
I'm a bit confused: are you making HTTP or HTTPS requests on the server?
For these tests the server is running on the microcontroller and the client is a CPython script running on my PC, using the requests module to send the requests. All requests are HTTP only.
Yep, I can start it up with a larger response body.
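The client script itself isn't shown in this thread, but the polling loop can be sketched roughly like this (all names here are hypothetical; the fetch callable is injected so the loop can be exercised without a live device):

```python
import time
from typing import Callable, Optional

def poll(fetch: Callable[[], int], interval: float = 1.0,
         max_requests: Optional[int] = None) -> int:
    """Call fetch() repeatedly, logging one timestamped line per request.

    fetch should perform a single GET against the device and return the
    response size in bytes. Returns the number of completed requests.
    """
    count = 0
    while max_requests is None or count < max_requests:
        size = fetch()
        count += 1
        print(f"{time.strftime('%H:%M:%S')} request {count} ok ({size} bytes)")
        time.sleep(interval)
    return count
```

On the PC, `fetch` could be something like `lambda: len(requests.get("http://<device-ip>/", timeout=10).content)` with the requests library installed; a hard fault on the device then shows up as the loop raising a connection error.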
Do you folks have an intuition that #7459 and #7582 are the same issue?
> Do you folks have an intuition that #7459 and #7582 are the same issue?
It does seem possible to me that they are the same or related root cause.
I could be trying to read too much into it, but I noticed that some of the register dump values shown in the debug logs in that issue are similar to (but not exactly the same as) some of the ones in my dumps. That could be an unimportant detail that I homed in on, though.
It does sound like a similar time frame for reproduction and symptoms observed.
I did re-run the test with a larger body size in the return from the server. It ran for a while with a ~10k hardcoded response without hard faulting. The client did eventually have an exception raised which stopped it from running, but the microcontroller itself kept on running in this instance.
After that stopped itself I changed the client loop code slightly to print out timestamps so it's easier to verify how long it ran. I also modified the server code to randomly generate the size of the responses by duplicating a specific part of a string a random number of times and including the result in the response. I'm not sure why I thought to try this but thought maybe different responses each time could help speed up reproduction.
That version has now hard faulted for the first time. But unfortunately it still took a rather long time: it ran for 10025 successful requests at a target rate of 1 per second, from 10:20:47 to 13:49:38. The responses were between ~16k and ~26k characters.
The information printed to the DBG pin is exactly the same as the prior hard faults I observed (same log as posted above).
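The random-size-response tweak described above isn't shown, but it can be sketched like this (the filler text and the repeat bounds are hypothetical, chosen so the sizes land in the ~16k-26k range mentioned):

```python
import random

# Hypothetical reconstruction of the server-side change: repeat a fixed chunk
# of text a random number of times so every response body has a different size.
CHUNK = "Hello from the ESP32-S2 test server! "  # 37 chars, hypothetical filler

def random_body(min_repeats: int = 450, max_repeats: int = 700) -> str:
    """Return CHUNK repeated a random number of times (~16k-26k chars here)."""
    return CHUNK * random.randint(min_repeats, max_repeats)

body = random_body()
print(len(body))  # somewhere between 16650 and 25900
```

In the actual CircuitPython server script, `random_body()` would be called inside the route handler so each GET gets a differently sized response.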
It's hard to say. Could be the same root cause. Likely to be some IDF change? I haven't seen a safemode in over a week with several eligible devices running projects, but other times they have been more frequent. I do reload and reset liberally in code.py when various exceptions happen, so that could be preventing some safemodes. I've also backed away from some of the more complex projects that were causing the most trouble. I'll try to set up a few dedicated devices with test code and DEBUG to try to accelerate data collection on the issue.
re-test after esp-idf update in 9
Please re-test this with a 9.0 beta release. If it is still an issue, then file a new issue referencing this one.