Open chilipeppr opened 1 year ago
Wemos S2 Mini?
Yes.
I increased the number of asyncio tasks to 30 and started your test program on a UM Feather S2 and an Adafruit Feather ESP32-S2 about 13 hours ago. They haven't crashed yet, but I'll let them run overnight. I'm assuming that adding the extra tasks didn't mess the test up. Unfortunately, I don't have a Wemos S2 Mini to test on; hopefully it's not hardware or port specific.
I did notice that the Lolin S2 boards are among the relatively few S2 boards that have CIRCUITPY_ESP_FLASH_FREQ set to 80m in mpconfigboard.mk. Your test script doesn't access flash so that's probably not relevant though.
I was actually using a Lolin S2 Mini, not a Wemos S2 Mini, but aren't they basically the same thing?
I've been running this even longer since I posted, and I did get a HARD_FAULT again a couple of hours ago. My log is below. I also sometimes get WATCHDOG as the reason for a reboot. So my period between HARD_FAULTs is about 22 hours.
I'm building a Marble Run for the local school's STEM lab to inspire the kids, and this thing needs to run 24x7 so the kids can hit the button whenever they want all day to launch the marbles down a huge track. So I was noticing the hard faults on an ongoing basis and initially thought it was my code, but it really seems like it's just the OS doing it.
From safemode.py. Kill switch off. 2023-05-01 21:45:06, supervisor.RunReason.STARTUP, microcontroller.ResetReason.SOFTWARE, supervisor.SafeModeReason.HARD_FAULT
From main.py. Kill switch off. 2023-05-01 21:44:54, supervisor.RunReason.STARTUP, microcontroller.ResetReason.SOFTWARE, supervisor.SafeModeReason.NONE
From safemode.py. Kill switch off. 2023-05-02 00:18:08, supervisor.RunReason.STARTUP, microcontroller.ResetReason.SOFTWARE, supervisor.SafeModeReason.HARD_FAULT
From main.py. Kill switch off. 2023-05-02 00:18:03, supervisor.RunReason.STARTUP, microcontroller.ResetReason.SOFTWARE, supervisor.SafeModeReason.NONE
From main.py. Kill switch off. 2023-05-02 06:07:34, supervisor.RunReason.AUTO_RELOAD, microcontroller.ResetReason.SOFTWARE, supervisor.SafeModeReason.NONE
From main.py. Kill switch off. 2023-05-02 06:20:17, supervisor.RunReason.REPL_RELOAD, microcontroller.ResetReason.SOFTWARE, supervisor.SafeModeReason.NONE
From safemode.py. Kill switch off. 2023-05-02 11:00:43, supervisor.RunReason.STARTUP, microcontroller.ResetReason.WATCHDOG, supervisor.SafeModeReason.WATCHDOG
From main.py. Kill switch off. 2023-05-02 10:59:46, supervisor.RunReason.STARTUP, microcontroller.ResetReason.SOFTWARE, supervisor.SafeModeReason.NONE
From safemode.py. Kill switch off. 2023-05-02 18:22:22, supervisor.RunReason.STARTUP, microcontroller.ResetReason.SOFTWARE, supervisor.SafeModeReason.HARD_FAULT
From main.py. Kill switch off. 2023-05-02 18:23:10, supervisor.RunReason.STARTUP, microcontroller.ResetReason.SOFTWARE, supervisor.SafeModeReason.NONE
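To put a number on the time between faults, the entries above can be parsed and the gaps between consecutive HARD_FAULT records computed. This is a minimal sketch that runs on desktop Python; the field layout is assumed from the log lines in this comment, and `parse_entry` and `fault_intervals` are hypothetical helper names, not part of the logging code in this report.

```python
from datetime import datetime

# Sample entries copied from the log above (safemode.py records only).
LOG = """
From safemode.py. Kill switch off. 2023-05-01 21:45:06, supervisor.RunReason.STARTUP, microcontroller.ResetReason.SOFTWARE, supervisor.SafeModeReason.HARD_FAULT
From safemode.py. Kill switch off. 2023-05-02 00:18:08, supervisor.RunReason.STARTUP, microcontroller.ResetReason.SOFTWARE, supervisor.SafeModeReason.HARD_FAULT
From safemode.py. Kill switch off. 2023-05-02 11:00:43, supervisor.RunReason.STARTUP, microcontroller.ResetReason.WATCHDOG, supervisor.SafeModeReason.WATCHDOG
From safemode.py. Kill switch off. 2023-05-02 18:22:22, supervisor.RunReason.STARTUP, microcontroller.ResetReason.SOFTWARE, supervisor.SafeModeReason.HARD_FAULT
"""


def parse_entry(line):
    """Split one log line into its fields (layout assumed from the log above)."""
    head, run_reason, reset_reason, safe_mode_reason = [
        part.strip() for part in line.split(",")
    ]
    # head looks like "From safemode.py. Kill switch off. 2023-05-01 21:45:06"
    source = head.split(". ")[0].replace("From ", "")
    stamp = datetime.strptime(head[-19:], "%Y-%m-%d %H:%M:%S")
    return {
        "source": source,
        "time": stamp,
        "run_reason": run_reason,
        "reset_reason": reset_reason,
        "safe_mode_reason": safe_mode_reason,
    }


def fault_intervals(lines):
    """Return the timedeltas between consecutive HARD_FAULT entries."""
    entries = [parse_entry(line) for line in lines if line.strip()]
    faults = sorted(
        e["time"] for e in entries if e["safe_mode_reason"].endswith("HARD_FAULT")
    )
    return [later - earlier for earlier, later in zip(faults, faults[1:])]


intervals = fault_intervals(LOG.splitlines())
for gap in intervals:
    print("hours between hard faults:", round(gap.total_seconds() / 3600, 2))
```

Sorting by timestamp before taking differences keeps the result independent of the order in which the log file lists its entries.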
Yeah, I think they are the same board. Hopefully I'll see a crash by morning. If not, maybe I'll try loading the Lolin firmware on one of them and see if I can reproduce the crash that way.
Here's my boot_out.txt, but keep in mind I saw the same thing on 8.0.5.
Adafruit CircuitPython 8.1.0-beta.1 on 2023-03-30; S2Mini with ESP32S2-S2FN4R2
Board ID:lolin_s2_mini
UID:487F307D7D25
I had a bit of a glitch overnight last night which killed my terminal sessions, so from yesterday's test runs all I know is that neither board crashed after about 13 hours.
I started the tests up again this morning but this time, I loaded the Lolin S2 mini firmware up on one of the boards first. The test program has been running again for about 13 hours. I'll check them again in the morning and hopefully at least one of them will have hard faulted.
Sounds good. Yeah, I left mine running again today too, but I had accidentally left it in the REPL, so I missed out on my test today as well.
Had a HARD_FAULT again last night.
The timestamp is odd. Safemode got hit at 3:33AM. Then it rebooted into normal mode which should be about 10 seconds later, but the clock says 3:27AM. I do the RTC lookup when booting in normal mode, so it just makes me think the clock runs fast on the ESP32. Is it possible a fast clock throws stuff off over time and that's what causes a hard fault?
From safemode.py. Kill switch off. 2023-05-04 03:33:58, supervisor.RunReason.STARTUP, microcontroller.ResetReason.SOFTWARE, supervisor.SafeModeReason.HARD_FAULT
From main.py. Kill switch off. 2023-05-04 03:27:43, supervisor.RunReason.STARTUP, microcontroller.ResetReason.SOFTWARE, supervisor.SafeModeReason.NONE
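The size of that clock discrepancy can be worked out from the two entries above. This is a small sketch that runs on desktop Python, under two assumptions: the main.py timestamp was taken after the clock had been set from the network, and main.py really started about 10 seconds after safemode.py. `apparent_skew_seconds` is a hypothetical helper name.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"


def apparent_skew_seconds(before_sync, after_sync, boot_gap_s=10):
    """Estimate how far ahead the free-running clock was.

    before_sync: timestamp logged before the clock was corrected (safemode.py).
    after_sync:  timestamp logged after the clock was corrected (main.py).
    boot_gap_s:  assumed real time elapsed between the two log entries.
    """
    before = datetime.strptime(before_sync, FMT)
    after = datetime.strptime(after_sync, FMT)
    # The corrected reading is boot_gap_s later in real time, so add that gap
    # back to compare both readings at the same instant.
    return (before - after).total_seconds() + boot_gap_s


skew = apparent_skew_seconds("2023-05-04 03:33:58", "2023-05-04 03:27:43")
print("clock was ahead by about", skew, "seconds")
```

Dividing the skew by the uptime since the previous clock sync would give a drift rate, but that uptime is not recorded in these two entries.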
No luck for me, 25 hours, 2 boards, 30 asyncio loops and no faults yet. I'll keep them running but I'm wondering if it's something with the specific board. Do you have just one of the Lolin boards?
I've gone ahead and ordered one which should be here in about 10 days but I also seem to remember that there was an issue with different manufacturers of these boards using different parts that behaved differently. Hopefully I've just been lucky and I'll be able to reproduce this on one of these s2 boards and then eventually do some debugging. 🍀
I have tested this on 2 different Lolin S2 Minis and see the same issue, but I could try this on other ESP32s now that you mention it could be specific to the version of the chip they used.
Do you think it has anything to do with me activating the display?
Well, I'm not doing anything with the display, so I'd suggest you run the same test you posted above. I don't think just having the display physically attached should be an issue if the software doesn't address it.
Another thought, how are you powering the board?
I do have the display echoing the standard output, so code would be executing in those display classes. I just commented out that part of the code so I'm ONLY testing the async tasks. If this doesn't crash on me, then it would have to be the display causing this. If it does crash, one other idea is that I am using the Web Workflow to see the serial output. Perhaps it's the Wi-Fi classes causing it.
As for powering the board, I'm just using a normal USB-C wall wart for a Raspberry Pi that's powering the ESP32-S2, so plenty of amps coming out of that power supply as it's a 3.5A one I had lying around.
I'll let you know what I see over the next 24 hours.
I've switched my monitoring to the web workflow serial terminal as well....
The UM Feather S2 has now been running 52 hours without crashing. Somewhere between 25 and 52 hours the Adafruit Feather S2 with the Lolin S2 Mini firmware crashed but I didn't capture any information as a terminal wasn't connected at the time. Before I build a debug image and try and capture a traceback, maybe I'll try a Lolin build with CIRCUITPY_ESP_FLASH_FREQ set to 40m.
I had another crash last night. I did not run any display code. So, this was a clean run of just the asyncio.
The UM Feather S2 has now been running for 6 days without crashing. I restarted the Adafruit Feather S2 running the Lolin firmware 3 days ago and it hasn't crashed either.
The only time I've seen a crash is when the terminal has been disconnected so I decided to use your logging routine and see if I could recreate the crashes without the serial terminal session. It turns out your code won't run under 8.0.5 because the safemode.py feature isn't implemented in the 8.0.5 line yet.
I'll rebuild my tests using the 8.1.0 build and see if I can reproduce the issue there.
Ok, that's promising to me actually. It means there is not something deep in the bowels of CircuitPython or ESP-IDF causing this. Maybe I should just change my board and then I won't have crashes anymore. Or perhaps I'll try to slow down that frequency on the Flash chip like you were commenting on a while ago.
I've been running my two S2 boards for another couple days without any luck reproducing the issue.
The Lolin board I ordered came in. Unfortunately, it turned out to be one of the likely counterfeit devices (https://forums.adafruit.com/viewtopic.php?t=197737), based on the silk screen and lack of PSRAM. I was able to build a custom CircuitPython binary which started the REPL; however, I couldn't run the test script for more than 15 to 20 minutes, and the REPL/Web Access interface would periodically hang for short periods of time.
I ordered the board through Walmart so it should be easy to return but I'm not sure it's worth re-ordering another board directly from China as it will take over a month to get here.
Interesting. I wonder if mine are counterfeit as well. I got them on Amazon, so not really sure. Either way, I was able to just deal with this problem by auto-rebooting on crash, having it run safemode.py, logging the error, doing another reboot to get into main.py and then proceed from there. I still see a reboot roughly every 12 hours or so, but it hasn't been a problem for my Marble Run project for a STEM lab at a high school. The kids are having plenty of fun with the final working circuit board that drives the marble elevator. So, all is good for now with the workarounds!
I'm glad you got something working :grin:. If you decide to test at the 40m flash speed, let me know how it goes, but it looks to me like the issue is specific to the Lolin board, so I don't think I can do much more at this point.
I doubt your boards are counterfeit, as the standard CircuitPython UF2s won't boot on a counterfeit board.
CircuitPython version
Code/REPL
Behavior
You get a standard output roughly every 10 seconds as the script runs.
Eventually you'll get a hard crash at random intervals. Usually this is around 6 to 12 hours. It happens more often if you add more tasks.
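For readers who want to reproduce this kind of load, the following is a hypothetical sketch of many concurrent asyncio tasks that each sleep and wake repeatedly. It is not the test script from this report. The task count of 30 comes from the discussion in this thread; the short period and bounded round count are assumptions so the example finishes quickly on desktop Python. A soak test on the board would loop indefinitely with a longer period.

```python
import asyncio


async def worker(task_id, period_s, rounds, counts):
    """Sleep and wake `rounds` times, recording each wake-up."""
    for _ in range(rounds):
        await asyncio.sleep(period_s)
        counts[task_id] += 1


async def run_load(num_tasks=30, period_s=0.01, rounds=3):
    """Start num_tasks workers concurrently and wait for all of them."""
    counts = [0] * num_tasks
    tasks = [
        asyncio.create_task(worker(i, period_s, rounds, counts))
        for i in range(num_tasks)
    ]
    await asyncio.gather(*tasks)
    return counts


counts = asyncio.run(run_load())
print(sum(counts), "wake-ups across", len(counts), "tasks")
```

Raising `num_tasks` is the knob that corresponds to the observation that more tasks make the fault appear sooner.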
To track these hard faults, I have safemode.py reboot the device back into normal operation but log the reboot. I also log all normal boots from main.py. Then I hand control over to the actual code posted at the top of this bug report.
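As an illustration of the logging step only, here is a hypothetical sketch of how one boot entry could be formatted so that it matches the log lines quoted in this thread. It is not the code from this report. `format_boot_entry` is an invented name, and the "Kill switch on" wording is an assumption, since only "off" appears in the logs. The function is plain Python so it can be tried off-device; the comments note where the values would likely come from on a board.

```python
def format_boot_entry(source, kill_switch_on, timestamp,
                      run_reason, reset_reason, safe_mode_reason):
    """Build one log line in the layout seen in the logs in this thread.

    timestamp is a time.struct_time-like tuple, e.g. from time.localtime().
    """
    year, month, day, hour, minute, second = timestamp[:6]
    stamp = "{:04d}-{:02d}-{:02d} {:02d}:{:02d}:{:02d}".format(
        year, month, day, hour, minute, second
    )
    switch = "on" if kill_switch_on else "off"  # "on" wording is assumed
    return "From {}. Kill switch {}. {}, {}, {}, {}".format(
        source, switch, stamp, run_reason, reset_reason, safe_mode_reason
    )


# On a CircuitPython 8.1 board the three reason strings would likely come from
# str(supervisor.runtime.run_reason), str(microcontroller.cpu.reset_reason) and
# str(supervisor.runtime.safe_mode_reason). After appending the line to a log
# file, safemode.py could call microcontroller.reset() to return to normal mode.
entry = format_boot_entry(
    "safemode.py",
    False,
    (2023, 5, 1, 21, 45, 6),
    "supervisor.RunReason.STARTUP",
    "microcontroller.ResetReason.SOFTWARE",
    "supervisor.SafeModeReason.HARD_FAULT",
)
print(entry)
```

Keeping the formatting in one function means main.py and safemode.py can share it and differ only in the `source` argument.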
main.py
safemode.py
reboot/r.py
And this is the super simple killswitch.py which you really don't need, but in case you read the code above you'd want to see this basic class.
Description
No response
Additional information
No response