Open Timeline8 opened 4 weeks ago
adafruit_thermistor is quite simple. It uses analogio.AnalogIn. I'm thinking there is an ADC problem.
Additional observation this morning. I put the S3 TFT Feather onto a breadboard so I could hook up physical thermistors. The only change to the code is that the resistors I used are in a 1k SIP, so the setup code for the two thermistors changed from 10k to 1k. It crashed in the 194th loop with a divide by zero error that I have been seeing periodically. If I simply try a CTRL-D or a Save in MU to soft boot, the code immediately crashes on the first loop. I repeated CTRL-D multiple times in a row to verify. Restarting the code via soft boot does not clear whatever is causing this failure out of memory. Hitting the board's RST button does restart the code properly; it then went over 400 loops before going to safe mode with the internal watchdog timer expired error. I have pasted the REPL output below, showing the divide by zero message in the 194th loop and then a subsequent Save (soft boot) in MU.
I did notice that on my S2 running this code (I also have it running on a Pico W without issue), I was taking samples for my averaging function every 50ms rather than the 20ms posted here. I changed my S3 TFT Feather to 50ms, but it still crashes with the WD timer expired Safe Mode message.
194 gc.mem_alloc()=5504
Average therm7 Reading : 23°C 75°F
Traceback (most recent call last):
File "code.py", line 40, in <module>
File "code.py", line 19, in get_average_temp
File "adafruit_thermistor.py", line 126, in temperature
File "adafruit_thermistor.py", line 116, in resistance
ZeroDivisionError: division by zero
Code done running.
Press any key to enter the REPL. Use CTRL-D to reload.
Adafruit CircuitPython 9.0.4 on 2024-04-16; Adafruit Feather ESP32-S3 TFT with ESP32S3
>>> [D
...
>>>
soft reboot
Auto-reload is on. Simply save files over USB to run them or enter REPL to disable.
code.py output:
1 gc.mem_alloc()=4768
Traceback (most recent call last):
File "code.py", line 35, in <module>
File "code.py", line 19, in get_average_temp
File "adafruit_thermistor.py", line 126, in temperature
File "adafruit_thermistor.py", line 116, in resistance
ZeroDivisionError: division by zero
Code done running.
The division by zero error is:
if self.high_side:
    # Thermistor connected from analog input to high logic level.
    reading = self.pin.value / 64
    reading = (1023 * self.series_resistor) / reading
if the analog pin value is 0, then it divides by 0.
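One way to keep a single zero reading from stopping the whole loop is to check the raw value before dividing. The sketch below reproduces only the two quoted lines of arithmetic. `scaled_high_side_reading` is a hypothetical helper, not part of adafruit_thermistor, and returning None for a zero reading is an assumption about how the caller would want to handle it.

```python
def scaled_high_side_reading(raw_value, series_resistor):
    """Apply the two quoted lines of arithmetic to a 16-bit ADC value.

    Hypothetical helper, not part of adafruit_thermistor. Returns None
    instead of raising ZeroDivisionError when the ADC reports 0.
    """
    reading = raw_value / 64  # scale the 16-bit value down to 10 bits
    if reading == 0:
        return None  # a 0 reading would otherwise divide by zero
    return (1023 * series_resistor) / reading


# A mid-scale reading gives a finite result; a zero reading is skipped.
print(scaled_high_side_reading(32768, 1000))
print(scaled_high_side_reading(0, 1000))
```

A caller that averages several samples could simply drop the None results. This only hides the symptom, though: a zero reading from a pin sitting at mid-rail still points at the ADC.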
Which is what I thought might be happening when I was running with no hardware connected to the pins, where random noise at the open pins could result in a zero reading. This is why I made sure to add the thermistors and resistors this morning, so there should be no way to get 0V at the pin, and I still experienced the divide by zero fault. It ran the loop 193 times with valid room temperature readings in the mid/upper 70s °F for both thermistors, then 3 seconds later on loop 194 it read the first thermistor and failed reading the second. That doesn't seem like a hardware problem with my simple circuit.
But that is kind of beside the point. I cannot get this code to run for any length of time without some sort of crash on an ESP32-S3, whereas the same code on an S2 and on a Pico W has run nonstop, day in and day out, for weeks now. With the S3 it usually goes into safe mode with an internal WD timer expired failure, with the board disconnecting and sometimes reconnecting. Sometimes it is the divide by zero. Still other times the board disconnects and the Neopixel goes steady white while the code keeps running: I can see the measurements being reported every 3 seconds on the TFT, but the Neopixel part of the code obviously does nothing anymore.
One last test. I removed my thermistors and replaced them with a second 1k SIP so that the A0 and A1 pins get a fixed resistor voltage divider; I verified with my meter that both pins are at 1.65V. I reran the code as is (so the temperature reported is 94°C 211°F, since I didn't change the thermistor line from 10k@25°C). It managed to run over half an hour before it bombed out with the "Internal watchdog timer expired." safe mode message. At least no divide by zero.
Just to be clear, you are getting things like the watchdog crash or divide by zero on 9.1.0-beta.3 as well as 9.0.x, right? We moved to a new version of ESP-IDF for 9.1.0, and I want to make sure that version still has the issue.
I am fairly sure I did but re-reading my tedious notes above, I see I don't have an entry stating that. However I did note I ran the beta in the forum discussion. But I will double check tonight as the Waveshare should still be running the 9.1.0-beta.3. On the Adafruit S3 TFT I only ran that one on 9.0.4 as I thought it might be the version on the two S2s I have running this code for months without issue but checking one of the S2s, it is at 9.0.3. I can upgrade the S3 TFT tonight to 9.1.0-beta.3 and retest it as well. And I got a notice yesterday that my backorder from Digikey for the QtPy S3 I ordered has shipped, so when I get it I will load 9.1.0-beta.3 on that one as well and see how it acts (my S3 REV TFT is sadly still on backorder). I will report everything I find.
Do you think there is any value loading 9.0.3 onto any of the S3 boards since the S2 boards run fine with it? Or do you think this is an S3 specific issue and we should only be looking upward and onward with current versions only?
Based on your testing, I think this is an S3-specific issue. S3 with 9.1.0-beta.3 is the only test still to do, I would say. If that's a fix, great, otherwise I will set up an S3 board and let it run for hours.
2024-06-04
Waveshare - verified it was still on the beta that I last ran it. From boot_out.txt… Adafruit CircuitPython 9.1.0-beta.3 on 2024-05-22; Waveshare ESP32-S3-Zero with ESP32S3
Running with the TFT to see the code running, using 2x 1k SIP packages to create a voltage divider at pins D7 & D8 to simulate the thermistors. Verified with a meter that each pin had 1.64V while running.
Looped just over 100 times (~5 minutes) before the board disconnected from my iMac. MU and MacOS both reported disk ejection, the Neopixel changed to steady white, and the board did not reconnect (MU serial window closed and the icon shows no board connected). However, the TFT shows the code continuing to run normally, and as I type this it is at loop 180.
S3 TFT - Ran “circup update --all” first to get the suggested link for the newest version of CP and also updated the libraries. Downloaded the beta and loaded it. From boot_out.txt… Adafruit CircuitPython 9.1.0-beta.3 on 2024-05-22; Adafruit Feather ESP32-S3 TFT with ESP32S3
Hard reset board (pulled USB cable). Reduced code from the Waveshare because the external TFT code is not required. Same 2x 1k SIP resistors to create a resistor divider voltage at A0 and A1. Verified 1.57V at each pin while running.
Ran for 45+ minutes and last I checked it was over 700 loops. I came back a little later and it had disconnected, but had then reconnected and restarted. The loop was up to 170+ on the TFT. Neopixel was acting normally per the code.
CTRL-C, ejected board, power cycled it, and restarted. So far this second run it is behaving itself and has made it over an hour and at loop 1385.
And to add to the end of the last post: sometime between about 11pm and 2am the S3 Feather disconnected and reconnected 4 times and finally stopped in Safe Mode with an "Internal watchdog timer expired".
On either board, do you have a settings.toml with CIRCUITPY_WIFI_SSID and CIRCUITPY_WIFI_PASSWORD? That will connect to the wifi network. In other words, wifi will be active even if not mentioned in the test program.
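A quick way to see whether a board will bring up wifi on its own is to check for those two keys at the top of code.py. This is a minimal sketch: `wifi_autoconnect_configured` is a hypothetical helper name, and it assumes that CircuitPython's os.getenv() returns values from settings.toml (on desktop Python the same call reads environment variables, which is enough to try the logic).

```python
import os


def wifi_autoconnect_configured(getenv=os.getenv):
    """Return True if both wifi settings are present and non-empty.

    Hypothetical helper. The getenv parameter exists only so the check
    can be exercised without a real settings.toml.
    """
    ssid = getenv("CIRCUITPY_WIFI_SSID")
    password = getenv("CIRCUITPY_WIFI_PASSWORD")
    return bool(ssid) and bool(password)


print("wifi auto-connect configured:", wifi_autoconnect_configured())
```

Printing this once at startup makes it easy to record in test notes whether a given run had wifi active.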
It certainly sounds like the boards are hard-crashing and resetting spontaneously. I will try a very simple test that simply reads as fast as possible from the ADC.
Yes, I still have the WiFi settings auto connecting like that. We did talk about disabling that over at the forums but I don't think I tried it. So that is now on my to-do list for tonight: Run both the Waveshare S3 and Feather TFT S3 without the .toml file.
If there is a conflict between the two (ADC and WiFi), that will be disappointing for me since I need both for my application, so I won't be able to simply not use WiFi. But the problem can't be fixed later if we don't narrow it down to a root cause, so I will try it sans WiFi.
Also just received my QtPy S3 today. So if the other two crash relatively quickly with the WiFi off, I can try the QtPy. While I wouldn't think the model board matters as they are all S3, I have noticed that the Waveshare is good about crashing sooner while the Feather likes to wait until later.
I was just testing on a QT Py ESP32-S3, with the very simple test program below. With CIRCUITPY_WIFI_SSID and CIRCUITPY_WIFI_PASSWORD commented out in settings.toml, it runs indefinitely, with millions of conversions. With them not commented out (so web set up in advance), it crashes hard almost immediately. I am using board.A2, which uses ADC1 on the QT Py S3.
So now I know how to reproduce this. It is strange because ADC2 is supposed to be shared with WiFi, not ADC1. But maybe something is interrupting the conversion in some bad way.
No need for you to test further at this point. Thanks for persevering through this.
Test program with two 1kohm resistors forming a 3.3/2 voltage divider connected to pin A2:
import analogio
import board
a2 = analogio.AnalogIn(board.A2)
count = 0
while True:
    count += 1
    if count % 100000 == 0:
        print("count", count)
    v = a2.value
    if v < 32000 or v > 33000:
        print(count, v)
Testing with A0, which is an ADC2 pin, I also get crashes rather quickly, and sometimes get safe-mode "Internal watchdog timer expired".
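The 32000 to 33000 window in the test program follows from the divider arithmetic: two equal resistors put the pin at half the supply, and AnalogIn.value is scaled to 16 bits. The sketch below works that out. `expected_raw` and `in_window` are hypothetical helper names, and it assumes an ideal ADC whose full scale (65535) corresponds to the same rail that feeds the divider.

```python
def expected_raw(r_top, r_bottom, full_scale=65535):
    """Ideal 16-bit AnalogIn.value for a divider across the ADC reference."""
    return full_scale * r_bottom / (r_top + r_bottom)


def in_window(value, low=32000, high=33000):
    """Mirror the test program's check: True when a reading is not flagged."""
    return not (value < low or value > high)


ideal = expected_raw(1000, 1000)  # two 1 kohm resistors
print(ideal, in_window(ideal))
```

Any reading the test program prints is therefore at least a few hundred counts away from the ideal midpoint, beyond what resistor tolerance alone would normally explain.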
Hi Dan, Thank you very much for confirming this. After I first reported this on the forums, and after days of replies there and here in which no one seemed interested in actually running the code to see if it could be reproduced, I was starting to wonder if it was me and everyone was just being polite by not telling me I'm the idiot. ;)
You and I actually had a conversation on the forums on the Wifi vs ADC about two months ago because I had read about the possible interference between the two and was concerned if I should be making a point to use ADC1 to avoid that. Regardless, it looks like your testing shows this is a different issue since you got the crash on both ADC1 & 2.
I guess for now I will have to proceed with my projects targeting S2 boards. The project I am slowly developing as I learn CP, and add features to as I go, depends on both ADC and WiFi for IO Feeds.
@Timeline8 Rest assured we were interested, but if we can delegate some testing to eliminate possibilities, then we try that. (We have all too many bugs to look at :slightly_smiling_face:). For instance, was displayio involved or not? And I was hoping it was really fixed in 9.1.0-beta.3, since we'd upgraded the underlying Espressif software (ESP-IDF) in that release. I also spent time looking for similar reports in the ESP-IDF repo, but could not find any.
I hope we can figure this out soon, because broken ADCs when wifi is in use is a pretty serious limitation. As your testing indicates, the S2 boards could be a substitute. If you are interested in more precise temperature readings, then you could use an external I2C ADC breakout board. If you are measuring ambient air temperature (not liquids), then one of the I2C temperature breakouts (there are many) could be used. But the easiest would be to fix ESP32-S3, of course.
I just ran into this with the cardputer. Spamming the adc with wifi connected produces instant resets, watchdog safemodes and hangups.
I just finished the battery driver, which uses IO10 of the ESP32-S3. When connected to a network, after about 30 seconds it does one of the following:
The polling rate was 30 samples/s.
The code accounts for None reads.
Attempted this patch, which reduces the points of failure:
--- a/ports/espressif/common-hal/analogio/AnalogIn.c
+++ b/ports/espressif/common-hal/analogio/AnalogIn.c
@@ -115,10 +115,10 @@ uint16_t common_hal_analogio_analogin_get_value(analogio_analogin_obj_t *self) {
#endif
uint32_t adc_reading = 0;
- size_t sample_count = 0;
+ int sample_count = 0;
// Multisampling
esp_err_t ret = ESP_OK;
- for (int i = 0; i < NO_OF_SAMPLES; i++) {
+ while (sample_count < NO_OF_SAMPLES) {
int raw;
ret = adc_oneshot_read(adc_handle, channel, &raw);
if (ret != ESP_OK) {
@@ -127,9 +127,6 @@ uint16_t common_hal_analogio_analogin_get_value(analogio_analogin_obj_t *self) {
adc_reading += raw;
sample_count += 1;
}
- if (sample_count == 0) {
- raise_esp_error(ret);
- }
adc_reading /= sample_count;
// This corrects non-linear regions of the ADC range with a LUT, so it's a better reading than raw
It didn't work. I think the error is in esp-idf.
I will attempt to create a minimal esp-idf application implementing oneshot adc and wifi.
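For readers who don't want to parse the diff, the multisampling logic visible in it (loop up to NO_OF_SAMPLES, sum the raw values, count them, refuse to divide by a zero count) can be sketched in Python. This is an illustration only: `read_once` is a hypothetical callable standing in for adc_oneshot_read, the default of 64 samples is an assumption because the value of NO_OF_SAMPLES is not shown, and skipping a failed read is an assumption because the diff elides what happens when ret != ESP_OK.

```python
def average_successful_reads(read_once, samples=64):
    """Average the reads that succeed; raise if none of them did.

    read_once is a hypothetical callable that returns an int, or None
    when a single conversion fails.
    """
    total = 0
    count = 0
    for _ in range(samples):
        raw = read_once()
        if raw is None:
            continue  # skip a failed conversion instead of counting it
        total += raw
        count += 1
    if count == 0:
        # Same idea as the sample_count == 0 guard seen in the diff.
        raise RuntimeError("no successful ADC reads")
    return total // count


values = iter([100, None, 300, None])
print(average_successful_reads(lambda: next(values), samples=4))
```

The bounded for loop always terminates, whereas a loop that waits until enough reads have succeeded can spin forever if every read fails.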
Yep, fun stuff:
I (21616) EXAMPLE: ADC1 Channel[3] Raw Data: 562
I (21616) EXAMPLE: ADC1 Channel[3] Cali Voltage: 481 mV
I (21626) EXAMPLE: ADC2 Channel[0] Raw Data: 537
E (21626) task_wdt: Task watchdog got triggered. The following tasks/users did not reset the watchdog in time:
E (21626) task_wdt: - IDLE0 (CPU 0)
E (21626) task_wdt: Tasks currently running:
E (21626) task_wdt: CPU 0: main
E (21626) task_wdt: CPU 1: IDLE1
E (21626) task_wdt: Print CPU 0 (current core) backtrace
Backtrace: 0x4200B48F:0x3FC93680 0x4200B8AC:0x3FC936A0 0x40377331:0x3FC936D0 0x42007582:0x3FC992B0 0x42008447:0x3FC992E0 0x42006FB1:0x3FC99300 0x42006B22:0x3FC99320 0x4200F58A:0x3FC99340 0x4200EDF5:0x3FC99360 0x4200EE52:0x3FC99380 0x4200F491:0x3FC993B0 0x420134C3:0x3FC993E0 0x42012E86:0x3FC99400 0x4200F601:0x3FC99720 0x4201B431:0x3FC99750 0x403804C9:0x3FC99780 0x42008C26:0x3FC997D0 0x4201AC97:0x3FC99840 0x4037AD1D:0x3FC99870
I (21626) EXAMPLE: ADC2 Channel[0] Cali Voltage: 467 mV
I (21706) EXAMPLE: ADC1 Channel[2] Raw Data: 732
I (21716) EXAMPLE: ADC1 Channel[2] Cali Voltage: 620 mV
I (21716) EXAMPLE: ADC1 Channel[3] Raw Data: 543
Stock idf 5.2.1, untouched oneshot adc example.. I didn't even setup wifi in it.. The nvs partition may have connected it automatically, idk.
possibly related? https://github.com/espressif/esp-idf/issues/12466
Using a YD-ESP32-S3 which is a dual usb-C board for this, which is excellent for debugging and built some debug builds.
import wifi, board, analogio;wifi.radio.connect("SSID", "PASSWD");a=analogio.AnalogIn(board.GPIO10)
while True:
    a.value
This reliably crashes it. Maximum 15s.
Decoded backtrace of debug build (clean tree, current master):
0x4037cc7a: ram_chip_i2c_readReg at ??:?
0x40378aa0: regi2c_ctrl_write_reg_mask at /home/bill88t/git/circuitpython/ports/espressif/esp-idf/components/esp_hw_support/regi2c_ctrl.c:46
0x420a6761: adc_ll_calibration_init at /home/bill88t/git/circuitpython/ports/espressif/esp-idf/components/hal/esp32s3/include/hal/adc_ll.h:790
(inlined by) adc_hal_calibration_init at /home/bill88t/git/circuitpython/ports/espressif/esp-idf/components/hal/adc_hal_common.c:92
0x420a09db: adc_oneshot_read at /home/bill88t/git/circuitpython/ports/espressif/esp-idf/components/esp_adc/adc_oneshot.c:174
0x42042937: common_hal_analogio_analogin_get_value at /home/bill88t/git/circuitpython/ports/espressif/common-hal/analogio/AnalogIn.c:123
0x42039455: analogio_analogin_obj_get_value at /home/bill88t/git/circuitpython/ports/espressif/../../shared-bindings/analogio/AnalogIn.c:101
0x42014f0a: fun_builtin_1_call at /home/bill88t/git/circuitpython/ports/espressif/../../py/objfun.c:68
0x4200eaf5: mp_call_function_n_kw at /home/bill88t/git/circuitpython/ports/espressif/../../py/runtime.c:725
0x4200ecdf: mp_convert_member_lookup at /home/bill88t/git/circuitpython/ports/espressif/../../py/runtime.c:1183
0x4200ee19: mp_load_method_maybe at /home/bill88t/git/circuitpython/ports/espressif/../../py/runtime.c:1253
0x4200ee2e: mp_load_method at /home/bill88t/git/circuitpython/ports/espressif/../../py/runtime.c:1262
0x4200eeed: mp_load_attr at /home/bill88t/git/circuitpython/ports/espressif/../../py/runtime.c:1071
0x4201f549: mp_execute_bytecode at /home/bill88t/git/circuitpython/ports/espressif/../../py/vm.c:437
0x420150e1: fun_bc_call at /home/bill88t/git/circuitpython/ports/espressif/../../py/objfun.c:273
0x4200eaf5: mp_call_function_n_kw at /home/bill88t/git/circuitpython/ports/espressif/../../py/runtime.c:725
0x4200eb0a: mp_call_function_0 at /home/bill88t/git/circuitpython/ports/espressif/../../py/runtime.c:699
0x4206e056: parse_compile_execute at /home/bill88t/git/circuitpython/ports/espressif/../../shared/runtime/pyexec.c:152
0x4206e45d: pyexec_friendly_repl at /home/bill88t/git/circuitpython/ports/espressif/../../shared/runtime/pyexec.c:748
0x4202491b: run_repl at /home/bill88t/git/circuitpython/ports/espressif/../../main.c:946
0x42024f67: main at /home/bill88t/git/circuitpython/ports/espressif/../../main.c:1084 (discriminator 1)
0x42026cca: app_main at /home/bill88t/git/circuitpython/ports/espressif/supervisor/port.c:503
0x4215a854: main_task at /home/bill88t/git/circuitpython/ports/espressif/esp-idf/components/freertos/app_startup.c:208
For this crash, the reason was: Guru Meditation Error: Core 1 panic'ed (Interrupt wdt timeout on CPU1).
possibly related? espressif/esp-idf#12466
I feel like it is. I think memory corruption takes place.
I only sometimes get a coredump. Sometimes usb just dies and debug serial, 5 seconds after usb has died, says that:
I (54623) wifi:bcn_timeout,ap_probe_send_start
W (54625) CP wifi: event 21 0x21
I (57131) wifi:ap_probe_send over, resett wifi status to disassoc
I (57132) wifi:state: run -> init (c800)
I (57133) wifi:pm stop, total sleep time: lu us / lu us
I (57136) wifi:new:<1,0>, old:<1,0>, ap:<255,255>, sta:<1,0>, prof:1
W (57145) CP wifi: disconnected
W (57146) CP wifi: reason 200 0xc8
I (57149) CP wifi: Retrying connect. 4 retries remaining
W (59593) CP wifi: disconnected
W (59593) CP wifi: reason 201 0xc9
As if it didn't crash.
@dhalbert, no problem on me doing some of the upfront testing. I know from my own experiences learning CircuitPython and also reading other people pleas for help, 99% of the time it is user error. So no harm on pushing back a bit on the user to kind of "prove it".
As for my project, it is for aquarium monitoring and eventually some controls later. Therefore I am sensing water temperature. Thermistors are well suited for this. Simple to use, and easy to waterproof if needed. I would love to see the built-in ADC of the S3 back on track again, but the S2 is just as capable (don't need dual cores to read a temperature once every five minutes) so I still have a path forward.
@Timeline8: have you thought about using DS18B20-sensors? They are available in a waterproof enclosure. And they are easy to use.
I have good news and bad news.
Good news: I fixed the watchdog timer crash. Bad news: It's 2 bugs. It now 100% does the weird crash, where usb dies, but the core keeps running.
I don't have a backtrace to work with now. Also, my improvements make the ADC 100x faster, so it now crashes in less than a second.
I basically moved all the init and deinit stuff where it should be. The constant calibration data init was triggering the watchdog crash.
Moving the ADC sampling into the core also doesn't fix this. There is nothing that can be done from the CircuitPython side to work around this.
This needs to be fixed from ESP-IDF.
Here is my experimental patch with an updated ADC api that internalizes the sampling. (So that python doesn't loop over adc reads)
@bill88t Can you submit the state saving as a PR? Did you find any other reports of an ESP-IDF bug, in addition to what @jepler found?
No I did not find any more ESP-IDF bugs matching this. Though, I didn't spend much time on the issue tracker.
EDIT: WRONG (see below):
New very interesting discovery:
The quick crash on ADC1 pins only happens when the board is a 4MB flash / 2MB PSRAM board. I tried an 8/0 QT Py ESP32-S3 and it does not crash. @Timeline8's boards are all 4/2, and I was testing on a QT Py ESP32-S3 4/2 when I reproduced the problem.
Cardputer is 8/0 and crashes just fine? The YD-ESP32-S3 which is 16/8, also explodes to pieces.
Maybe some other factor is at play? Chip revision, temperature?
so the commonality is no PSRAM .. what if you make a no-psram build on one of the 4/2s, does the problem go away? if so, maybe there's some object that needs to be located in main RAM not PSRAM?
EDIT by @dhalbert: my testing was wrong
Cardputer is 8/0 and crashes just fine?
Are you testing an ADC1 or an ADC2 pin on the Cardputer?
I will do further testing on various boards tomorrow.
Are you testing an ADC1 or an ADC2 pin on the Cardputer?
BAT_ADC is wired to IO10 which is ADC1_CH9.
Leaving this here for quick reference.
My current working theory is that there is some critical period of time where wifi and adc may try to write to memory concurrently somehow.
Adc can trigger this scenario from the calibration data init and from .value.
It may be that slower, /2 (esp32-s3r2) memories trigger this a lot more often.
My 4/2 vs 8/0 distinction was wrong. I had neglected to uncomment CIRCUITPY_WIFI_SSID and CIRCUITPY_WIFI_PASSWORD on the ESP32-S3 8/0 board. When I enabled initial wifi on the 8/0 board, it crashed in the same way. I edited the posts above to indicate that.
@bablokb, Yes I am aware of the DS18B20, but I just happen to have a LOT of waterproof thermistors on hand, so that is one reason. No matter how cheap I can find DS18B20s, no one is going to beat FREE. 😉 While the DS18B20 is fairly simple to use once you get it set up, even it is no match for the dead-simple lowly thermistor. Plus I have seen enough reports of less-than-reputable suppliers (counterfeits?) of DS18B20s on the market to give me pause.
But fundamentally, the ADC on the S3 should work. I hate having to work around a function that is promoted as one of the features of a product only to find out it doesn't work. I have to imagine there must be a lot of users out there with various ADC applications running on S3 MCUs getting nuisance failures with no idea why. I hope it doesn't turn them off from ever using the ESP32 line again if they perceive them as unreliable.
Some more debugging:
Often USB disconnects, but the program is still running, if it doesn't hit a watchdog reset. I verified this by blinking the LED periodically and by writing a count to UART. Another interesting thing is that when it disconnects from USB, the UART output, both on the UART I'm writing to and the debug UART, become gibberish, as if the speed has changed.
This hasn't happened to me. And I have run it 100 times by now.
Memory corruption most certainly. I think it starts executing arbitrary code. Ofc producing different random results every time.
USB-JTAG would be a blessing if it worked right now. I still don't have a jtag dongle for normal jtag, so yea.. 👍
I think it starts executing arbitrary code.
That's not what I'm seeing (using a Metro ESP32-S3)
import analogio
import busio
import board
import digitalio
led = digitalio.DigitalInOut(board.LED)
led.switch_to_output()
analog = analogio.AnalogIn(board.A5)
uart = busio.UART(tx=board.TX, baudrate=115200)
count = 0
while True:
    count += 1
    if count % 100000 == 0:
        print("count", count)
    if count % 100 == 0:
        led.value = not led.value
        uart.write(str(count).encode())
        uart.write(b' ')
    v = analog.value
The LED keeps blinking sometimes, other times it eventually restarts in safe mode.
With my example:
while True:
    adc.value
It either halts or safemodes. I should note, I only tested from REPL.
If proc'ed by my os and the "halt" scenario occurs, the neopixel, instead of heartbeating in 2 different blues, it becomes 255-value white (not the REPL white). USB in this scenario is disconnected.
Also, I have already determined that the watchdog timeout is caused by the calibration init functions. Under the adc pr, it doesn't happen, since those only run once during init.
If proc'ed by my os and the "halt" scenario occurs, the neopixel, instead of heartbeating in 2 different blues, it becomes 255-value white (not the REPL white). USB in this scenario is disconnected.
I would see this mode when the failure was not reporting an internal watchdog timer expired. When I had the TFT display hooked up or running my S3 TFT Feather, the board disconnects from USB, the Neopixel turns solid white, and no reconnection of the board to USB. However the TFT shows the code continuing to run and display my thermistor based temperatures. Note that in my code I was cycling through Neopixel colors 3 times between thermistor readings so it is interesting that the crash takes over the Neopixel leaving it white but otherwise the code is still running.
Fails on 8.2.7 as well.
ESP32-S3 TFT Feather example:
from time import sleep
import board, analogio, digitalio
from neopixel_write import neopixel_write
import pwmio
state_a = bytearray([0, 8, 0])
state_b = bytearray([0, 0, 8])
adc = analogio.AnalogIn(board.D10)
pwm = pwmio.PWMOut(board.A4)
pwm.duty_cycle = 32767 # 50%
nx = digitalio.DigitalInOut(board.NEOPIXEL)
nxp = digitalio.DigitalInOut(board.NEOPIXEL_POWER)
nx.switch_to_output()
nxp.switch_to_output()
nxp.value = 1
print("Ripping!")
while True:
    neopixel_write(nx, state_a)
    for _ in range(30):
        adc.value
    neopixel_write(nx, state_b)
This example produces a 500Hz signal on board.A4.
When the "halt" white-neopixel scenario occurs, the pwm signal frequency changes from 500Hz to ~192Hz indicating that the system clock breaks down.
The duty cycle somehow manages to remain at 50%.
This happens regardless of the power source, be it USB or Li-Po.
This at the very least confirms a few things:
From reading: https://docs.espressif.com/projects/esp-idf/en/v5.2.2/esp32/api-reference/peripherals/adc_oneshot.html and https://docs.espressif.com/projects/esp-idf/en/v5.2.2/esp32/api-reference/peripherals/clk_tree.html we can tell some clock source is used for adc, but nowhere I can find on the idf does it actually touch the clocks.
I read components/esp_adc/adc_continuous.c and components/esp_adc/adc_oneshot.c.
I think it may be best to switch to the continuous api.
In the same style as it's now, initializing the ADC during get_value and getting a bunch of values.
A few more tests:
CONFIG_ESP_DEFAULT_CPU_FREQ_MHZ_160=y instead of 240 MHz: no help
CONFIG_ADC_ONESHOT_CTRL_FUNC_IN_IRAM=y: no help
CONFIG_FREERTOS_UNICORE=y, so that only one core is used: problem seems to go away. That would explain why ESP32-S2, which has only one core, doesn't have the problem.
Then this also means that it's not some lock-not-being-updated issue, but that whatever function runs on the other core is not thread safe and causes the crash.
And that "whatever function" is most certainly a wifi function. (Since this happens when connected) Somewhere an rtc_spinlock is missing.
I went through all the files containing the string rtc_spinlock and extended the range of the critical code.
I also added it in a lot of places.
For the moment I assume all the spinlocks work as intended, across cores. All of the tests were performed with clean builds.
Nothing fixed it.
Not even switching to portENTER_CRITICAL_SAFE.
No wifi code explicitly spinlocks rtc, but instead leaves components/esp_hw_support/adc_share_hw_ctrl.c to do the adc stuff.
I don't think I can find the actual fix.
I will test how viable CONFIG_FREERTOS_UNICORE=y is.
CONFIG_FREERTOS_UNICORE has literally 0 downsides as far as I can tell.
Measured a 1.1% perf difference on a division benchmark, with the unicore build being the faster one. There were no other differences across the builds. (Other than that, indeed, the crashes went away.)
So I think, at least temporarily, we should go with that, and open an upstream ESP-IDF issue to have it resolved properly. (I can go do it if you want me to.)
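The benchmark itself is not shown in the thread. For anyone who wants to compare a dual-core and a unicore build themselves, a division benchmark could look roughly like the sketch below. `division_benchmark` and its iteration count are hypothetical choices, and time.monotonic() is assumed to be precise enough for a coarse comparison.

```python
import time


def division_benchmark(iterations=100000):
    """Time a loop of floating-point divisions; returns elapsed seconds."""
    x = 1.0
    start = time.monotonic()
    for i in range(1, iterations + 1):
        x = (x + i) / 3.0  # one division per iteration
    return time.monotonic() - start


print("elapsed:", division_benchmark())
```

Running it several times on each build and comparing averages helps keep run-to-run noise from being mistaken for a real difference.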
Here are some other apparently related issues. These appear to be about running the ADC in tight loops, which causes watchdog errors. The original test in this issue does 20ms waits between reads, but still has problems. https://github.com/espressif/esp-idf/issues/12466 (as mentioned by @jepler) https://github.com/espressif/esp-idf/issues/8753 https://github.com/espressif/arduino-esp32/issues/6549 (same author as 8753)
I think we may have good news! I tested the Metro ESP32-S3 build from PR #9325 (Espressif BLE), which also upgrades ESP-IDF to 5.2.2. My test program no longer fails. I put back 9.0.5 and it fails again as usual. This is very encouraging!
There isn't something completely obvious in the ESP-IDF release notes that points out a fix for this specific problem, but I'll take it :slightly_smiling_face: .
This fixes the temperature thing too. It's a complete fix as far as I can tell. Will leave my cardputer running watch -n 0.01 sensors all day.
I merged it downstream to the adc_paranoia branch and it played nicely with massive sample sizes too (65500 samples per poll).
Will this be fixed under the next beta release? 9.1.0-beta.4 or 9.1.1 or whatever the next release will be?
You can download the "Absolute Newest" build from the downloads page for your board. That will have the fix, along with any future 9.1.x releases.
CircuitPython version
Code/REPL
Behavior
Various failures, but the crashes usually share the following: MU pops up “Could not find an attached drive”, Mac OS pops up “Disk Not Ejected Properly”, and MU of course has closed the serial window so there is nothing to see. Printing gc.mem_alloc() with each loop in my code shows allocated memory in the 4000-8000 range, so no apparent runaway memory issues.
Sometimes the board will disconnect and come back, and the code stays running but the Neopixel is steady white like it is in the REPL. Other times it crashes with 3x yellow blinking (Safe mode) and reports an internal watchdog timer expired.
I have an S2 board that is on 9.0.4 and has been running this code for many weeks and sending the data to an IO feed. No chronic crashes like the S3 boards.
Description
What follows is the long list of notes I have been taking as I tried different things. But the above, in behavior, is the executive summary. Below is tedious reading. Sorry...
Testing notes:
Waveshare ESP32-S3 Zero running 9.0.5 and libraries updated via Circup is starting with the “code chooser” code discussed here https://forums.adafruit.com/viewtopic.php?t=210926 starting with the 6th post down.
Code I am running (“choosing”) is a dual thermistor reading in a roughly 3+ second long loop that reads two thermistors and then changes the color of the Neopixel 3 times once per second.
Crashes share in common: MU pops up “Could not find an attached drive”, Mac OS pops up “Disk Not Ejected Properly”, and MU of course has closed the serial window so there is nothing to see. Printing gc.mem_alloc() with each loop in my code shows allocated memory in the 4000-8000 range, so no apparent runaway memory issues.
First time I had no display configured so I could not see an error and the Neopixel wasn’t indicating any activity. Added code to display the serial window on an external display. Reset board with reset button. I failed to note if the drive had reloaded itself after the crash and before resetting the board.
Second time same “crash”. Observed the Neopixel to be constant white indicating it was in the REPL, however the external display showed the code was still running and getting valid thermistor readings. Crash happened around 290-300 loops. CIRCUITPY remounted its drive I believe, but I am not certain. I let it run for a while longer then reset the board with the reset button.
Third time crashed at loop 272, this time stopping and Neopixel flashing yellow in three blink bursts (safe mode). Reopening MU serial window, failed due to ”Internal watchdog timer expired.” Noted for sure that the CIRCUITPY drive had remounted. Ejected drive and power cycled the board by unplugging USB cable.
Fourth time crashed at loop 233 (gc.mem_alloc at 5568). Same as third run with code stopped, three yellow flashes, and “Internal watchdog timer expired” in the reopened MU serial window.
Switching gears… Renamed the “code chooser” program from code.py and made my thermistor code code.py so it will load and run directly without the chooser resetting the MCU. Also power cycle reset the board.
Different type of crash this time. At loop 53 (mem = 5168). Drive did not unmount and the error in the REPL is
Traceback (most recent call last):
File "code.py", line 66, in <module>
File "code.py", line 46, in get_average_temp
File "adafruit_thermistor.py", line 126, in temperature
File "adafruit_thermistor.py", line 116, in resistance
ZeroDivisionError: division by zero
Odd. Normally I use 10k resistors with my 10k thermistor but this time I only had 1k resistors on hand. But I wouldn’t think that should matter. Source code for the library doesn’t indicate any restrictions on the resistor range. I believe this failure is just a result of random values when no thermistor is attached.
Ran again and it made it to run 72 but hit the same divide by zero error. Switched to 10k resistors. Hard reset. Made it to run 38, and the previously described crash scenario (disk eject & reconnect, safe mode with an “Internal watchdog timer expired” error) is back. Done for the night!
Next day. Backed up entire Waveshare CIRCUITPY drive. Ran one more time as is. Crashed with the Neopixel showing steady white (REPL indicator) but code was still running. MU and Mac OS both reported drive ejected. Board did not remount and MU doesn’t see it.
Adafruit REV TFT S2 Feather. Copied over all the files that were on the Waveshare. Also verified 9.0.5 and ran circup to verify all libraries were up to date (all were). Commented out all code that had anything to do with the external display. Thermistors on breadboard changed from D6 and D7 to A0 and A1. No failures after a few hours.
Switched back to Waveshare and ran as is. Eventually failed with the REPL white neopixel, ejected disk, but kept running. Drive did not remount. Did full reinstall of boot loader then 9.0.5. Copied over backed up files onto the MCU again. Hard power cycle reset. Restarted code. Crashed at cycle 288, 3 yellow blink safe mode and “Internal watchdog timer expired” and drive remounted.
Commented out all thermistor stuff and just ran the neopixel and gc memory allocation. Ran 13908 loops without issue (over 12 hours). Uncommented thermistor code and restarted the run (hard reset). Made it 457 loops (a little over 20 minutes) and crashed with the board disconnecting and the TFT fade to black and back in about 3 second pulses.
Restarted as is after getting home from work. Got to about 275, white NeoPixel, still running code, and disconnected. Moved it to a power supply connection only (not computer) and restarted. Looks like it crashed the same way with white Neopixel and code still displaying new lines.
Copied same drive contents to S3 TFT Feather running 9.0.4 and started it on the computer (no thermistors connected). S3 TFT Feather crashed, disconnected, reconnected and reports Safe Mode for Internal Watchdog timer expired. Restarted S3 TFT Feather. Dies same way.
Additional information
No response