adafruit / circuitpython

CircuitPython - a Python implementation for teaching coding with microcontrollers
https://circuitpython.org
Other
4.09k stars 1.21k forks source link

Feather ESP32 V2: Internal watchdog timer expired after switching to CP 9.0.0 #9066

Closed vladak closed 5 months ago

vladak commented 7 months ago

CircuitPython version

Adafruit CircuitPython 9.0.0-rc.0-7-g6a07412b07 on 2024-03-11; Adafruit Feather ESP32 V2 with ESP32

This also happens with CP 9.0.0 final release.

Code/REPL

see https://github.com/vladak/shield

Behavior

I have a simple setup with Feather ESP32 V2, running on battery power, periodically reading metrics from temperature/humidity sensors, publishing them to local MQTT broker and entering deep sleep. The Feather was running CP 8, however in order to make the library installations easier with circup, I upgraded to CP 9 and since then the board enters safe mode after running for half a day or so. This was not happening with CP 8 - the board could run for months without a problem. I can see that especially at night because it continues blinking green three times. When I connect console over USB I can see this:

Adafruit CircuitPython 9.0.0-rc.0-7-g6a07412b07 on 2024-03-11; Adafruit Feather ESP32 V2 with ESP32
>>> 
soft reboot

Auto-reload is off.
Running in safe mode! Not running saved code.

You are in safe mode because:
Internal watchdog timer expired.
Press reset to exit safe mode.

Press any key to enter the REPL. Use CTRL-D to reload.

I have to reset the board after this happens.

Normally the code produces messages like this:

rst:0x5 (DEEPSLEEP_RESET),boot:0x13 (SPI_FAST_FLASH_BOOT)
configsip: 271414342, SPIWP:0xee
clk_drv:0x00,q_drv:0x00,d_drv:0x00,cs0_drv:0x00,hd_drv:0x00,wp_drv:0x00
mode:DIO, clock div:1
load:0x3fff0030,len:312
load:0x40078000,len:14168
load:0x40080400,len:4
load:0x40080404,len:3268
entry 0x40080580
Serial console setup
Adafruit CircuitPython 9.0.0-rc.0-7-g6a07412b07 on 2024-03-11; Adafruit Feather ESP32 V2 with ESP32
Board ID:adafruit_feather_esp32_v2
UID:8EF9D6220881

Auto-reload is on. Simply save files over USB to run them or enter REPL to disable.
code.py output:
4.498: INFO - Running
5.552: INFO - No ath20 sensor found: No I2C device at address: 0x38
5.561: INFO - No library for the scd4x sensor
5.567: INFO - Connecting to wifi
5.572: INFO - Connected to XXX
5.581: INFO - Attempting to connect to MQTT broker 172.40.0.3:1883
5.624: INFO - Connected to MQTT Broker!
5.643: INFO - Temperature: 6.4 C
5.648: INFO - Humidity: 55.7 %
5.654: INFO - Battery capacity 88.80 %
5.661: INFO - Publishing to devices/terasa/shield
5.670: INFO - Published to devices/terasa/shield with PID 0
5.677: INFO - Running on battery power, breaking out
5.684: INFO - Going to light sleep for 10 seconds
15.692: INFO - Disconnected from MQTT Broker!
15.701: INFO - battery capacity more than 80, using shorter sleep of 60 seconds
15.711: INFO - Going to deep sleep for 60 seconds

Code done running.

Description

No response

Additional information

The code uses the watchdog module, however as I learned on https://forums.adafruit.com/viewtopic.php?p=994870#p994870 this is a case of internal RTOS watchdog and this should not be happening.

The complete code and description of my setup can be found on https://github.com/vladak/shield

I can test debug versions of CP.

RetiredWizard commented 7 months ago

I can't see how it would have anything to do with your issue, but it looks like you're overwriting your IMPORT_EXCEPTION variable after the import attempt. You probably want to test that the variable wasn't already initialized before setting IMPORT_EXCEPTION to None. I'm sure there's a better way but something like:

try:
    a = IMPORT_EXCEPTION
except:
    IMPORT_EXCEPTION = None
dhalbert commented 7 months ago

If you publish a lot more frequently to the local MQTT broker, does it fail faster?

We would be quite interested in a test program that is a minimal (fast) reproducible example. For instance, if you send fake data as opposed to actually reading the sensor, does it make any difference?

vladak commented 7 months ago

I can't see how it would have anything to do with your issue, but it looks like you're overwriting your IMPORT_EXCEPTION variable after the import attempt. You probably want to test that the variable wasn't already initialized before setting IMPORT_EXCEPTION to None. I'm sure there's a better way but something like:

try:
    a = IMPORT_EXCEPTION
except:
    IMPORT_EXCEPTION = None

Thanks. Maybe it would be sufficient to move the line with IMPORT_EXCEPTION = None above the try/except block ? Or use something like this ?

try:
    import wifi
    IMPORT_EXCEPTION = None
except MemoryError as e:
    # Let this fall through to main() so that appropriate reset can be performed.
    IMPORT_EXCEPTION = e
vladak commented 7 months ago

If you publish a lot more frequently to the local MQTT broker, does it fail faster?

We would be quite interested in a test program that is a minimal (fast) reproducible example. For instance, if you send fake data as opposed to actually reading the sensor, does it make any difference?

No, in fact is seems more stable. I have reduced the code to an endless loop that merely sends random data to the local broker:

    while True:                                                                 
        rnd_string = "".join(                                                   
            random.choice("abcdefghijklmnopqrstuvwxyz0123456789")               
            for i in range(random.randint(16, 128))                             
        )                                                                       
        data = {"time": time.monotonic_ns(), "data": rnd_string}                

        mqtt_topic = secrets[MQTT_TOPIC]                                        
        logger.info(f"Publishing to {mqtt_topic}")                              
        mqtt_client.publish(mqtt_topic, json.dumps(data))                       

        watchdog.feed()                                                         

        mqtt_timeout = 0.1                                                      
        logger.info(f"Waiting for MQTT event with timeout {mqtt_timeout} seconds")
        mqtt_client.loop(timeout=mqtt_timeout)

I removed the sensor readings and light/deep sleep. The try/except blocks were left intact. In order to let it run for long time I connected 10k mAh power bank to the charger. The code has been happily chugging along for more than ~a day~ two days (watching on the broker):

2024-03-22T23:14:01+0100 devices/terasa/shield: {"time": 104687703582770, "data": "0rv29dnsb5m9t86qxskocgau5"}
Client (null) received PUBLISH (d0, q0, r0, m0, 'devices/terasa/shield', ... (121 bytes))
2024-03-22T23:14:02+0100 devices/terasa/shield: {"time": 104687937377941, "data": "3fhb6wrt9bhsn5lhc94k2nbocb1g6xiipw2vokzofv5jxrmz8ef1bgupr7fvyqhkt99spw8w4fwsksv5z0ia"}
Client (null) received PUBLISH (d0, q0, r0, m0, 'devices/terasa/shield', ... (137 bytes))
2024-03-22T23:14:02+0100 devices/terasa/shield: {"time": 104688170989991, "data": "mocthougurtmpftuwy117jyrvgumtonlcyuel2zx8xrrgp6pn17svtsuheju9mui9s1y54n6vjikvp1hnrkevfbejhcy2o06ygel"}
Client (null) received PUBLISH (d0, q0, r0, m0, 'devices/terasa/shield', ... (128 bytes))
2024-03-22T23:14:02+0100 devices/terasa/shield: {"time": 104688406158459, "data": "xxl1sowjiyt5cp9w1b5sv7ae6hhzwj814qu8etvp1ob19rpqbdtpsu9j6i21gb9anu7tapf0aimjyi8wc9w6uqdy2t3"}
Client (null) received PUBLISH (d0, q0, r0, m0, 'devices/terasa/shield', ... (159 bytes))
2024-03-22T23:14:02+0100 devices/terasa/shield: {"time": 104688641784670, "data": "h3c7sdclomf2wus9p5z1wwbgb9j1s7tagq4afggqu34qj1dv8891djph9pjtln1pauw4mdl4gntzcj6no2ms4k0j5ds0kldky5f0paeghxogw0ymgi4k3505w0"}
Client (null) received PUBLISH (d0, q0, r0, m0, 'devices/terasa/shield', ... (72 bytes))
2024-03-22T23:14:02+0100 devices/terasa/shield: {"time": 104688870208741, "data": "0668si4bajmlg3ezqe5g24ku4pr7k096ry8"}
Client (null) received PUBLISH (d0, q0, r0, m0, 'devices/terasa/shield', ... (134 bytes))
2024-03-22T23:14:03+0100 devices/terasa/shield: {"time": 104689006256109, "data": "aik2f69y49spcjqmwmm7u4d6tdlaphmwqvnnnzd0agufxvh65shgg4l28gkwwjcv1sinv774j2i4dl0qpg0c4lvk7anwtpedt"}
Client (null) received PUBLISH (d0, q0, r0, m0, 'devices/terasa/shield', ... (65 bytes))
2024-03-22T23:14:03+0100 devices/terasa/shield: {"time": 104689235321058, "data": "1xrbu6pegh2ltvxyp5pjw7a5gyt2"}
Client (null) received PUBLISH (d0, q0, r0, m0, 'devices/terasa/shield', ... (121 bytes))
...

I am contemplating what to try next. Maybe I will try reintroducing the light+deep sleep.

vladak commented 7 months ago

Next thing I tried was adding the sensors back, so the program was acquiring the sensor readings and sending the data to the broker in a tight loop, every 0.1 second or so. The main loop looks like this:

    while True:                                                                 
        humidity, temperature, co2_ppm = sensors.get_measurements()             
        data = {}                                                               
        fill_data_dict(data, battery_monitor, humidity, temperature, co2_ppm)   

        if len(data) > 0:                                                       
            mqtt_topic = secrets[MQTT_TOPIC]                                    
            logger.info(f"Publishing to {mqtt_topic}")                          
            mqtt_client.publish(mqtt_topic, json.dumps(data))                   

        watchdog.feed()                                                         

        mqtt_timeout = 0.1                                                      
        logger.info(f"Waiting for MQTT event with timeout {mqtt_timeout} seconds")
        mqtt_client.loop(timeout=mqtt_timeout)

It has been running for more than a day without a problem.

dhalbert commented 6 months ago

There have been problems with MQTT keepalive timeouts in the past. I find the MQTT keepalive semantics confusing. But you should not be hitting an internal watchdog in any case.

I see in your setup documentation that there is some kind of 15 minute timeout on the local MQTT server. But are there other (default) timeouts that might be shorter?

vladak commented 6 months ago

There have been problems with MQTT keepalive timeouts in the past. I find the MQTT keepalive semantics confusing. But you should not be hitting an internal watchdog in any case.

I see in your setup documentation that there is some kind of 15 minute timeout on the local MQTT server. But are there other (default) timeouts that might be shorter?

The 15m cache timeout is only used in the MQTT Exporter (I had original problem with data not getting through over the WiFi for some reason so I extended the cache time to avoid N/A values in the Prometheus database, however as it is evident with my tight loop experiments above this is no longer an issue), so not related to the broker.

I have a long standing PR in the MiniMQTT that straightens the use of keep alive: https://github.com/adafruit/Adafruit_CircuitPython_MiniMQTT/pull/190 (which I'd appreciate if someone could look at) however that is not related to this issue.

vladak commented 6 months ago

The above tight loop with sensor readings has been running for 3 days contiguously (thanks to the power bank and also a bit due to the spring weather with some sunshine that charged the battery), without hitting the dreaded internal watchdog timeout. By now it has consumed both batteries so it is slowly descending towards drained battery, unless there is a lot of sunshine tomorrow which is likely, however because of these 2 properties (drained batteries and no sight in reproducing the problem quickly), it is time to conclude this step. For the curious, this is how the battery capacity looked in the last 24 hours (the power bank battery got consumed yesterday around 17:00 and the battery attached to the solar got charged a bit in the morning. The light monitoring comes from the system that runs the local MQTT broker and is in the near vicinity of the balcony that hosts the Feather so the light metrics are pretty accurate): feather32-v2-battery-with-powerbank

Anyhow, I am now thinking about more steps I can do:

vladak commented 6 months ago

In the next phase, I reduced the main() to basically this:

    watchdog.timeout = ESTIMATED_RUN_TIME                                       
    watchdog.mode = WatchDogMode.RAISE                                          

    pixel = neopixel.NeoPixel(board.NEOPIXEL, 1)                                

    logger.debug(f"MAC address: {wifi.radio.mac_address}")                      

    while True:                                                                 
        logger.info("enabling wifi")                                            
        wifi.radio.enabled = True                                               

        logger.info("Connecting to wifi")                                       
        wifi.radio.connect(secrets[SSID], secrets[PASSWORD], timeout=10)        
        logger.info(f"Connected to wifi")                                       
        logger.debug(f"IP: {wifi.radio.ipv4_address}")                          

        pool = socketpool.SocketPool(wifi.radio)  # pylint: disable=no-member   

        broker_addr = secrets[BROKER]                                           
        broker_port = secrets[BROKER_PORT]                                      
        mqtt_client = mqtt_client_setup(pool, broker_addr, broker_port, log_level, socket_timeout=0.1)

        logger.info(f"Attempting to connect to MQTT broker {broker_addr}:{broker_port}")
        mqtt_client.connect()                                                   

        rnd_string = "".join(                                                   
            random.choice("abcdefghijklmnopqrstuvwxyz0123456789")               
            for i in range(random.randint(16, 128))                             
        )                                                                       
        data = {"time": time.monotonic_ns(), "data": rnd_string}                

        mqtt_topic = secrets[MQTT_TOPIC]                                        
        mqtt_client.publish(mqtt_topic, json.dumps(data))                       

        blink(pixel)                                                            

        watchdog.feed()

        mqtt_timeout = 0.1                                                      
        logger.info(f"Waiting for MQTT event with timeout {mqtt_timeout} seconds")
        mqtt_client.loop(timeout=mqtt_timeout)                                  

        logger.info("Disconnecting from MQTT broker")                           
        mqtt_client.disconnect()                                                

        logger.info("Disabling wifi")                                           
        wifi.radio.enabled = False

It has been running for more than 3 days before the battery and the power bank got drained. Once I connected charged power bank back, it entered safe mode and reported 'Power dipped' which I guess is correct. I let it run for another 3 days, there were no problems. I verified with traffic dump that for each iteration of the cycle there was DHCP request/reply. The iterations were pretty slow, couple of seconds for each, thanks to the 'connecting to wifi' part - the rest of the iteration was quick. Sometimes it seemed that the blue neopixel was not blinking for extended time periods however it always recovered on its own. One factor that might have contributed to the stability is that I physically moved the Feather to a room with WiFi extender in the immediate vicinity (maybe 3 feet away) so I might re-try this by placing the Feather to its original place on the balcony which still has some 4 meters to the WiFi AP (plus the Feather has external antenna) but it might be worth trying this just in case.

After that, in the next phase I will try the light/deep sleep in a tight loop.

vladak commented 6 months ago

Also, I keep wondering whether the operating system could simply reset itself upon encountering the internal watchdog timeout, like many conventional OSes do.

vladak commented 6 months ago

In the next step I reduced the code to publish a random message to the MQTT broker and enter deep sleep for 1 second. Complete code is attached. It has been running for days on end (with the power bank attached to the battery charger). I have almost run out of ideas on what to try next. I am now seriously considering returning back to CP 8. Again, if someone can produce a CP 9 debug build that will capture the state of the RTOS when it hits the internal watchdog timer, I will run it and capture the necessary data.

code.py.txt

vladak commented 6 months ago

Also, I keep wondering whether the operating system could simply reset itself upon encountering the internal watchdog timeout, like many conventional OSes do.

I have just discovered safemode.py and https://gist.github.com/anecdata/fe35dc6a94069fc920edf61a64750b53 which looks very useful (thanks @anecdata !). While it seems it will suffice to workaround my problem, it would be nice to have the means to find the root cause. Could CP possibly extract even more information (register dumps, stack traces) from the RTOS and store it somewhere in supervisor.runtime or such so that it can be consumed and passed on ?

dhalbert commented 5 months ago

The MQTT library and other libraries have received a bunch of changes over the past week or so, including having to do with timeouts. Could you re-test with 9.1.0-beta.2 and the latest libraries so we can get a new baseline? There are changes that will not be in the bundles until tonight, so you could fetch from the repos or just wait until tomorrow.

vladak commented 5 months ago

The MQTT library and other libraries have received a bunch of changes over the past week or so, including having to do with timeouts. Could you re-test with 9.1.0-beta.2 and the latest libraries so we can get a new baseline? There are changes that will not be in the bundles until tonight, so you could fetch from the repos or just wait until tomorrow.

I have since switched to Feather ESP32-S2 with RFM69 which works really well and consumes far less power than the Wifi transport so I can no longer reproduce it (I think, will have to check the safemode logs next time I get to connect the Feather over USB). Anyway, I don't think the recent MiniMQTT changes would fix this. It does not probably make sense to keep this issue open, until someone else hits it.