espressif / esp-idf

Espressif IoT Development Framework. Official development framework for Espressif SoCs.
Apache License 2.0
13.66k stars 7.29k forks source link

esp_restart() intermittently won't reset, causing stuck esp32 (IDFGH-258) #2101

Closed Curclamas closed 1 month ago

Curclamas commented 6 years ago

Environment

Problem Description

Under some circumstances esp_restart() won't actually reset the ESP32 leaving it in a state where no more console output is generated and no reset done. The ESP32 only reboots then after power cycle or EN-toggle.

Expected Behavior

Under all circumstances a call to esp_restart() should reboot the device. Also when the device is in a "locked" state, there should be a watchdog reset.

Actual Behavior

Sometimes esp_restart() get's stuck. Then either no more console output is produced or just a lot of gibberish with no subsequent console output. No watchdog reset occurs.

Steps to repropduce

Hard to reproduce cause this happens in some safeguard timer functions (FreeRTOS timers) which are executed when some task won't finish in time.

Code to reproduce this issue

Example:

#include "arduino.h"
// this is the callback that is called by the timer
void resetCallback(void * args) {
    ESP_LOGE(ResetTimer::LOG_TAG, "ResetTimer not stopped in time, going down for reset in 5 secs");
    WiFi.disconnect();
    delay(5000);
    ESP_LOGW(ResetTimer::LOG_TAG, "going down for reset now"); //this line is actually printed
    ESP.restart(); //here either no more console output or random gibberish
}

Debug Logs

looks like that: EDIT: uploaded real log

W (446999) ResetTimer: going down for reset now
D (446999) event: SYSTEM_EVENT_STA_STOP
I (446999) wifi: flush txq
[D][WiFiGeneric.cpp:293] _eventCallback(): Event: 3 - STA_STOP
I (446999) wifi: stop sw txq
I (447005) K&O-Logger: Unhandled WiFiEvent //this is sent by WiFi Event handler
I (447010) wifi: lmac stop hw txq
D (447016) RTC_MODULE: Wi-Fi returns adc2 lock.
//nothing or gibberish

Other items if possible

sdkconfig.txt

igrr commented 6 years ago

Hi @Curclamas, i have a few questions: Do you know if the issue also happens without the WiFi.disconnect call? Does the issue happen in the minimal example (where you would attach resetCallback to a FreeRTOS timer from app_main, and enable WiFi), or something else is needed to reproduce this? How often do you observe this issue (once in how many resets)?

Curclamas commented 6 years ago

Hi @igrr thanks for your reply!

I can do some tests at the end of this week in order to narrow down the problem a little bit.

Two questions for you:

igrr commented 6 years ago

My gut feeling says we are seeing issues because resetCallback is called from low prio task and we maybe have some

I recall that in the other issue you raised (where spi_flash operation would deadlock) you had a high priority task doing HTTPS request and hogging all the CPU time until it receives the data it expects. What may be happening here is something similar: once WiFi is stopped, no more data is being received, but your HTTPS task is always in ready state (i.e. not getting blocked on a socket read). Because of this, timer task (or any lower priority task where esp_restart is called) does not get CPU time. This is a deadlock situation. If you still have high-priority worker tasks in your application, please check that under no circumstance they will keep polling something for extended periods of time, without going into blocked state, even if WiFi connection is stopped.

What issues could prevent a reset

At the moment the two possibilities I see are: 1) deadlock of some sorts, before esp_restart suspends the scheduler, and 2) bug in the actual chip reset code, which puts hardware into a halted state. We had bugs like 2) in the past, but recently such issues have not been reported. We can try to debug this, but need a viable way to reproduce the issue first...

How should such a "safety-reset-timer" be implemented better

In general, calling esp_restart should work both in esp_timer callback and FreeRTOS timer callback. Regarding doing complicated stuff in the callback, it is usually not recommended, but in this case the callback is the last thing that will run before the application restarts, so as long as the stack size is sufficient, you can do more complex things there.

Curclamas commented 6 years ago

@igrr thanks for the comprehensive reply. I have also a suspicion that something is hogging CPU, but there are two objections to that:

But nonetheless I will try some things out.

I have found out one more thing in the meanwhile: The Problem seams to be actually two different issues:

Curclamas commented 6 years ago

@igrr I have sad news:

So I updated the callback:

void resetCallback(void * args) {
    ESP_LOGE(ResetTimer::LOG_TAG, "ResetTimer not stopped in time, going down for reset NOW!");
    ESP.restart(); //here either no more console output or random gibberish
}

And now it's still stuck posting this to the logs:

E (24129219) ResetTimer: ResetTimer not stopped in time, going down for reset NOW!
I (24129220) wifi: state: run -> init (0)
I (24129220) wifi: pm stop, total sleep time: 22490421499 us / 24121253132 us

I (24129227) wifi: n:1 0, o:1 0, ap:255 255, sta:1 0, prof:1
D (24129232) event: SYSTEM_EVENT_STA_DISCONNECTED, ssid:K-O, ssid_len:3, bssid:00:1e:42:14:ec:56, reason:8
D (24129242) tcpip_adapter: if0 start ip lost tmr: enter
I (24129242) wifi: flush txq
D (24129247) tcpip_adapter: if0 start ip lost tmr: interval=120
I (24129250) wifi: stop sw txq
I (24129260) wifi: lmac stop hw txq
D (24129262) RTC_MODULE: Wi-Fi returns adc2 lock.
D (24[D][WiFiGeneric.cpp:293] _eventCallback(): Event: 3 - STA_STOP
129267) event: SYSTEM_EVENT_STA_STOP

as you see interesting enough there is still some WiFi stuff happening here, so I have the strong opinion we maybe found a bug. Maybe one side note: The reset timer was fired because one HTTPS request was stuck / taking over 2min or so. The main cause for this is our GSM/3G-Modem losing reception. So the ESP32 is still connected to WiFi and has an IP but the AP can not route any packages to the outside world. In our application this might happen any time.

EDIT: So to make this clear: the esp_restart() seams to be called (I am 100% that this causes the WiFi events and stuff) but does not properly reset the device. I think independent from the implementation here this should have worked. There is most probably some buggy edge case going on here.

vonnieda commented 6 years ago

@Curclamas Do you use a 32kHz RTC crystal on the custom board? Try disabling that in menuconfig. I've run into this issue with my board where a failure of the crystal to start will keep esp_restart() from working.

Curclamas commented 6 years ago

@vonnieda interesting to know! I've actually got an 32kHz crystal but it seams to work. Upon startup I get:

I (532) cpu_start: Pro cpu start user code
D (537) clk: waiting for 32k oscillator to start up
D (574) clk: RTC_SLOW_CLK calibration value: 15999232

nonetheless I will try to disable it and see if that helps.

EDIT: You're problem should be (if it differs from mine) also considered a critical bug I guess. Maybe you want to report it, when I've checked if its something different for me.

vonnieda commented 6 years ago

@Curclamas Yep, I had the same situation. The clock would start and then on some of them (with bad load caps) it would eventually stop, and this would cause esp_restart() to fail. Disabling the RTC clock in the menuconfig fixed it in every case.

Curclamas commented 6 years ago

@vonnieda some early results indicate this could have solved the issue, thank you very much!

@igrr it would be interesting to see investigated what causes this bug (from a software point of view) and from a hardware point of view: how to make sure that the RTC crystal does not stop? How could it be detected? How should the load capacitors be selected? I found some formulas on the internet but I did not fully understand yet how it's calculated, so I made a "ball park guess". I'd love if somebody could shine a little bit of light onto this.

vonnieda commented 6 years ago

@Curclamas Glad to hear it!

KonstantinKondrashov commented 6 years ago

Hi @Curclamas. In your sdkconfig file there is an option equals 100. This option defines how much impulses will get the crystal to start it. CONFIG_ESP32_RTC_XTAL_BOOTSTRAP_CYCLES=100. In recent IDF version, this option reduces to 5 (by default). We decreased this option because the oscillator can be oscillating and then stops. Also if crystal 32KHz does not run then a start will be from the internal RC of the chain with fit an error message.

Could you please to use options CONFIG_ESP32_RTC_XTAL_BOOTSTRAP_CYCLES=5 and 'CONFIG_ESP32_RTC_CLK_CAL_CYCLES=3000' and will try again. I think while start you will have a message RTC: Not found External 32 kHz XTAL. Switching to Internal 150 kHz RC chain.

Curclamas commented 6 years ago

@KonstantinKondrashov that solution could also work but for me it sounds a little risky to trust that the oscillator (if it's bad) will stop when there are only 5 cycles.

@igrr @vonnieda as indeed the RTC crystal seems to be the problem here: would there be some kind of possibility to add a safeguard against this failure mode? The idea would be to detect a crystal failure from the main CPU clock / ABP clock and switch over to internal RC chain.

Also any notes on the design of the actual crystal osciallator circuit would be appreciated since I followed some guidelines I found online but many of them differ. It would be interesting to see how you at espressif pick the RTC crystal and load caps.

JonathanWitthoeft commented 5 years ago

@igrr Can someone from Espressif please give an update on this issue? We are going to production and about 20% of our boards are giving us this error. We need to use a 32kHz external XTAL and disabling it is not a valid solution. So can we get a response on the following: 1) A guideline on the crystal oscillator circuitry including load capacitor values 2) A suggestion for config values to fix this issue

eagi223 commented 5 years ago

We're also having this issue still at v3.1.2. I've seen plenty of issues here and posts on esp32.com of people having this same issue, but just silence from Espressif. Is this being figured out or ignored?

Having some guidance on the crystal load cap values would be extremely helpful, as would an IDF fix that prevents the board from locking up/preventing deep sleep wakeup when/if this failure happens.

On our boards (custom using just the ESP32 IC with an external 32kHz crystal, not a ESP32 module) we'll have some devices that will never recover when esp_reboot() is called (checking the crystal with an oscilloscope shows that it failed to start AND it isn't defaulting to the internal crystal. Some boards will have the crystal stop after boot up (which appears to not crash the firmware as it keeps operating normally, but will never wakeup from deep sleep). We're having boards totally bricked in the field when users OTA to our newest firmware because (we're assuming) we've enabled the external crystal.

darkfader commented 5 years ago

I had issues with a small 32kHz crystal too (9pF, 2x18pF). What I now do for now is manually starting the crystal (once, after retrieving settings from server), wait 10 sec. and see if it's still running; if it is, keep using it. Also had to increase DAC_VAL to 2 (defaults to 1) for more drive strength?. And wrap the enable/select_rtc_slow_clk in startup code with "if (!rtc_clk_32k_enabled()) { ... }" to keep those RTC values after waking up from deep sleep (and without disturbing the nicely calculated ULP wakeup timer). https://github.com/darkfader/esp-idf/commits/xtal_32k_checks Edit: some additional code is required still: calibration.

thehellmaker commented 5 years ago

Hi Guys, I have ESP stuck in production once everymonth or so that doesn't happen in lab testing. But I believe this is the issue. I wanna try disabling RTC.

I went thru all menuconfigs in https://docs.espressif.com/projects/esp-idf/en/latest/api-reference/kconfig.html#config-esp-task-wdt

But I am not sure how to disable RTC as mentioned above. Which option should i disable in menuconfig

vonnieda commented 5 years ago

I am experiencing this now in v3.3. After an OTA completes I call esp_restart from a timer task, and it fails every time. The timer callback is simply:

static void xx_hal_shutdown_timer_cb(void* arg) {
    switch (shutdown_type) {
        case XX_HAL_SHUTDOWN_TYPE_RESTART: {
            esp_restart();
            break;
        }
    }
}

This worked fine in 3.0, but in 3.3 I see on the console:

I (75462) wifi: state: run -> init (0)

The chip is still responsive if I then issue another esp_restart() from a different task the chip will restart fine. It seems to be specific to running from the timer callback.

vonnieda commented 5 years ago

Based on comments above, I am focusing on WiFi. I have now found that if I don't initialize WiFi at all, this works fine, but once I've initialized WiFi it doesn't work. If I initialize WiFi but don't connect it still doesn't work. In that case I get:

I (13053) wifi: flush txq
I (13053) wifi: stop sw txq
vonnieda commented 5 years ago

I have now verified that this is related to WiFi and have filed a new issue at #4115.

ryan4volts commented 3 years ago

Interesting to note that I can reproduce this with custom hardware and using the simple hello-world app and enabling RTC. This occurs for me on ~20% of my boards also.

Jayshree30 commented 3 years ago

Environment

  • Development Kit: custom board
  • Core (if using chip or module): ESP32-D0WDQ6
  • IDF version: master
  • Development Env: C++ / arduino as idf component
  • Power Supply: external 5V

Problem Description

Under some circumstances esp_restart() won't actually reset the ESP32 leaving it in a state where no more console output is generated and no reset done. The ESP32 only reboots then after power cycle or EN-toggle.

Expected Behavior

Under all circumstances a call to esp_restart() should reboot the device. Also when the device is in a "locked" state, there should be a watchdog reset.

Actual Behavior

Sometimes esp_restart() get's stuck. Then either no more console output is produced or just a lot of gibberish with no subsequent console output. No watchdog reset occurs.

Steps to repropduce

Hard to reproduce cause this happens in some safeguard timer functions (FreeRTOS timers) which are executed when some task won't finish in time.

Code to reproduce this issue

Example:

#include "arduino.h"
// this is the callback that is called by the timer
void resetCallback(void * args) {
    ESP_LOGE(ResetTimer::LOG_TAG, "ResetTimer not stopped in time, going down for reset in 5 secs");
    WiFi.disconnect();
    delay(5000);
    ESP_LOGW(ResetTimer::LOG_TAG, "going down for reset now"); //this line is actually printed
    ESP.restart(); //here either no more console output or random gibberish
}

Debug Logs

looks like that: EDIT: uploaded real log

W (446999) ResetTimer: going down for reset now
D (446999) event: SYSTEM_EVENT_STA_STOP
I (446999) wifi: flush txq
[D][WiFiGeneric.cpp:293] _eventCallback(): Event: 3 - STA_STOP
I (446999) wifi: stop sw txq
I (447005) K&O-Logger: Unhandled WiFiEvent //this is sent by WiFi Event handler
I (447010) wifi: lmac stop hw txq
D (447016) RTC_MODULE: Wi-Fi returns adc2 lock.
//nothing or gibberish

Other items if possible

sdkconfig.txt

Hi,

Did anyone got solution for this issue, as i am also facing the same issue of STA-disconnect whenever use esp_restart(); i am using in our application. Wifi get stuck and further not able to connect with wifi till we manually reset the device. Please let me know if any follow up or solution for this to resolve the issue.

[2021-06-21 13:25:51.099] I (2650039) wifi:state: run -> init (0) [2021-06-21 13:25:51.099] I (2650039) wifi:pm stop, total sleep time: 2261045237 us / 2646044124 us [2021-06-21 13:25:51.099] [2021-06-21 13:25:51.099] I (2650039) wifi:new:<10,0>, old:<10,0>, ap:<255,255>, sta:<10,0>, prof:1 [2021-06-21 13:25:51.099] W (2650039) wifi:hmac tx: stop, discard [2021-06-21 13:25:51.099] W (2650049) wifi:hmac tx: stop, discard [2021-06-21 13:25:51.099] I event_manager: 295: SYSTEM_EVENT_STA_DISCONNECTED] [2021-06-21 13:25:51.099]

sblantipodi commented 3 years ago

same issue here, ESP.restart() makes the ESP32 stuck and a manual reboot is needed.

DatanoiseTV commented 2 years ago

Any update on this one? @espressif-bot

wolfwander commented 2 years ago

I have the exact same issue. My ESP32 get stuck after and give me gibberish chars on serial output.

AxelLin commented 2 years ago

Hi,

Did anyone got solution for this issue, as i am also facing the same issue of STA-disconnect whenever use esp_restart(); i am using in our application. Wifi get stuck and further not able to connect with wifi till we manually reset the device. Please let me know if any follow up or solution for this to resolve the issue.

[2021-06-21 13:25:51.099] I (2650039) wifi:state: run -> init (0) [2021-06-21 13:25:51.099] I (2650039) wifi:pm stop, total sleep time: 2261045237 us / 2646044124 us [2021-06-21 13:25:51.099] [2021-06-21 13:25:51.099] I (2650039) wifinew<10,0>, old:<10,0>, ap:<255,255>, sta:<10,0>, prof:1 [2021-06-21 13:25:51.099] W (2650039) wifi:hmac tx: stop, discard [2021-06-21 13:25:51.099] W (2650049) wifi:hmac tx: stop, discard [2021-06-21 13:25:51.099] I event_manager: 295: SYSTEM_EVENT_STA_DISCONNECTED] [2021-06-21 13:25:51.099]

The STA-disconnect whenever use esp_restart() is a different issue. Please create a new issue instead.

wolfwander commented 2 years ago

What is your esp-idf version (git describe --tags )?

I'm using Arduino IDE cause I ported an old project to ESP32.

thehellmaker commented 2 years ago

esp-idf or ArduinoIDE we faced the same issue over 1 year. We realized post serious debugging that ESP gets stuck completely if there is electrical noise in the system and never gets back. The only solution for this is a hardware Watchdog. We used a push pull circuit to ensure if ESP is not giving watchdog feed on a particular GPIO connected to external hardware WDT and external circuit resets ESP32

roysG commented 2 years ago

Any update on this issue?!

roysG commented 2 years ago

Someone tried this solution? https://github.com/espressif/esp-idf/issues/4115#issuecomment-565368997

Or maybe this one: https://iotassistant.io/esp32/enable-hardware-watchdog-timer-esp32-arduino-ide/

?

ozanerturk commented 1 year ago

We also faced a similar issue as yours. Our setup includes an ESP-32, IDF version 4.3.2, and we use both WIFI in AP and STA mode as well as MQTT connections. It is a difficult problem to diagnose, as we have dozens of devices connected to various access points all around the world.

It seems that the issue is dependent on the access point to which our devices are connected. When the device executes the esp_restart() function, it becomes stuck. However, we have implemented a high-priority task to restart the device each hour. his task saved our ass so many times. And it shows us that even the device becomes stuck some tasks keep working in the background. We believe esp stuck when trying to kill wifi tasks and unable to continue the restart routine.

You can find some of our finding below, hope you can find some similarity with yours

Best regards,Fail-fast

evtimDev commented 1 year ago

We recently hit an issue where esp_restart() will not reboot the device while https post is hanging, not sure if this is the same as the issue described here.

The solution that worked for us was to use esp_system_abort() instead - that call was succeeding in rebooting the esp.

AxelLin commented 1 year ago

@igrr Any further update for the issue? The stuck issue could be a big issue because some device is installed at a remote site. It takes many hours drive to the remote site just for manually re-power-on the device when it got stuck.

MisterScience5 commented 11 months ago

This seems relatively similar to my issue, WIFI-MESH: Loss of parent connection causes lockup, requires watchdog reboot (IDFGH-11108) - however, I have been able to reproduce this issue with development boards also.

AxelLin commented 9 months ago

@0xjakob Any update to fix this issue?

AxelLin commented 8 months ago

@0xjakob Any update to fix this issue?

@0xjakob Are you still working on this issue?

0xjakob commented 8 months ago

@AxelLin So far we didn't find the root cause of this problem. It is still in our backlog, though, and we won't forget it.

AxelLin commented 6 months ago

@0xjakob

The consequence of this issue is a dead device which needs manually re-poweron. Any chance to raise the priority?

cristovao-trevisan commented 6 months ago

This issue has bothered me for a few years, and there are a few github issues that are probably related to this. We have a few products that use ESP32 and stopped working on the field, and the only solution is a manual reboot. After a long time dealing with this I've found the root cause and have a few possible solutions.

Root cause

The cause is the 32k external clock. If you're not using, the problem is unrelated to mine. It just stops clocking, so if you're using light/deep sleep with external clock you may run into this issue. There are a few triggers for this to happen more often, which are related the the crystal physics:

Solutions

In my case I use a combination of the following solutions:

0xjakob commented 6 months ago

Hello,

indeed, it looks like the main issue is with the crystal not working properly (see also the assessment of other users above). Espressif actually has a hardware design guide for the ESP32 for a while now. If there is a problem with a custom PCB (i.e., when not using any Espressif module), please check with the hardware design guide.

If anyone is experiencing the above problem with an Espressif module or development board, please let us know here.

0xjakob commented 1 month ago

Hello again.

As most of the posts here seem to be related to the crystal oscillator and the hardware design guide contains guidelines for the crystal area, we're closing this issue. If you have an issue that cannot be resolved by following the hardware design guide, please open another issue with a detailed description, including whether an Espressif module or a custom PCB has been used. If you have a similar issue that you believe is not related to the crystal, please open a new issue as well.

As a rule of thumb, if you have a device that shows not the exact same problem (after reduction to a minimal reproducing example), you may open a new issue. Separate issues help us to follow up on the individual problems and to sort and categorize to get an overview, even if some turn out to have the same root cause later on.

Thanks!