espressif / arduino-esp32

Arduino core for the ESP32
GNU Lesser General Public License v2.1
13.27k stars 7.35k forks source link

light_sleep in v3.0-dev has several issues #908

Closed tferrin closed 5 years ago

tferrin commented 6 years ago

Hardware:

Board: Adafruit Huzzah32 Core Installation/update date: 6/dec/2017 (v3.0-dev-1325-ge5b2c1cf) IDE name: Arduino IDE 1.8.5 Flash Frequency: 80MHz Upload Speed: 115200

Description:

There are several problems with the current implementation of light_sleep. Since these are all related I'm including them together in one ticket.

Synopsis: 1) A timer-based light_sleep sometimes never wakes up. The WDT appears to never trigger and the call to esp_light_sleep_start() never returns. 2) The millis() timer doesn't tick during light_sleep leading to values that have nothing to do with the current run time of the program. It is easy to correct the value returned by millis() in user code, but a better place to do this is in the underlying implementation of millis(). 3) When light_sleep is made to work for longer time periods (see details below), the getimeofday() and time() clocks tick too fast.

Details: 1) timer-based light_sleep sometimes never wakes up: calling suspend() in the sketch below will eventually cause esp_light_sleep_start() to never return. Small values of sleep (e.g., 10ms) will usually cause the hang to happen within a few minutes. Larger values (e.g., 1000ms) may take a few hours before the hang occurs. To me, this behavior is indicative of a race condition in esp_light_sleep_start() that is "tickled" the more frequently the function is called. I have tried several fixes for this issue without success and its root cause remains elusive.

2) millis() timer doesn't tick during light_sleep: When light_sleep stalls the cpus, normal timers don't tick, including the timer used for the millis() function, and this can often lead to unexpected behavior in programs. The IDF provides the needed hooks to correct millis()'s notion of time (see adjustedMillis() in attached source code), but IMHO it would be much better if the underlying implementation of millis() incorporated this timing correction.

3) light_sleep causes getimeofday() and time() to tick too fast: The sample output below provides the values of several time counters each second. Over an interval of a couple hours, these differ from real time by ~8 seconds on my WROOM-32 module. To see this, run the sketch and compare its output with the stopwatch on your smartphone. This behavior may be a byproduct of the different oscillator used for the RTC during light_sleep mode, but trying to re-sync the ESP32 TOD clock via NTP doesn't seem to work because the difference between the real (NTP) time and ESP32 TOD clock keeps growing. If the TOD clock cannot be made more accurate, then at least it needs to be correctable by LWIP's sntp module. [An example of attempting to re-sync the TOD clock via NTP is not included here but can be furnished on request.]

Sketch:

Program source: programSource.txt

Debug Messages:

Sample output: sampleOutput.txt

helmut64 commented 6 years ago

Hi Tom, I am working also on sleep code with the ESP32, I am using as well direct programmed 64-bit timers for wakeup and time counting, here are come comments from an ESP32 beginner.

a) Check out that you are using chip revision 1 (ESP.getChipRevision()) With the 64-bit timers I had major reliability problems, they completely disappeared with rev1. Maybe the same may be true for your freezing light sleep. The good news is that rev1 fixes the long errata list, so I start my project with rev1 only.

b) 32KHz RTC crystal clock I know that some ESP32 modules are (e.g. Heltec) using an additional 32kHz RTC clock, however I don’t know if the existence of an crystal will be detected automatically or if special SDK lib versions are needed. With the DevKit modules they use an internal RTC ossizillator and it does not work accurate.

c) Estimating sleep duration For this I read the RTC time before sleep and on the wakeup after sleep. The elapsed time I add to my own time counting. Unfortunately this is not very accurate but better than nothing. The code is at end.

Another fact is that a reset (EN low/high) will reset the RTC to zero even a battery is connected. For more accurate time an external temperature compensated RTC chip may be needed, e.g. DS3231.

Regards Helmut

static uint64_t ESP32_get_rtc_time_us()
{
    SET_PERI_REG_MASK(RTC_CNTL_TIME_UPDATE_REG, RTC_CNTL_TIME_UPDATE_M);
    while (GET_PERI_REG_MASK(RTC_CNTL_TIME_UPDATE_REG, RTC_CNTL_TIME_VALID_M) == 0) {
        ;
    }
    CLEAR_PERI_REG_MASK(RTC_CNTL_TIME_UPDATE_REG, RTC_CNTL_TIME_UPDATE_M);
    uint64_t low = READ_PERI_REG(RTC_CNTL_TIME0_REG);
    uint64_t high = READ_PERI_REG(RTC_CNTL_TIME1_REG);
    uint64_t ticks = (high << 32) | low;
    return ticks * 100 / (rtc_clk_slow_freq_get_hz() / 10000); // scale RTC_CTNL_SLOWCLK_FREQ to avoid overflow
}
helmut64 commented 6 years ago

Hi Tom, I was testing your sample code and got the problem you are describing. Two symptoms I have seen: (rev1 DevKit)

tferrin commented 6 years ago

I checked and the chip revision of my WROOM-32 is 1, so it's not the old version of the chip. The Adafruit board I have does not use an external 32KHz crystal for the RTC, so an inaccurate internal oscillator could easily account for the fast ticking I've observed. The delay(1) after suspend is most interesting. I'll give that a try tonight. Perhaps other RTOS tasks need a chance to run after the cpus have been stalled.

lonerzzz commented 6 years ago

@tferrin @helmut64 Keep in mind that the delay(1) causes a task switch rather than actually performing a delay on the ESP32 so that might be feeding another task/thread for you.

tferrin commented 6 years ago

Yes. Understand. I've added a call to yield() -- which I'm pretty sure just calls vPortYield() on the ESP32 -- immediately after esp_light_sleep_start() and my test program has been running for an hour now, so a good sign. Needs more testing but my fingers are crossed.

Just to be clear, my suspend() function is now...

void suspend(uint32_t ms)
{
  uint64_t time_in_us = ms * 1000LL;
  esp_err_t ret;

  ret = esp_sleep_enable_timer_wakeup(time_in_us);
  if (ret == ESP_OK) {
    ret = esp_light_sleep_start();
    yield();     // <== NEW!
  }
  if (ret != ESP_OK) {
    log_e("light_sleep error: %d", ret); // likely WiFi wasn't off
    delay(ms); // if we can't stall the cpus at least we can call delay
  }
}

EDIT: Well, no change in behavior. Test program hung after running for a couple of hours, same as before.

tferrin commented 6 years ago

I decided to simplify the test program that calls the suspend() function posted above and not do any serial output at all. Instead I just flash a LED every 1 second or so, then call suspend() again. This has now been running ~6 hours without hanging! So I'm now thinking that serial output to the USB bridge plays a role in causing the hangs I described in my first post above.

helmut64 commented 6 years ago

What I do for another MCU (STM32L4 mbed), before I go to deep sleep I wait that the serial output is flushed to avoid problems. I have not looked into the details of the serial output implementation of the ESP32 for Arduino. For the Arduino D21 (SAMD) the code it is being a synchronous output.

I also noticed that an ESP32 interrupt routing print on the serial may hang sometimes, looks like the prints go async.

tferrin commented 6 years ago

Okay, after doing a lot more testing I've reached the conclusion that the fundamental problem of the ESP32 hanging during light_sleep has nothing to do with serial I/O. I think it is related to calling the gettimeofday() function. I've provided a new test case below that is essentially Espressif's own unit test code, except instead of calling the light_sleep function only once I call it repeatedly. As noted in my original post above, how long the program sleeps for has a direct bearing on how long this code runs before hanging. Long sleep times rarely hang, while short sleep times will hang the ESP32 within a couple minutes. The test code here will hang in an indeterminate processor state within a minute or two.

#include <sys/time.h>

#define TEST_ASSERT_INT32_WITHIN(delta, expected, actual) \
    if (abs(actual - expected) > delta) printf("assertation failed")

// choose one of the following...
#define VERY_SHORT_SLEEP
//#define SHORT_SLEEP
//#define LONG_SLEEP

#ifdef LONG_SLEEP
  #define SLEEP_TIME 2000000    // 2s
  #define SKIPCNT 0
#elif defined(SHORT_SLEEP)
  #define SLEEP_TIME 250000     // 250ms
  #define SKIPCNT 3
#elif defined(VERY_SHORT_SLEEP)
  #define SLEEP_TIME 10000      // 10ms
  #define SKIPCNT 75
#else
  #error "must define a sleep interval"
#endif

#define EXPECTED (SLEEP_TIME/1000)

void dotest()
{
    // code below from esp-idf-master/components/esp32/test/test_sleep.c
    // see entry TEST_CASE("wake up from light sleep using timer", "[deepsleep]")

    esp_sleep_enable_timer_wakeup(SLEEP_TIME);
    struct timeval tv_start, tv_stop;
    gettimeofday(&tv_start, NULL);
    esp_light_sleep_start();
    gettimeofday(&tv_stop, NULL);
    float dt = (tv_stop.tv_sec - tv_start.tv_sec) * 1e3f +
               (tv_stop.tv_usec - tv_start.tv_usec) * 1e-3f;
    TEST_ASSERT_INT32_WITHIN(500, EXPECTED, (int) dt);
}

void heartbeat(int skipcnt = 0)
{
  static int n = 0;
  if (skipcnt < 0) return;
  if (n++ >= skipcnt) {
    digitalWrite(LED_BUILTIN, HIGH);
    delay(50);
    digitalWrite(LED_BUILTIN, LOW);
    delay(50);
    n = 0;
  }
}

void setup()
{
  pinMode(LED_BUILTIN, OUTPUT);
  Serial.begin(115200);
  Serial.printf("\nSDK version: %s\n", ESP.getSdkVersion());
  delay(100);   // make sure output has drained
}

void loop()
{
  heartbeat(SKIPCNT);
  dotest();
}
Borisvl commented 6 years ago

I also have experienced this problem. Does anyone have an update on this issue?

Electrozest commented 6 years ago

I am seeing exact same issue when esp_light_sleep_start() is called. I am typically calling with a delay less than 1000ms set with the esp_sleep_enable_timer_wakeup(microseconds) function. Periodically at random the esp_light_sleep_start() function blocks and never returns to the calling code.

Even worst, the esp_light_sleep_start() function is enabling and disabling the hardware watchdog timer, and the hardware watchdog timer is sometimes in the disabled state when the block occurs, which means that the CPU effectively stalls with no hardware WDT reset or autonomous recovery from an error state.

Calling the esp_light_sleep_start() function once every second the ESP32 becomes unusable after a few (1-3) hours and a manual reset is then required to recover. This makes the ESP32 unsuitable for tasks requiring remote operation if light sleep is required and hardware WDT is not guaranteed to operate.

Electrozest commented 6 years ago

This is a critical issue for anyone considering the use of the ESP32 with light sleep mode.

Electrozest commented 6 years ago

In sleep_modes.c in esp_light_sleep_start() function, after the code which potentially blocks, the hardware WDT is enabled with a fixed 1 second timeout:

// Safety net: enable WDT in case exit from light sleep fails
rtc_wdt_enable(1000);

The above means that light sleeps greater than one second will trigger the hardware WDT watchdog timer and cause a reset, so above code does not take into account the length of the light sleep which may be longer than 1 second.

After the light sleep the function calls: rtc_wdt_disable();

The above is called before: portEXIT_CRITICAL(&light_sleep_lock);

Not sure if the above can block?? If it does the hardware WDT is disabled. Also after this point the users code is no longer protected by the hardware WDT if they wanted to use that feature.

tferrin commented 6 years ago

Note that Espressif is planning on implementing "automatic" light sleep in the IDF soon, as mentioned here: https://github.com/espressif/arduino-esp32/pull/1165#issuecomment-376717287

stale[bot] commented 5 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

stale[bot] commented 5 years ago

This stale issue has been automatically closed. Thank you for your contributions.