Improve documentation & handling of spurious watchdog reset after power up in ESP32 revision 0 devices, often perceived as WiFi failure on every other power on.

gsuberland commented 3 years ago

Espressif published an ECO documenting errata in the ESP32 revision 0 silicon. The first issue, 3.1, is "A spurious watchdog reset occurs when ESP32 is powered up or wakes up from Deep-sleep". This issue has been fixed in a newer silicon release, but a large proportion of the ESP32 development boards on the market suffer from this issue. I have a number of ESP-WROOM-32 modules that are affected.

ESP-IDF v1.0 and later apparently has a workaround for the deep sleep side of things (although judging by #796 a complete fix may not be integrated here yet?) but on first power reset (e.g. after programming) affected ESP32 devices will initially appear to (mostly) work, but then reset after a short time due to the spurious WDT reset. A side-effect is that some of the inbuilt peripherals fail to behave properly during this first boot, whereas serial output does work, leading to misunderstandings of the root cause of the problem when users run into it.

I believe that a reasonable portion of users reportedly experiencing issue #2501 are actually running into this silicon bug. Most users' code attempts to call WiFi.begin() in setup. During first boot after power on, access to the WiFi peripheral is flaky. The begin() call can hang, or it can appear to succeed but the status will never reach the connected state. After a while the spurious WDT reset will kick in and the device will boot as normal. At this point the device will function correctly. This can produce the impression that the WiFi works every other time. Users who write something like WiFi.begin(...); /* ... */ WiFi.begin(...); may be tricked into thinking that this works because they see two sets of output, which they attribute to their code, when in reality the board is hanging on the first power on and running normally after the WDT reset. WiFi is almost certainly nothing to do with it, it just happens to be something that causes setup to take long enough during setup to coincide with the spurious WDT reset.

Put simply, the behaviour of this silicon bug is consistent with user issues that follow this pattern:

Device is powered on, reset via the button, or flashed
Serial output comes through
An attempt is made to use some functionality (e.g. WiFi)
That attempt hangs or otherwise misbehaves
The device resets roughly 10 seconds after boot
Things start working again on the second run

Users with debug output enabled at the same baud rate as the Serial peripheral is set to will see the WDT reset information coming through at the point of rebooting. Users with debug output enabled, but with the baud rate configured to a different value, will likely see junk on the serial output just before the device starts working again. Users with debug output disabled will just see the odd behaviour, the hang, and normal behaviour afterwards.

On affected ESP32 devices this issue reliably occurs on the first power on and after flashing, because both of these events trigger a full power on. The reset button will also cause this behaviour because pulling the reset pin to ground causes a POWERON_RESET event.

A simple workaround for most users affected by this issue is as follows:

#include <rom/rtc.h>

void setup() {
  if (rtc_get_reset_reason(0) == POWERON_RESET) {
  {
    ESP.restart();
  }
  /* ... */
}

This detects the power-on reset event and automatically restarts the ESP device, skipping waiting for the spurious WDT reset.

The problem I see with integrating this fix directly into the arduino-esp32 core is that it breaks user code that reads the reset reason and expects to see a power-on reset. In most cases such user code would be broken by the spurious WDT reset anyway, since they would still see the different reset reason after the spurious WDT reset, but I'm hesitant to suggest including any solution that masks the POWERON_RESET completely since there are edge cases (e.g. user code that initialises EEPROM contents when the reset reason is POWERON_RESET) that would still work despite the spurious WDT issue, but would be broken by a fix that doesn't send it.

As far as I know there's no generic way to store state past the reset (short of using EERPROM, which is a terrible idea) to indicate that a "workaround reset" was caused, so there's also no way to detect that this just happened and fake the result of rtc_get_reset_reason accordingly.

Bits [2] and [11:9] of the EFUSE_BLK0_RDATA3_REG register store pkg_version, which may be of use in detecting revision 0 devices. See pages 518-520 of the ESP32 technical reference manual for details. The package version can be detected as follows:

#define EFUSE_BLK0_RDATA3_REG 0x3FF5A00C
// pkg_version[2:0] = EFUSE_BLK0_RDATA3_REG[11:9]
#define PKG_VERSION_REG_LOW_BITMASK (0b111<<9)
#define PKG_VERSION_REG_LOW_SHIFT 9
// pkg_version[3] = EFUSE_BLK0_RDATA3_REG[2]
#define PKG_VERSION_REG_HIGH_BITMASK (1<<2)
#define PKG_VERSION_REG_HIGH_SHIFT -1

uint8_t get_esp32_package_revision()
{
  uint32_t blk0_rdata3 = REG_READ(EFUSE_BLK0_RDATA3_REG);
  uint8_t pkg_version =
    ((blk0_rdata3 & PKG_VERSION_REG_LOW_BITMASK) >> PKG_VERSION_REG_LOW_SHIFT) |
    ((blk0_rdata3 & PKG_VERSION_REG_HIGH_BITMASK) >> PKG_VERSION_REG_HIGH_SHIFT);
  return pkg_version;
}

I have tested this on my affected devices and it does indeed return 0; further testing would be required on newer boards to ensure that this is functioning as expected.

Being able to detect revision 0 devices is at least a step in the right direction. At minimum there should be a debug message at boot warning users of this issue, perhaps with a link to a page describing the problem and some fixes. I am willing to help write this documentation.

In summary, I believe the following steps would be helpful going forward:

Validate that the get_esp32_package_revision() code above behaves as expected on revision 0 and revision 1 devices (and others if they exist) in the hopes of detecting the likelihood of this issue at runtime.
Validate that the reported package version value holds a strong correlation with the behaviour described in the ESP32 errata.
Discuss potential strategies for automatic mitigation of the issue, in the hopes of coming up with something that doesn't break user code on either revision 0 or 1 devices.
Discuss alternative compile-time mitigation strategies (e.g. a workaround enabled by a preprocessor directive, perhaps set at the board level for known-affected development boards) if a sufficiently safe automatic mitigation cannot be devised.
Determine which open issues on GitHub are likely to be partially or wholly due to this silicon bug, and communicate the results of the aforementioned discussions in reply to those issues.
Add debug output at boot that warns users when they are running hardware that is known to be affected by this bug. Include sufficient information (ideally both a unique search term and URL to documentation) for a user to be able to find out more about the issue and work around it.
Write up documentation that covers the specifics of the issue, available workarounds and their side-effects/caveats, and a summary of the decisions that were made regarding a fix.

As noted before, I'm happy to help with as much of this as I reasonably can. I would also appreciate as much input as possible from more seasoned ESP32 developers and project maintainers, since this issue clearly affects a large number of users and I am hesitant to make potentially breaking changes in the code.

gsuberland commented 3 years ago

As a quick addendum, I have checked over a number of devices. Here are my findings:

DOIT ESP 32 DevKit V1

Three separate "DOIT ESP 32 DevKit V1" boards, purchased August 2020 from AliExpress, using the earlier style of modules marked "ESP-WROOM-32". Reports package version 0, has the reboot issue.

NodeMCU-32S (clone?)

A generic NodeMCU-32S style board, possibly a clone, with no markings other than "ESP32S" on the back. Uses the newer style of module marked "ESP32-WROOM-32", with the Espressif logo. Unknown purchase date. Reports package version 0, has the reboot issue.

TTGO LoRa32 915MHz

Purchased from AliExpress in June 2020. Marked T3_V1.6 20180606. Uses a bare ESP32-PICO-D4 chip rather than a module. Reports package version 5, does not exhibit the issue.

ESP32 LyraT Mini (v1.2)

This is an official Espressif dev board. Ordered from Mouser in February 2020. Uses an ESP32-WROVER-B module with the new style markings. Reports package version 1, does not exhibit the issue.

This initial investigation lends credence to the idea that we could detect the bug at runtime based on the package version number reported by the device. So far only package version 0 devices have exhibited the reboot issue.

igrr commented 3 years ago

Hi @gsuberland, thanks for taking time to write a comprehensive issue report! I have a few questions to clarify the behavior you are observing.

Device is powered on, reset via the button, or flashed Serial output comes through An attempt is made to use some functionality (e.g. WiFi) That attempt hangs or otherwise misbehaves The device resets roughly 10 seconds after boot

The sequence of steps you describe is not consistent with the "spurious WDT reset" issue in ESP32 rev.0 silicon. The spurious WDT reset happens almost immediately after the power-on reset, the serial output looks like this:

ets Jun  8 2016 00:22:57

rst:0x1 (POWERON_RESET),boot:0x13 (SPI_FAST_FLASH_BOOT)
flash read err, 1000
ets_main.c 371 
ets Jun  8 2016 00:22:57

rst:0x10 (RTCWDT_RTC_RESET),boot:0x13 (SPI_FAST_FLASH_BOOT)
configsip: 0, SPIWP:0xee
clk_drv:0x00,q_drv:0x00,d_drv:0x00,cs0_drv:0x00,hd_drv:0x00,wp_drv:0x00

No application code can be executed between the first boot attempt (with POWERON_RESET reason) and the 2nd one (with RTCWDT_RTC_RESET reason).

I think the description of this bug in the errata is not very clear — the WDT reset is only the appearance, not the root cause of the issue. The root cause is that the MMU is left in an incorrect/unexpected state after the power-on reset, so the ROM bootloader can not load the application and the chip eventually gets reset. For most intents and purposes, RTCWDT_RTC_RESET works the same way as the POWERON_RESET; except for that MMU issue!

For this reason, I think that the following workaround suggestion is not applicable, as the application code never gets executed until after the WDT reset:

This detects the power-on reset event and automatically restarts the ESP device, skipping waiting for the spurious WDT reset.

To put it another way, I think the spurious WDT reset may be a red herring when it comes to Wi-Fi connection issue you have linked.

During first boot after power on, access to the WiFi peripheral is flaky.

I have browsed through the linked issue and then through our internal ESP-IDF issue tracker, but failed to find any similar sounding issue filed and fixed in IDF. Together with this comment: https://github.com/espressif/arduino-esp32/issues/2501#issuecomment-708704583 it makes me think the issue is actually related to the interaction between the Wi-Fi driver and Arduino core.

On the chip where you observe the issue, does the behavior change if you wipe the contents of SPI Flash and then load the application again? You can use esptool.py erase_flash command to try this. Just in case it does, running esptool.py read_flash first to dump the "known-bad" content might also be useful.

gsuberland commented 3 years ago

Thanks for the quick response.

What completely baffled me today is that upon re-testing all of my boards, none of them are exhibiting the issue. Nothing was re-flashed, nothing changed, I'm using the exact same USB cable, and I didn't even exit the Arudino IDE between reporting the issue and now.

I thought that perhaps the issue was related to the board profile I had chosen, but after flashing the same code to one of the DOIT boards using the correct board profile, the Node32s profile, and the NodeMCU-32S profile, it still doesn't exhibit the original fault. I considered that I had perhaps been hitting the reset button at the wrong moment at the start of programming (discovered today I don't need to press it at all - whoops!) but that did not change the behaviour either.

After some further investigation I believe I have discovered the fault, and it was mine. The sketch I had uploaded when seeing the fault had a custom memory allocator as part of it, which had an uninitialised memory bug. This explains why I was seeing the fault on power-on, but not after the WDT reset, and why I am not seeing the fault now.

On first boot the RAM contents would be random and non-zero. The code was executing under the expectation that the memory would be zeroed, but this was not the case so it clobbered data at some random address. This did not cause sufficient damage to immediately fault or cause heap corruption detection to kick in, so my code would continue to run. Code elsewhere in the memory allocator later initialised the previously accessed memory to zeroes. At this point the system seemed stable but was actually in a time bomb state. As soon as any peripheral or functionality was accessed that relied upon the randomly clobbered address, it would lock up and trigger a WDT reset. However, since the WDT reset does not clear memory contents, the previously zeroed memory block would almost certainly be reused and lead to successful behaviour.

While uninitialised memory values should have been completely random each time, causing different behaviour, it seems that uninitialised memory state is sufficiently deterministic over a short time under the same environmental conditions that similar values were appearing at power on / reset each time. Something as simple as a difference in environmental temperature may have altered the charge leakage rate in the SRAM cells sufficiently to cause a different value to be present at power on, causing a different memory address to be modified when I tried today than when I was running into the problem before.

After patching the bug the crash has gone away completely.

Interestingly, I did still see the behaviour with the WiFi crashing using a minimal test sketch that did not contain the faulty code, uploaded after the initial buggy memory manager had run on the device at least once, but without disconnecting the power from the device completely. This may indicate that something in the WiFi stack is suffering from an uninitialised memory bug very similar to the one I just discovered in my own code. Perhaps this may guide further analysis work.

Anyway, apologies for the diversion! I really should've checked all my code more thoroughly before reporting this, but I thought my minimal test sketch was sufficient evidence of it being something in the silicon or core code.

espressif / arduino-esp32