esp-rs / esp-wifi-sys

Wi-Fi and BT drivers packaged for integration into bare-metal esp-wifi.
Apache License 2.0
399 stars 93 forks source link

esp32/embassy: stability issues #226

Closed manio closed 1 year ago

manio commented 1 year ago

Hi! I am currently testing my DevKitC ESP-WROOM-32U. After some consideration I've chosen the no-std and esp-wifi with the embassy. Generally speaking my project is compiling and "quite" working. My main problem is stability. Sometimes it is working for several minutes, sometimes even some hours. I even enabled the watchdog - and it is resetting correctly when the problem occurs but this is rather a workaround for the problem.

As a last resort I just compiled and flashed a pure esp-wifi example (embassy_dhcp) to eliminate the factor of something wrong with my code. The result is unfortunately the same - the device is working for several hours max. Then it stops working (I believe that the embassy async tasks just stop working) - so the ping reply also stops. The chip is not hot (but It has to work with full CPU speed because of wifi).

I understand that there are some glitches/instability issues at this stage of adoption.

I'll be glad if you can confirm/deny this problem on your environment/devices - in other words: I am just wondering if this is only affecting me (or my device) - or this is rather some general problem which will hopefully be fixed in the future versions...

btw: when the program is starting I've got:

WARN - esp_wifi_internal_tx 12290
WARN - Unknown wifi mode in link_state

I hope this is OK. Once during my testing I have this line in the log: INFO - Unhandled event: StaBeaconTimeout

When starting I've got:

I (27) boot: ESP-IDF v5.1-dev-1498-g2d9e75bf05 2nd stage bootloader
I (27) boot: compile time Oct 28 2022 14:39:19
I (29) boot: chip revision: V300
I (33) boot_comm: chip revision: 3, min. bootloader chip revision: 0
I (40) boot.esp32: SPI Speed      : 40MHz
I (45) boot.esp32: SPI Mode       : DIO
I (49) boot.esp32: SPI Flash Size : 4MB
I (54) boot: Enabling RNG early entropy source...
I (59) boot: Partition Table:
I (63) boot: ## Label            Usage          Type ST Offset   Length
I (70) boot:  0 nvs              WiFi data        01 02 00009000 00006000
I (78) boot:  1 phy_init         RF data          01 01 0000f000 00001000
I (85) boot:  2 factory          factory app      00 00 00010000 003f0000
I (93) boot: End of partition table
I (97) boot_comm: chip revision: 3, min. application chip revision: 0
I (104) esp_image: segment 0: paddr=00010020 vaddr=3f400020 size=12ee4h ( 77540) map
I (140) esp_image: segment 1: paddr=00022f0c vaddr=3ffb0000 size=010b4h (  4276) load
I (142) esp_image: segment 2: paddr=00023fc8 vaddr=3ffbc9bc size=00190h (   400) load
I (146) esp_image: segment 3: paddr=00024160 vaddr=40080000 size=0ad10h ( 44304) load
I (173) esp_image: segment 4: paddr=0002ee78 vaddr=00000000 size=011a0h (  4512)
I (175) esp_image: segment 5: paddr=00030020 vaddr=400d0020 size=5a5bch (370108) map
I (317) boot: Loaded app from partition at offset 0x10000
I (317) boot: Disabling RNG early entropy source...

Steps to reproduce:

  1. Compile and run the embassy_dhcp example: cargo run --example embassy_dhcp --release --features "async,embedded-svc,wifi,embassy-net" Note: I disabled the TCP connection to remote server like this:
    
    diff --git a/examples-esp32/examples/embassy_dhcp.rs b/examples-esp32/examples/embassy_dhcp.rs
    index f002e90..fb06b36 100644
    --- a/examples-esp32/examples/embassy_dhcp.rs
    +++ b/examples-esp32/examples/embassy_dhcp.rs
    @@ -148,7 +150,7 @@ async fn task(stack: &'static Stack<WifiDevice<'static>>) {
     loop {
         Timer::after(Duration::from_millis(1_000)).await;

@@ -184,5 +186,6 @@ async fn task(stack: &'static Stack<WifiDevice<'static>>) { println!("{}", core::str::from_utf8(&buf[..n]).unwrap()); } Timer::after(Duration::from_millis(3000)).await;

bjoernQ commented 1 year ago

I tested with the blocking API and have the ESP32 connected for more than 17h now without problems. So, it's not really normal. I also had (some time ago) a temperature logging application running for several days without issues.

I wouldn't really expect using async to be the problem but maybe worth a try.

In general, the examples are kept simple - they don't do re-connects or re-tried connects.

Can you get some more information from your access-point's logs? (e.g. a disconnect reason)

What power source do you use for the ESP32? A bad power source can definitely cause WiFi issues

manio commented 1 year ago

@bjoernQ Thanks for your tips.

Most of the development time I am using a direct USB connection and the ESP-WROOM-32U is powered from the PC; but I was suspecting the power supply and I just connected the device once to the separated lab power supply with the same result. Nevertheless I think that I could give it another try...

When it stops working it also stops printing my debug lines from every async task (even not wifi-related) with println! on my espflash monitor, so my AP logs are probably not very useful, as I can clearly see on the console, that the device has stopped working.

First I think I'll try the blocking code and see if it helps. I also find some problem with the wifi init described in issue: https://github.com/esp-rs/esp-wifi/issues/214, so maybe this is some tip for me...

manio commented 1 year ago

I just started the regular (blocking) dhcp example... will see how long it will run.

btw: I was also trying to update to the latest esp-hal github version to see if this helps, so I updated my dependencies in Cargo to git = "https://github.com/esp-rs/esp-hal" instead of 0.13.0, but this is giving me the following error:

error: One of the feature flags must be provided: embassy-time-systick, embassy-time-timg0, 
   --> /home/manio/.cargo/registry/src/index.crates.io-6f17d22bba15001f/esp-hal-common-0.10.0/build.rs:27:9
    |
27  |         compile_error!(concat!("One of the feature flags must be provided: ", $($all, ", "),*));
    |         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
...
162 |         assert_unique_used_features!("embassy-time-systick", "embassy-time-timg0");
    |         -------------------------------------------------------------------------- in this macro invocation
    |
    = note: this error originates in the macro `assert_used_features` which comes from the expansion of the macro `assert_unique_used_features` (in Nightly builds, run with -Z macro-backtrace for more info)

Do you have any idea how to fix this? I still have the features set for the esp32-hal crate: features = ["embassy", "embassy-time-timg0", "async"]

manio commented 1 year ago

@jessebraham, @MabezDev: do you guys maybe know how to fix the above compilation error?

My complete dependencies set is currently:

[dependencies]
embassy-executor  = { version = "0.2.0", features = ["nightly", "integrated-timers", "arch-xtensa", "executor-thread"] }
static_cell = "1.1.0"    
esp-wifi = { git = "https://github.com/esp-rs/esp-wifi", features = ["esp32", "embassy-net", "embedded-svc", "wifi"] }
embassy-time = { version = "0.1.2",  features = ["nightly", "tick-hz-1_000_000"] }
embassy-net = { version = "0.1.0", features = [ "nightly", "tcp", "udp", "dhcpv4", "medium-ethernet" ] }
hal = { package = "esp32-hal", git = "https://github.com/esp-rs/esp-hal", features = ["embassy", "embassy-time-timg0", "async"] } 
esp-backtrace = { version = "0.7.0", features = ["esp32", "panic-handler", "exception-handler", "print-uart"] }
esp-println = { version = "0.5.0", features = ["esp32", "log"] }
embedded-svc = { version = "0.25.3", default-features = false }
embedded-hal-async = { version = "=0.2.0-alpha.2"} 
log = "0.4.16"
embedded-io = "0.4.0"
format_no_std = "1.0.0"
bjoernQ commented 1 year ago

I just started the regular (blocking) dhcp example... will see how long it will run.

btw: I was also trying to update to the latest esp-hal github version to see if this helps, so I updated my dependencies in Cargo to git = "https://github.com/esp-rs/esp-hal" instead of 0.13.0, but this is giving me the following error:

error: One of the feature flags must be provided: embassy-time-systick, embassy-time-timg0, 
   --> /home/manio/.cargo/registry/src/index.crates.io-6f17d22bba15001f/esp-hal-common-0.10.0/build.rs:27:9
    |
27  |         compile_error!(concat!("One of the feature flags must be provided: ", $($all, ", "),*));
    |         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
...
162 |         assert_unique_used_features!("embassy-time-systick", "embassy-time-timg0");
    |         -------------------------------------------------------------------------- in this macro invocation
    |
    = note: this error originates in the macro `assert_used_features` which comes from the expansion of the macro `assert_unique_used_features` (in Nightly builds, run with -Z macro-backtrace for more info)

Do you have any idea how to fix this? I still have the features set for the esp32-hal crate: features = ["embassy", "embassy-time-timg0", "async"]

It's better to patch the dependencies like

[patch.crates-io]
esp-hal-common = {  git = "https://github.com/esp-rs/esp-hal" }
esp32-hal = {  git = "https://github.com/esp-rs/esp-hal" }
esp32s3 = { git = "https://github.com/esp-rs/esp-pacs", rev = "ce6d5b5" }
manio commented 1 year ago

@bjoernQ Thanks, with this dependencies patch it is compiling fine :) I didn't test the binary though, as I am still testing the dhcp blocking example (so far so good... :)

manio commented 1 year ago

Ok - I've tested it during the whole night (blocking dhcp example):

--- esp ping statistics ---
53863 packets transmitted, 53856 received, 0.0129959% packet loss, time 53895027ms
rtt min/avg/max/mdev = 3.665/13.394/82.319/3.590 ms

almost 15h without any problems and it was still working :)

Then I switched back to embassy_dhcp and it hangs after 3h 43m.

Do you maybe have some "spare" ESP32 to test the embassy_dhcp on?

bjoernQ commented 1 year ago

Ok just repurposed that ESP32 for the embassy_dhcp test. The blocking code was still working after more than 45h

Let's see how that will work for me

bjoernQ commented 1 year ago

Ok ... seems it already died for me. So, I'm pretty sure there is some instability specific to the async side of things

manio commented 1 year ago

Oh - so this is good, that it is not specific to me only :)

I've just updated my main program using your [patch.crates-io], specifically esp-hal from 0.13.0 to git version, embassy-time from 0.1.1 to 0.1.2 and I also did cargo update to have fresh git version of esp-wifi and esp-hal*

plus one API change:

diff --git a/src/main.rs b/src/main.rs
index 842bfb7..955d17c 100644
--- a/src/main.rs
+++ b/src/main.rs
@@ -155,7 +155,7 @@ fn main() -> ! {
     )
     .unwrap();
     let (wifi, _) = peripherals.RADIO.split();
-    let (wifi_interface, controller) = esp_wifi::wifi::new_with_mode(&init, wifi, WifiMode::Sta);
+    let (wifi_interface, controller) = esp_wifi::wifi::new_with_mode(&init, wifi, WifiMode::Sta).unwrap();

     let timer_group0 = TimerGroup::new(peripherals.TIMG0, &clocks, &mut system.peripheral_clock_control);
     let mut wdt = timer_group0.wdt;

... and I am testing it - it died very shortly before that changes, but now it seems it is working - I'll see how long...

manio commented 1 year ago

Nope :( I was hoping that raising dependencies could fix this problem but after 5 hrs it also died :(

manio commented 1 year ago

Are we able to bisect it somehow? (I know it would be a pain).... or at least review a git log of the async stuff... We have the "bad" but what about the good one? This is maybe not a rule but I observed during my development, that adding more async tasks - the better chance to get the "freeze" sooner then later...

bjoernQ commented 1 year ago

Ok let's summarize what we know / don't know

What we know

What we don't know

cc @MabezDev @jessebraham any ideas or did you ever experience problems with async in general

bjoernQ commented 1 year ago

I setup an ESP32-C3 with the modified embassy_dhcp example - let us see if it runs into the same problem

bjoernQ commented 1 year ago

Seems the C3 also died - seems it lasted slightly longer for me but still just roughly two hours

bugadani commented 1 year ago

This probably isn't specific to esp-wifi. I've experienced random lockups before I added wifi to my project, but in my case my MCU is only ever running for a few minutes at a time. My last test ran for about 20 minutes before I stopped it, I can probably set up something longer term, too.

In my case, the problem seemed to go away by itself, but maybe it really didn't.

bjoernQ commented 1 year ago

This probably isn't specific to esp-wifi. I've experienced random lockups before I added wifi to my project, but in my case my MCU is only ever running for a few minutes at a time. My last test ran for about 20 minutes before I stopped it, I can probably set up something longer term, too.

In my case, the problem seemed to go away by itself, but maybe it really didn't.

Interesting! I tried running an async blinky with some additional tasks also just using timers but maybe that was too simple and worked for a few hours before I needed the dev-board for something else. If you find a way to repro a similar thing without esp-wifi that would be very interesting.

bugadani commented 1 year ago

an async blinky

My old hypothesis was that the firmware might be missing interrupts in some currently unexplained cases, so to me it's unlikely for a blinky to lock up. My use case is my ecg board on a charger, measuring noise at 1kHz and displaying it at 100Hz, which is probably reasonably hungry and it is also obvious at a glance if it locks up. I will report back if the thing locks up, or if I have to stop because it catches fire.

bugadani commented 1 year ago

One more thing: you'll want to use an embassy-executor that includes this PR on Xtensa: https://github.com/embassy-rs/embassy/pull/1451 The issue it points to have been repurposed since originally opening it, but it is a fix for a potential deadlock I was hitting quite often.

It doesn't explain why the C3 dies, though.

bugadani commented 1 year ago

I have stopped my test after 16 hours. Either the executor I'm using isn't prone to the lockup in this issue, or it really is specific to esp-wifi. While the issue with C3 seems to suggest the latter, it would still be awesome if you could retry an Xtensa reproducer with embassy-executor = "0.2.1", as it fixes a bug that may cause firmware to randomly freeze.

manio commented 1 year ago

On Thu, Aug 10, 2023 at 09:05:19PM -0700, Dániel Buga wrote:

I have stopped my test after 16 hours. Either the executor I'm using isn't prone to the lockup in this issue, or it really is specific to esp-wifi. While the issue with C3 seems to suggest the latter, it would still be awesome if you could retry an Xtensa reproducer with embassy-executor = "0.2.1", as it fixes a bug that may cause firmware to randomly freeze. I thing I could try this later as I'm working in office today...

manio commented 1 year ago

Ok guys, I just want to put an update here:

Till now I was testing the blocking dhcp example to make sure if it really don't freeze at some point. And indeed it looks it is working perfectly fine. My ping stats are:

--- esp ping statistics ---
195443 packets transmitted, 195412 received, 0.0158614% packet loss, time 195565305ms
rtt min/avg/max/mdev = 3.750/14.138/356.586/4.744 ms

Which means it was running fine for over 2 days and 6 hours until I stopped it now.

I was trying to run the embassy_dhcp example with embassy-executor set to 0.2.1 as @bugadani suggests, but cargo screams at me about dependency problem (probably with some other crates, as it's stick to specific version in the main esp-wifi repository); so I just upgraded it in my own project (which has more recent git revisions of dependent crates) and It compiled fine.

So now I am testing my program based on embassy_dhcp + additional task which is reading/writing the UART...

manio commented 1 year ago

Hi, It look like you tracked this problem down! :) My ESP32 is still working, it is about 14h now. I think this indeed was a problem. Thank you very much, guys! I'm ok if you close this.

@bugadani Just curious: the https://github.com/embassy-rs/embassy/pull/1451 was for a regression or this problem was there from the beginning?

bugadani commented 1 year ago

The problem was there from the start, I believe. The code was perfectly good for single-core MCUs, but some of the particulars of a dual-core (like the original ESP32, or the S3) weren't considered for reasons that are probably lost in time at this point. This isn't exactly unexpected. What is somewhat surprising to me, though, is that it took almost a year to recognise and diagnose it. (And another 3 months to release the fix, but that's mostly my forgetfulness 🙈)

etiennetremel commented 1 year ago

Upgrading embassy-executor to v0.2.1 fixed the instability for me. @bugadani, thanks for pointing out the solution!