espressif / esp-idf

Espressif IoT Development Framework. Official development framework for Espressif SoCs.
Apache License 2.0
13.76k stars 7.3k forks source link

BT Nimble LE pairing takes too long - watchdog timer timeout (IDFGH-13572) #14460

Closed eriksl closed 1 month ago

eriksl commented 2 months ago

Answers checklist.

IDF version.

v5.3

Espressif SoC revision.

ESP32-S3

Operating System used.

Linux

How did you build your project?

Command line with idf.py

If you are using Windows, please specify command line type.

None

Development Kit.

Irrelevant here

Power Supply used.

USB

What is the expected behavior?

I expect the pairing to complete, as it has done previously, using both Linux and Android host device.

What is the actual behavior?

Watchdog timer timeout during. The timeout does occur deep inside the IDF code, somewhere in mbedtls, it's not in my code.

Steps to reproduce.

  1. Setup BT GATT on ESP32-S3
  2. Try to pair (with no pre-existing bonds)
  3. Wait long, over 5 seconds.

Debug Logs.

No response

More Information.

When I increase the watchdog timer timeout to 10 seconds, the issue is gone. Well, it's worked around. I don't think a) pairing should take this long b) even if it takes so long, the watchdog timer is never fed

rahult-github commented 2 months ago

Hello @eriksl ,

Can you possibly give more information on how to reproduce the issue .

Setup BT GATT on ESP32-S3 Try to pair (with no pre-existing bonds) Wait long, over 5 seconds.

  1. By setup BT GATT , which application are you running on S3 , is it bleprph ?
  2. Try to pair --> Who is initiating the pairing and by which method ? ( e.g. Bond option on third party phone like nrfconnect ? Am assuming no change is done in default IOCAP on S3 side ? Please share your sdkconfig for this .
  3. Wait long, over 5 seconds --> What is the remote device being used ? Is there no event received on S3 side. Can you please enable Nimble Debug logs and share console log too ? The way to enable debug logs :

Component config → Log output → Default log verbosity --> Set to Debug Component config → Bluetooth → NimBLE Options → NimBLE Host log verbosity --> Set to Debug

eriksl commented 2 months ago

I am running my own application, https://github.com/eriksl/esp32 but not all required information to set up Nimble is available in the documentation and I had to guess some things from the examples like blerph. I will make sure the repository on github is up to date at the end of the day, right now it's a bit outdated due to holidays and a major overhaul + IDF upgrade.

The ESP32 is the peripheral, the Linux host is the central. The central initiates the pairing. It uses whatever method the peripheral offers. To rule out an issue with Bluez I also tried pairing using Android and that gives exactly the same result. Please note, you mention nRF, but the issue is way before the point where an application can connect. It's really during pairing.

IOCAP of course hasn't changed. Actually none of the BT setup code has changed.

The remote is waiting for the S3, not the other way around! The S3 seems to be busy with "something" and finally gives up when the watchdog timer intervenes.

I don't think the debug log will yield anything useful, but the backtrace from the watchdog timer will be very informative I think. The intervention always takes place at the same point in the code, so it looks like the processor stalls there for some reason. Look like some function to handle hardware offload from within MBEDTLS.

I'll attach the relevant code on the ESP32-S3 code, the stacktrace and the config.

eriksl commented 2 months ago

Relevant sources. bluetooth-src.zip

eriksl commented 2 months ago

sdkconfig sdkconfig.zip

eriksl commented 2 months ago

Error log (stderr and stdout combined)

E (54650) task_wdt: Task watchdog got triggered. The following tasks/users did not reset the watchdog in time: E (54650) task_wdt: - IDLE0 (CPU 0) E (54650) task_wdt: Tasks currently running: E (54650) task_wdt: CPU 0: nimble_host E (54650) task_wdt: CPU 1: IDLE1 E (56138) task_wdt: Aborting. E (56138) task_wdt: Print CPU 0 backtrace

Two backtraces, slightly different.

One

0x4209f8f6: mbedtls_mpi_core_sub at /home/erik/.espressif/esp-idf/components/mbedtls/mbedtls/library/bignum_core.c:451 (discriminator 1)
0x42052889: mbedtls_mpi_sub_abs at /home/erik/.espressif/esp-idf/components/mbedtls/mbedtls/library/bignum.c:1089
0x420528f3: add_sub_mpi at /home/erik/.espressif/esp-idf/components/mbedtls/mbedtls/library/bignum.c:1121
0x42052956: mbedtls_mpi_sub_mpi at /home/erik/.espressif/esp-idf/components/mbedtls/mbedtls/library/bignum.c:1154
0x42054375: mbedtls_mpi_sub_mod at /home/erik/.espressif/esp-idf/components/mbedtls/mbedtls/library/ecp.c:1084
0x42054812: ecp_double_jac at /home/erik/.espressif/esp-idf/components/mbedtls/mbedtls/library/ecp.c:1512
0x42055135: ecp_precompute_comb at /home/erik/.espressif/esp-idf/components/mbedtls/mbedtls/library/ecp.c:1919
 (inlined by) ecp_mul_comb at /home/erik/.espressif/esp-idf/components/mbedtls/mbedtls/library/ecp.c:2354
 (inlined by) ecp_mul_restartable_internal at /home/erik/.espressif/esp-idf/components/mbedtls/mbedtls/library/ecp.c:2688
0x42055623: mbedtls_ecp_mul_restartable at /home/erik/.espressif/esp-idf/components/mbedtls/mbedtls/library/ecp.c:2722
0x42054258: ecdh_compute_shared_restartable at /home/erik/.espressif/esp-idf/components/mbedtls/mbedtls/library/ecdh.c:105
0x42054296: mbedtls_ecdh_compute_shared at /home/erik/.espressif/esp-idf/components/mbedtls/mbedtls/library/ecdh.c:129
0x4201fef5: ble_sm_alg_gen_dhkey at /home/erik/.espressif/esp-idf/components/bt/host/nimble/nimble/nimble/host/src/ble_sm_alg.c:578
0x42018926: ble_sm_sc_public_key_rx at /home/erik/.espressif/esp-idf/components/bt/host/nimble/nimble/nimble/host/src/ble_sm_sc.c:636
0x42013865: ble_sm_rx at /home/erik/.espressif/esp-idf/components/bt/host/nimble/nimble/nimble/host/src/ble_sm.c:2952
0x420149bb: ble_hs_hci_evt_acl_process at /home/erik/.espressif/esp-idf/components/bt/host/nimble/nimble/nimble/host/src/ble_hs_hci_evt.c:1195
0x42014110: ble_hs_process_rx_data_queue at /home/erik/.espressif/esp-idf/components/bt/host/nimble/nimble/nimble/host/src/ble_hs.c:248
0x4201411b: ble_hs_event_rx_data at /home/erik/.espressif/esp-idf/components/bt/host/nimble/nimble/nimble/host/src/ble_hs.c:545
0x4209e9ab: npl_freertos_event_run at /home/erik/.espressif/esp-idf/components/bt/host/nimble/nimble/porting/npl/freertos/src/npl_os_freertos.c:451
0x4037a969: ble_npl_event_run at /home/erik/.espressif/esp-idf/components/bt/host/nimble/nimble/porting/npl/freertos/include/nimble/nimble_npl_os.h:185
 (inlined by) nimble_port_run at /home/erik/.espressif/esp-idf/components/bt/host/nimble/nimble/porting/nimble/src/nimble_port.c:308
0x4200dff9: nimble_port_task at /data/src/intern/esp/esp32/s3/develop/main/bt.c:94

Two

0x4209f86a: mbedtls_mpi_core_bitlen at /home/erik/.espressif/esp-idf/components/mbedtls/mbedtls/library/bignum_core.c:65
0x420526a5: mbedtls_mpi_bitlen at /home/erik/.espressif/esp-idf/components/mbedtls/mbedtls/library/bignum.c:474
0x420571fd: mbedtls_mpi_mul_mpi at /home/erik/.espressif/esp-idf/components/mbedtls/port/bignum/esp_bignum.c:490
0x420573cb: mbedtls_mpi_mul_int at /home/erik/.espressif/esp-idf/components/mbedtls/port/bignum/esp_bignum.c:572
0x42052c41: mbedtls_mpi_div_mpi at /home/erik/.espressif/esp-idf/components/mbedtls/mbedtls/library/bignum.c:1475
0x42052cdb: mbedtls_mpi_mod_mpi at /home/erik/.espressif/esp-idf/components/mbedtls/mbedtls/library/bignum.c:1543
0x420543c2: ecp_modp at /home/erik/.espressif/esp-idf/components/mbedtls/mbedtls/library/ecp.c:1004
 (inlined by) mbedtls_mpi_mul_mod at /home/erik/.espressif/esp-idf/components/mbedtls/mbedtls/library/ecp.c:1062
0x420548ed: ecp_double_jac at /home/erik/.espressif/esp-idf/components/mbedtls/mbedtls/library/ecp.c:1551
0x42055135: ecp_precompute_comb at /home/erik/.espressif/esp-idf/components/mbedtls/mbedtls/library/ecp.c:1919
 (inlined by) ecp_mul_comb at /home/erik/.espressif/esp-idf/components/mbedtls/mbedtls/library/ecp.c:2354
 (inlined by) ecp_mul_restartable_internal at /home/erik/.espressif/esp-idf/components/mbedtls/mbedtls/library/ecp.c:2688
0x42055623: mbedtls_ecp_mul_restartable at /home/erik/.espressif/esp-idf/components/mbedtls/mbedtls/library/ecp.c:2722
0x42054258: ecdh_compute_shared_restartable at /home/erik/.espressif/esp-idf/components/mbedtls/mbedtls/library/ecdh.c:105
0x42054296: mbedtls_ecdh_compute_shared at /home/erik/.espressif/esp-idf/components/mbedtls/mbedtls/library/ecdh.c:129
0x4201fef5: ble_sm_alg_gen_dhkey at /home/erik/.espressif/esp-idf/components/bt/host/nimble/nimble/nimble/host/src/ble_sm_alg.c:578
0x42018926: ble_sm_sc_public_key_rx at /home/erik/.espressif/esp-idf/components/bt/host/nimble/nimble/nimble/host/src/ble_sm_sc.c:636
0x42013865: ble_sm_rx at /home/erik/.espressif/esp-idf/components/bt/host/nimble/nimble/nimble/host/src/ble_sm.c:2952
0x420149bb: ble_hs_hci_evt_acl_process at /home/erik/.espressif/esp-idf/components/bt/host/nimble/nimble/nimble/host/src/ble_hs_hci_evt.c:1195
0x42014110: ble_hs_process_rx_data_queue at /home/erik/.espressif/esp-idf/components/bt/host/nimble/nimble/nimble/host/src/ble_hs.c:248
0x4201411b: ble_hs_event_rx_data at /home/erik/.espressif/esp-idf/components/bt/host/nimble/nimble/nimble/host/src/ble_hs.c:545
0x4209e9ab: npl_freertos_event_run at /home/erik/.espressif/esp-idf/components/bt/host/nimble/nimble/porting/npl/freertos/src/npl_os_freertos.c:451
0x4037a969: ble_npl_event_run at /home/erik/.espressif/esp-idf/components/bt/host/nimble/nimble/porting/npl/freertos/include/nimble/nimble_npl_os.h:185
 (inlined by) nimble_port_run at /home/erik/.espressif/esp-idf/components/bt/host/nimble/nimble/porting/nimble/src/nimble_port.c:308
0x4200dff9: nimble_port_task at /data/src/intern/esp/esp32/s3/develop/main/bt.c:94
rahult-github commented 2 months ago

Hi @eriksl ,

the backtrace shows mbedtls functions which , i guess, are causing infinite loop and hence the issue.

Can you please share the way to compile the code against IDF SDK you have shared ( either the bluetooth-src.zip or even the github repo) . I tried few ways like making folder inside IDF or exporting paths, but it fails.

I took the init content from bt.c and replicated it in SDK sample bleprph example. I tested against phone and bluez and both worked fine. So, if you can share your compilation setup against IDF SDK to get the application compiled in its entirety , it would help us to follow the same steps and debug more by reproducing issue locally.

idf.txt

bluez.txt

Changed example file for reference: main.txt

eriksl commented 2 months ago

My github repository is now completely up to date and should build cleanly now.

https://github.com/eriksl/esp32

Steps to build and test:

That should be sufficient. If there are any difficulties, please let me know and I will fix it.

Build as is now, it should work. If you change the watchdog timer back to the default of 6 seconds, the watchdog timer will go off.

Thank you for looking into this!

Please note there is no endlessly looping code (either in my app or in mbedtls) because the pairing will succeed if given enough time (set watchdog timer to 10 seconds).

eriksl commented 2 months ago

Update, please clone this repository again or pull it (although cloning is probably safest). I have managed to make it ESP32-sub-type and board-type independent without the symlinks kludge. It now looks like a completely normal project, with just two extra menuconfig entries. The board selection currently does not have any influence on the BT code, so you can choose any of the two you like.

I don't expect any building issues now.

rahult-github commented 2 months ago

Compilation is ok now . Thanks.

but the chip fails to boot.

--- esp-idf-monitor 1.4.0 on /dev/ttyUSB0 115200 --- --- Quit: Ctrl+] | Menu: Ctrl+T | Help: Ctrl+T followed by Ctrl+H --- ESP-ROM:esp32s3-20210327 Build:Mar 27 2021 rst:0x1 (POWERON),boot:0x8 (SPI_FAST_FLASH_BOOT) SPIWP:0xee mode:DIO, clock div:1 load:0x3fce2820,len:0x1988 load:0x403c8700,len:0x4 load:0x403c8704,len:0xda0 load:0x403cb700,len:0x3518 entry 0x403c8918

The menuconfig has many changes compared to default example sdkconfig. I will try to figure out which option is causing this and check more

eriksl commented 2 months ago

Interesting. It will probably have to do with me using a Wemos Lolin S3-mini that has a S3-H4R2 (4 MB flash and 2 MB SPIRAM on-package, quad) and I have them "hard" configured in the sdkconfig, I am afraid it can't be done otherwise without issues. If you can get a similar setup somewhere, I guess it could work. Maybe you need to the chip type to autodetect.

If this is a problem I may be able to make a version that runs without SPIRAM.

Sorry for the inconvencience.

eriksl commented 2 months ago

This may have to do with me having all MBEDTLS memory allocated from SPIRAM (quad, 80 MHz). Apparently the calculation being performed is kind of heavy, possibly heavy on RAM as well.

The routine is NOT stuck in a loop, it will finish eventually, but no sooner than after ten seconds...

rahult-github commented 2 months ago

This may have to do with me having all MBEDTLS memory allocated from SPIRAM

Ok. This can be possible. Do you observe issue if SPIRAM support is disabled and internal memory is used ?

eriksl commented 2 months ago

I can test this partially. I have remove the bonding from the central and then there is no problem. I can't remove the bonding from the client though, I would need to invalidate all of the NVS and it would also destroy all of my configuration variables.

May I run into this I will test again.

rahult-github commented 1 month ago

Closing this issue currently. Feel free to reopen in case of any further updates.