espressif / esp-idf

Espressif IoT Development Framework. Official development framework for Espressif SoCs.
Apache License 2.0
13.89k stars 7.33k forks source link

BT Controller - Stops Scanning or responding after random amount of time (IDFGH-1781) #4001

Open 0neblock opened 5 years ago

0neblock commented 5 years ago

Brief

I have been having a problem with the Bluedroid BT Controller Scanning function for a few weeks now, and after trying many different things, I am stuck and am not sure what else I can try. The crux of the issue is that the BLE Scanning feature will work for a large amount of time - up to 3 days, then just fail silently, with the whole BT controller seemingly shutting down.

Problem Description

I have a BLE Scanning app that is working well for the most part. It spends most of its time performing an active scan for other BLE devices that are advertising a service UUID and some custom manufacturer data. It receives an advertisement from a sensor around every 1 second, but I can have anywhere from 1-10 sensors within range at any one time.

After a completely random period of time, sometimes 20 minutes, sometimes 3 days. The App will stop receiving ESP_GAP_SEARCH_INQ_RES_EVT events from the bt layer, even though it should still be receiving advertisements form multiple devices, with no indication from any underlying BT Controller debugging that anything has happened. This happens no matter how many sensors I have within range of the ESP, advertising the device, it even happens when I have no sensors advertising, and the general BVLE background advertisements are relatively low.

The free heap memory of the app stays the same (~140kB free memory at any one time), so I can rule out a memory leak on the app side, and the rest of the application keeps running normally, albeit with more computation time from the RTOS (indicated by a loop counter that increases when this error happens), So clearly some of the BT Tasks have stopped running.

When the error happens, I can also see that the ESP itself DOES STOP performing Active scanning, as The sensors I use flash an LED whenever they receive a SCAN_REQUEST from the ESP32 Hardware MAC Address, and this stops happening as soon as the error starts.

If I try and recover from the error, by issuing a command such as esp_ble_gap_start_scanning() - which responds ESP_OK, I get a HCI timeout error printed: BT_HCI: command_timed_out hci layer timeout waiting for response to a command. opcode: 0x200c. At the moment, trying to perform a bt command after the error, and getting this response, is the only indication from the application that something has gone wrong.

I am not using any WiFi functions, so to reduce memory footprint and file size, I have changed the linker script to only include the following libraries in the component.mk of esp32: core rtc phy instead of the usual: core rtc net80211 pp wpa smartconfig coexist wps wpa2 espnow phy mesh

Coredump

coredump This is a coredump taken about 20 minutes after the error occured. I forced this core dump to log by deliberately throwing an IntegerDivideByZero Exception in another task. My hope here is that it saved the task state of the BT tasks, which your team can use internally to see the task state. If you require My APP ELF I can provide this by email.

Debug Log

This is a log showing the lack of errors I receive when the error happens. As you can see, the application was running for 2.5 days before the error occured. The 'BMS' TAG is my application, and the 'Scanning started' and 'scanning stopped' logs are when my app receives the ESP_GAP_SEARCH_INQ_CMPL_EVT and ESP_GAP_BLE_SCAN_START_COMPLETE_EVT events respectively. In this application, I start a esp_ble_gap_start_scanning operation of 30 seconds, and when I receive a ESP_GAP_SEARCH_INQ_CMPL_EVT event, i set a flag to restart the esp_ble_gap_start_scanning of 30 seconds, in a cycle. Although as discussed later, I have tried changing this interval to anywhere from 30 seconds to 5 minutes, and I have also tried setting the interval to 0 for unlimited, so I only call the start_scan once. In this instance, my pplication received the ESP_GAP_SEARCH_INQ_CMPL_EVT event, so set a flag internally to call esp_ble_gap_start_scanning(30) again, which responded with ESP_OK, but I never received the ESP_GAP_BLE_SCAN_START_COMPLETE_EVT, and about 8 seconds later, I see an error log of command timeout.

I (210356607) BMS[scanning]: Scanning Stopped
I (210356607) BMS[scanning]: Scanning started
I (210363327) main: HEAP - free: 140560, largest_block: 98108 | PSRAM - free: 3451316, used: 202360, attempted: 981230 | pps: 21, lps: 248
I (210373357) main: HEAP - free: 140560, largest_block: 98108 | PSRAM - free: 3451316, used: 202360, attempted: 981230 | pps: 20, lps: 249
I (210383367) main: HEAP - free: 140560, largest_block: 98108 | PSRAM - free: 3451316, used: 202360, attempted: 981230 | pps: 19, lps: 247
I (210386607) BMS[scanning]: Scanning Stopped
I (210386617) BMS[scanning]: Scanning started
I (210393387) main: HEAP - free: 140560, largest_block: 98108 | PSRAM - free: 3451316, used: 202360, attempted: 981230 | pps: 21, lps: 248
I (210403407) main: HEAP - free: 140560, largest_block: 98108 | PSRAM - free: 3451316, used: 202360, attempted: 981230 | pps: 19, lps: 248
I (210413457) main: HEAP - free: 140560, largest_block: 98108 | PSRAM - free: 3451316, used: 202360, attempted: 981230 | pps: 19, lps: 249
I (210416617) BMS[scanning]: Scanning Stopped
I (210423487) main: HEAP - free: 140560, largest_block: 98108 | PSRAM - free: 3451220, used: 202452, attempted: 981230 | pps: 17, lps: 250
E (210424617) BT_HCI: command_timed_out hci layer timeout waiting for response to a command. opcode: 0x200c
I (210433497) main: HEAP - free: 140560, largest_block: 98108 | PSRAM - free: 3451220, used: 202452, attempted: 981230 | pps: 25, lps: 250

sdkconfig

sdkconfig

Scanning Configuration Used

These are the configurations currently in use, but as you'll see below I have tried many different

static esp_ble_scan_params_t logging_ble_scan_params = {
    .scan_type              = BLE_SCAN_TYPE_ACTIVE,
    .own_addr_type          = BLE_ADDR_TYPE_PUBLIC,
    .scan_filter_policy     = BLE_SCAN_FILTER_ALLOW_ALL, 
    .scan_interval          = 0x100,
    .scan_window            = 0x100,
    .scan_duplicate         = BLE_SCAN_DUPLICATE_DISABLE
};

Changes Attempted

Below is a list of sdkconfig changes of application setup/operation changes that I have tried, with no success , the same thing occurs.

If there is anything else I should try, please let me know.

Apologies for the large Github issue, this error has been troubling me for some time and I would like to know what I can try next. Thank you.

Environment

Key Value
Development Kit Custom Board
Module or chip used ESP32-WROVER-32
IDF version 91f29bef172a082cbd8f0208ed1757ede0e1d635 - tracking release/v3.3
Build System Make
Compiler version crosstool-ng-1.22.0-80-g6c4433a
Operating System macOS
Power Supply external 3.3V
Rokachy commented 2 years ago

We are using 4.3.1 and we are not seeing the issue

On Thu, Dec 30, 2021, 18:53 Juan Ávila @.***> wrote:

Hi! Any update about this? I noticed this same issue when using BLE + Classic (v4.4-beta1). Thanks!

— Reply to this email directly, view it on GitHub https://github.com/espressif/esp-idf/issues/4001#issuecomment-1003106488, or unsubscribe https://github.com/notifications/unsubscribe-auth/AS7K4PPKJ26AMQ5UEROFOM3UTSE7LANCNFSM4ITBESYQ . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

You are receiving this because you were mentioned.Message ID: @.***>

juanaviladev commented 2 years ago

We are using 4.3.1 and we are not seeing the issue

Are you using bluedroid (with BTDM mode)? With only ble scan I can't reproduce it

I'm doing name requests (read_remote_name) and at the same time a ble scan

Rokachy commented 2 years ago

We are using ble scan.

On Fri, Jan 7, 2022, 12:48 Juan Ávila @.***> wrote:

We are using 4.3.1 and we are not seeing the issue … <#m6376715294096464826> On Thu, Dec 30, 2021, 18:53 Juan Ávila @.> wrote: Hi! Any update about this? I noticed this same issue when using BLE + Classic (v4.4-beta1). Thanks! — Reply to this email directly, view it on GitHub <#4001 (comment) https://github.com/espressif/esp-idf/issues/4001#issuecomment-1003106488>, or unsubscribe https://github.com/notifications/unsubscribe-auth/AS7K4PPKJ26AMQ5UEROFOM3UTSE7LANCNFSM4ITBESYQ https://github.com/notifications/unsubscribe-auth/AS7K4PPKJ26AMQ5UEROFOM3UTSE7LANCNFSM4ITBESYQ . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub. You are receiving this because you were mentioned.Message ID: @.>

Are you using bluedroid (with BTDM mode)? With only ble scan I can't reproduce it

— Reply to this email directly, view it on GitHub https://github.com/espressif/esp-idf/issues/4001#issuecomment-1007312832, or unsubscribe https://github.com/notifications/unsubscribe-auth/AS7K4PI44ASQV3SNT4M75BTUU3AI7ANCNFSM4ITBESYQ . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

You are receiving this because you were mentioned.Message ID: @.***>

ktonder commented 2 years ago

Using the latest Arduino ESP32, v 2.0.2 that uses idf 4.4-alpha1 I also get the error "BT_HCI: command_timed_out hci layer timeout waiting for response to a command. opcode: 0x200c"

When this happens I also see that the application reconnects to WiFi as I have WiFi / BLE co-existence. BLE scanning for 1 second at the time, then a short break before I start scanning again.

juanaviladev commented 2 years ago

Using the latest Arduino ESP32, v 2.0.2 that uses idf 4.4-alpha1 I also get the error "BT_HCI: command_timed_out hci layer timeout waiting for response to a command. opcode: 0x200c"

When this happens I also see that the application reconnects to WiFi as I have WiFi / BLE co-existence. BLE scanning for 1 second at the time, then a short break before I start scanning again.

+1, although the changelog includes this bug fix, it hasn't been resolved in the latest version yet. Also, since the bluetooth controller code is not open source, it is not possible to do anything on our side...

HeFeng1947 commented 2 years ago

Unfortunately, I seem to have the same problem. My idf version is v4.2 c327a0016eaea7b3fd54fb00a9d88211b70b4643 I have enabled ble wifi coexistence.Bluetooth scans all the time and sends it to the cloud server over wifi.

juanaviladev commented 2 years ago

v4.4.1, has the same problem. BLE/Classic BT + Wifi still unusable in a prod environment.

Zimbu98 commented 2 years ago

I had the same problem for months and finally solved it by adjusting the scan interval and scan window. I think the default might be 100/99 which gives very little resources for the ESP to "take out the trash" and "wash the dishes" when you are running a continuous scan like I am.

I changed it to 800/750 and now suddenly the scanner runs and runs without crapping out. Amazing! You can use different numbers I assume, but just be sure to leave some gap between the two numbers so the scanner will take a break and allow the ESP to do some other housekeeping. (I am just speculating that the housekeeping issue is the root of the problem; it might also be that the scanner is overheating or overeating or over somethinging.)

FYI, I had the same problem with the ESP32-WROOM-32U and also with the ESP32-C3. Changing to 800/750 solved it for both of them.

pschlan commented 2 years ago

Are you sure it's really solving it and not just making it less likely? Until the root cause for this is analyzed and fixed, I can't really trust the ESP in a production device when it comes to BT.

freemansu commented 2 years ago

V4.4.3 has the same problem.

thorrak commented 1 year ago

v4.4.4 (via the latest Arduino framework) has the same problem.

My gap between the interval and window is quite wide as per the suggestion by @Zimbu98 - unfortunately this had no effect.