BT Controller - Stops Scanning or responding after random amount of time (IDFGH-1781)

0neblock commented 5 years ago

Brief

I have been having a problem with the Bluedroid BT Controller scanning function for a few weeks now. After trying many different things, I am stuck and not sure what else to try. The crux of the issue is that BLE scanning works for a long time (up to 3 days), then fails silently, with the whole BT controller seemingly shutting down.

Problem Description

I have a BLE Scanning app that is working well for the most part. It spends most of its time performing an active scan for other BLE devices that are advertising a service UUID and some custom manufacturer data. It receives an advertisement from a sensor around every 1 second, but I can have anywhere from 1-10 sensors within range at any one time.

After a completely random period of time, sometimes 20 minutes, sometimes 3 days, the app stops receiving ESP_GAP_SEARCH_INQ_RES_EVT events from the BT layer, even though it should still be receiving advertisements from multiple devices, with no indication from any underlying BT controller debugging that anything has happened. This happens no matter how many sensors I have advertising within range of the ESP. It even happens when I have no sensors advertising and the general BLE background advertisement traffic is relatively low.

The free heap memory of the app stays the same (~140kB free at any one time), so I can rule out a memory leak on the app side. The rest of the application keeps running normally, albeit with more computation time available from the RTOS (indicated by a loop counter that increases when this error happens), so clearly some of the BT tasks have stopped running.

When the error happens, I can also see that the ESP itself DOES STOP performing active scanning: the sensors I use flash an LED whenever they receive a SCAN_REQUEST from the ESP32 hardware MAC address, and this stops happening as soon as the error starts.

If I try to recover from the error by issuing a command such as esp_ble_gap_start_scanning() (which returns ESP_OK), I get an HCI timeout error printed: BT_HCI: command_timed_out hci layer timeout waiting for response to a command. opcode: 0x200c. At the moment, trying to perform a BT command after the error and getting this response is the only indication to the application that something has gone wrong.
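
For readers unfamiliar with HCI opcodes: a command opcode packs a 6-bit OGF (group) above a 10-bit OCF (command). Decoding 0x200c gives OGF 0x08 (LE Controller commands) and OCF 0x000C, which the Bluetooth Core Specification assigns to LE Set Scan Enable, so the timeout suggests the controller never answered the request to start scanning. A minimal sketch of that decoding follows; the helper and macro names are illustrative and are not part of ESP-IDF.

```c
#include <assert.h>
#include <stdint.h>

/* HCI command opcode layout: bits 15..10 = OGF, bits 9..0 = OCF. */
#define HCI_OGF_LE_CONTROLLER      0x08u    /* LE Controller command group */
#define HCI_OCF_LE_SET_SCAN_ENABLE 0x000Cu  /* LE Set Scan Enable */

/* Extract the OpCode Group Field (hypothetical helper name). */
static inline uint8_t hci_opcode_ogf(uint16_t opcode)
{
    return (uint8_t)(opcode >> 10);
}

/* Extract the OpCode Command Field (hypothetical helper name). */
static inline uint16_t hci_opcode_ocf(uint16_t opcode)
{
    return (uint16_t)(opcode & 0x03FFu);
}

/* Rebuild an opcode from its two fields (hypothetical helper name). */
static inline uint16_t hci_opcode_make(uint8_t ogf, uint16_t ocf)
{
    assert(ogf <= 0x3Fu && ocf <= 0x03FFu);
    return (uint16_t)(((uint16_t)ogf << 10) | ocf);
}
```

With these helpers, the opcode from the log splits into `HCI_OGF_LE_CONTROLLER` and `HCI_OCF_LE_SET_SCAN_ENABLE`.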

I am not using any WiFi functions, so to reduce memory footprint and file size, I have changed the linker script to only include the following libraries in the component.mk of esp32: core rtc phy instead of the usual: core rtc net80211 pp wpa smartconfig coexist wps wpa2 espnow phy mesh

Coredump

coredump: This is a coredump taken about 20 minutes after the error occurred. I forced this core dump by deliberately throwing an IntegerDivideByZero exception in another task. My hope is that it saved the state of the BT tasks, which your team can use internally. If you require my app ELF, I can provide it by email.

Debug Log

This log shows the lack of errors when the failure happens. As you can see, the application had been running for about 2.5 days before the error occurred. The 'BMS' tag is my application, and the 'Scanning started' and 'Scanning Stopped' logs are printed when my app receives the ESP_GAP_BLE_SCAN_START_COMPLETE_EVT and ESP_GAP_SEARCH_INQ_CMPL_EVT events respectively. In this application, I start an esp_ble_gap_start_scanning operation of 30 seconds, and when I receive an ESP_GAP_SEARCH_INQ_CMPL_EVT event, I set a flag to restart esp_ble_gap_start_scanning for another 30 seconds, in a cycle. As discussed later, I have tried changing this duration to anywhere from 30 seconds to 5 minutes, and I have also tried setting it to 0 for unlimited, so that I only call start_scan once. In this instance, my application received the ESP_GAP_SEARCH_INQ_CMPL_EVT event and set the flag internally to call esp_ble_gap_start_scanning(30) again, which responded with ESP_OK, but I never received the ESP_GAP_BLE_SCAN_START_COMPLETE_EVT, and about 8 seconds later I see a command timeout error in the log.

I (210356607) BMS[scanning]: Scanning Stopped
I (210356607) BMS[scanning]: Scanning started
I (210363327) main: HEAP - free: 140560, largest_block: 98108 | PSRAM - free: 3451316, used: 202360, attempted: 981230 | pps: 21, lps: 248
I (210373357) main: HEAP - free: 140560, largest_block: 98108 | PSRAM - free: 3451316, used: 202360, attempted: 981230 | pps: 20, lps: 249
I (210383367) main: HEAP - free: 140560, largest_block: 98108 | PSRAM - free: 3451316, used: 202360, attempted: 981230 | pps: 19, lps: 247
I (210386607) BMS[scanning]: Scanning Stopped
I (210386617) BMS[scanning]: Scanning started
I (210393387) main: HEAP - free: 140560, largest_block: 98108 | PSRAM - free: 3451316, used: 202360, attempted: 981230 | pps: 21, lps: 248
I (210403407) main: HEAP - free: 140560, largest_block: 98108 | PSRAM - free: 3451316, used: 202360, attempted: 981230 | pps: 19, lps: 248
I (210413457) main: HEAP - free: 140560, largest_block: 98108 | PSRAM - free: 3451316, used: 202360, attempted: 981230 | pps: 19, lps: 249
I (210416617) BMS[scanning]: Scanning Stopped
I (210423487) main: HEAP - free: 140560, largest_block: 98108 | PSRAM - free: 3451220, used: 202452, attempted: 981230 | pps: 17, lps: 250
E (210424617) BT_HCI: command_timed_out hci layer timeout waiting for response to a command. opcode: 0x200c
I (210433497) main: HEAP - free: 140560, largest_block: 98108 | PSRAM - free: 3451220, used: 202452, attempted: 981230 | pps: 25, lps: 250

sdkconfig

Scanning Configuration Used

These are the configurations currently in use, but as you'll see below I have tried many different values.

static esp_ble_scan_params_t logging_ble_scan_params = {
    .scan_type              = BLE_SCAN_TYPE_ACTIVE,
    .own_addr_type          = BLE_ADDR_TYPE_PUBLIC,
    .scan_filter_policy     = BLE_SCAN_FILTER_ALLOW_ALL, 
    .scan_interval          = 0x100,
    .scan_window            = 0x100,
    .scan_duplicate         = BLE_SCAN_DUPLICATE_DISABLE
};
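
For reference, scan_interval and scan_window are expressed in units of 0.625 ms, so 0x100 corresponds to 160 ms, and setting both to the same value means the radio listens for the whole interval. A small sketch of the conversion, with hypothetical helper names that are not part of ESP-IDF:

```c
#include <assert.h>
#include <stdint.h>

/* BLE scan interval and window are given in units of 0.625 ms (625 us). */
static inline uint32_t ble_scan_units_to_us(uint16_t units)
{
    return (uint32_t)units * 625u;
}

/* Percentage of each scan interval spent listening (window / interval). */
static inline uint32_t ble_scan_duty_percent(uint16_t window, uint16_t interval)
{
    /* The window must not be longer than the interval. */
    assert(interval != 0 && window <= interval);
    return ((uint32_t)window * 100u) / interval;
}
```

For the parameters above, `ble_scan_units_to_us(0x100)` is 160000 us and the duty cycle is 100%.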

Changes Attempted

Below is a list of sdkconfig changes and application setup/operation changes that I have tried, with no success; the same thing occurs.

Unlimited scanning timeout - when done this way, I never even get an indication of hci_layer_timeout, because I never try to run another esp_ble_gap_start_scanning call, so the BT controller fails silently
Turning CONFIG_BLUEDROID_MEM_DEBUG ON - There is no debugging information around the time the error happens
CONFIG_BT_BLE_DYNAMIC_ENV_MEMORY ON/OFF
CONFIG_BT_ALLOCATION_FROM_SPIRAM_FIRST ON/OFF
CONFIG_BLE_HOST_QUEUE_CONGESTION_CHECK ON/OFF
Scanning interval/window change to many different values: 0x100/0x50, 0x200/0x50, 0x1000/0x100, and a few more.
180 second scanning timeout - This seems to make the error happen quicker, although that may be placebo as I only tested a few times.
Increase CONFIG_BTC_TASK_STACK_SIZE and CONFIG_BTU_TASK_STACK_SIZE
Not calling esp_bt_controller_mem_release(ESP_BT_MODE_CLASSIC_BT) (which I usually call before enabling anything).
Enabling/disabling WiFi coexist (I am not using WiFi functions at all).

If there is anything else I should try, please let me know.

Apologies for the large Github issue, this error has been troubling me for some time and I would like to know what I can try next. Thank you.

Environment

Key	Value
Development Kit	Custom Board
Module or chip used	ESP32-WROVER-32
IDF version	91f29bef172a082cbd8f0208ed1757ede0e1d635 - tracking release/v3.3
Build System	Make
Compiler version	crosstool-ng-1.22.0-80-g6c4433a
Operating System	macOS
Power Supply	external 3.3V

0neblock commented 5 years ago

One thing to note, which I just thought of: I am compiling with Arduino-ESP32 as a component. I am not sure if that would affect anything, as I am using esp-idf for all the Bluetooth-related functionality, because I found the BLE library in Arduino a bit too resource-intensive.

0neblock commented 5 years ago

Hey @igrr (don't know who else to @), would someone be able to have a look at this? This issue is still ongoing for me.

gengyuchao commented 5 years ago

Hello, sorry for the late reply. Did you add a lot of printing in esp_gap_cb? I recommend that you turn off unnecessary printing, for example by removing the print of the device name and address when scanning. This is not the final solution, but please try it and see if the issue is resolved. Thank you.

0neblock commented 5 years ago

Hi @gengyuchao, thanks for your reply. I do no printing in esp_gap_cb, and I also keep my processing time to a minimum by adding most events to a queue via xQueueSendFromISR and processing them in another thread.

gengyuchao commented 5 years ago

According to your description, I have not been able to reproduce the problem. Can you give me a sample code of your problem? So I can try to track this problem, thank you.

0neblock commented 5 years ago

Hi @gengyuchao, the problem can take anywhere from 20 minutes to 3 days to happen, and my app cannot be shared in its current state; however, I can share the ELF for analysing the coredump if you like.

I will try to build a smaller program that can reproduce the issue.

It seems that the pre-compiled bt lib is the component that is crashing. Is there any way I can diagnose the current state of the esp32 bt lib when I detect an error?

0neblock commented 5 years ago

Seems to be related to #4196

0neblock commented 5 years ago

Hi @prasad-alatkar is there any update on this or Issue #4196 ? We are about to enter production and this is still an ongoing issue for us, we are having to call esp_restart every time the BT Controller fails, which is not ideal.

Sushant-Espressif commented 5 years ago

@0neblock

> Hi @prasad-alatkar is there any update on this or Issue #4196 ? We are about to enter production and this is still an ongoing issue for us, we are having to call esp_restart every time the BT Controller fails, which is not ideal.

We are working on a BT controller firmware fix for this issue. In the BLE scan scenario, a couple of issues have been observed where the BT controller either reboots with a controller-level malfunction error code or just stops responding without any known error. The issue is related to the handling of scan reports in the BT controller when a large number of scan reports arrive in a short time frame. We will release further details and an updated bt lib as soon as possible. Thanks.

0neblock commented 5 years ago

Thanks for your prompt response, good to know you have identified the issue source. Look forward to applying a fix!

0neblock commented 4 years ago

Hi @csushantk I noticed that a few commits were pushed to the Github repo recently, can you please confirm if your fix for this issue was included?

I am tracking release/v3.3 for my IDF toolchain. I am testing this latest release now. Thanks.

pschlang commented 4 years ago

Observing the same issue with v4.0-beta2. Any news here @csushantk ? This is really urgent for us.

plebed commented 4 years ago

The same issue in v3.3.1. As an ugly workaround, we restart the chip if scan_result->scan_rst.num_resps (in ESP_GAP_SEARCH_INQ_CMPL_EVT) has not changed during five scans.

It is strange that scan_result->scan_rst.num_resps is not reset between scans.
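
The workaround described above could be sketched roughly as follows. This is only an illustration under assumptions: the struct and function names are hypothetical, the detector is plain C with no ESP-IDF dependency, and it relies on the observation above that num_resps keeps accumulating between scans.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define STALL_SCAN_LIMIT 5  /* "five scans", taken from the comment above */

typedef struct {
    int last_num_resps;   /* value seen at the previous scan completion */
    int unchanged_scans;  /* consecutive completions with no change */
} scan_stall_detector_t;

static inline void scan_stall_init(scan_stall_detector_t *d)
{
    assert(d != NULL);
    d->last_num_resps = 0;
    d->unchanged_scans = 0;
}

/* Call once per ESP_GAP_SEARCH_INQ_CMPL_EVT with scan_rst.num_resps.
 * Returns true once the count has stayed the same for STALL_SCAN_LIMIT
 * consecutive scans, i.e. when the caller may want to restart. */
static inline bool scan_stall_update(scan_stall_detector_t *d, int num_resps)
{
    assert(d != NULL);
    if (num_resps != d->last_num_resps) {
        d->last_num_resps = num_resps;
        d->unchanged_scans = 0;
        return false;
    }
    if (d->unchanged_scans < STALL_SCAN_LIMIT) {
        d->unchanged_scans++;
    }
    return d->unchanged_scans >= STALL_SCAN_LIMIT;
}
```

Note that this kind of check only makes sense when at least one advertiser is expected to be in range; in a quiet environment an unchanged count is normal and would trigger a false restart.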

0neblock commented 4 years ago

Any updates from anyone on this bug? Still an issue with the latest release/v3.3 branch

plebed commented 4 years ago

I updated our firmware to v4.1 (this error did not reproduce in v4.1).

GianlucaLoi commented 4 years ago

Hello @Sushant-Espressif ,

I'm having the same issue as @0neblock with the same "fw environment". I'm using ESP-IDF v4.3-dev-907-g6c17e3a64. Any update about this problem?

Regards,

Gianluca.

Sushant-Espressif commented 4 years ago

@GianlucaLoi @0neblock In our local setup, with Bluedroid Host, we are not able to reproduce the issue of "BLE stops scanning randomly" (tested for one week continuously). Can you please provide more details to reproduce this issue?

Are there excessive prints in your application?
Is application task set to higher priority and hogging the CPU?
Is it possible to share any other details about the application so that we can quickly reproduce the issue?

GianlucaLoi commented 4 years ago

Hello @Sushant-Espressif ,

Thanks for the response.

> 1. Are there excessive prints in your application?

I have very few prints while my FW is running. You can see an example under point 3.

> 2. Is application task set to higher priority and hogging the CPU?

Could you be more specific?

> 3. Is it possible to share any other details about the application so that we can quickly reproduce the issue?

In my application I have WiFi (STA mode), MQTT (no SSL) and BLE. What my FW does is:

At startup it waits for a WiFi Smart Configuration.
Once it is connected to WiFi, the MQTT and BLE tasks are initialized, a periodic active scan starts (period about 5 seconds), and a task (TASK1) is created to process the BLE scan data and send it via MQTT.
If there is any data that needs to be published, TASK1 calls the MQTT publish function.

The FW then keeps scanning and sending with a 5 second period. An example of the prints is shown below.

I (1231732) TASK1: [APP] Free memory: 4220620 bytes
I (1231746) BLE: Scan started
I (1231833) MQTT: MQTT_EVENT_DATA
I (1236748) BLE: Scan restarting...
I (1236748) TASK1: [APP] Free memory: 4222208 bytes
I (1236751) BLE: SCAN PARAM SET COMPLETE
I (1236763) BLE: Scan started
I (1241765) BLE: Scan restarting...
I (1241768) BLE: SCAN PARAM SET COMPLETE
I (1241773) MQTT: sent publish successful, msg_id=0
I (1241773) TASK1: [APP] Free memory: 4220496 bytes
I (1241815) MQTT: MQTT_EVENT_DATA
E (1249770) BT_HCI: command_timed_out hci layer timeout waiting for response to a command. opcode: 0x200c

If you need more information, I will be glad to give them.

Regards,

Gianluca.

GianlucaLoi commented 4 years ago

Hello @Sushant-Espressif ,

Do you have any update about this problem?

EDIT: @igrr do you have any solution for this problem? I pulled ESP-IDF v4.3-dev-1197-g8bc19ba89, which contains a lot of Bluetooth bugfixes, but I still have this problem.

Regards,

Gianluca.

chhajedji commented 4 years ago

Hi @GianlucaLoi Can you tell us whether you are using the advertising report flow control and scan duplicate filtering options? These options can be found in sdkconfig under the names CONFIG_BTDM_BLE_ADV_REPORT_FLOW_CTRL_SUPP and CONFIG_BTDM_BLE_SCAN_DUPL.

GianlucaLoi commented 4 years ago

Hello @chhajedji They are both set as YES. Regards, Gianluca.

chhajedji commented 4 years ago

Hi @GianlucaLoi We tried to reproduce your issue, but we didn't get any success. Can you follow below steps and share the logs with us.

Apply the patch in the attached tarball and also replace the bt lib at $IDF_PATH/components/bt/controller/lib/libbtdm_app.a with the lib in the tarball.
Run your program and, when the crash occurs, store all the logs in a file.

The tarball contains a patch for esp-idf and a bt lib. These are not official versions; they are only meant to capture logs of the exact scenario.

gh_timeout.tar.gz

GianlucaLoi commented 4 years ago

Hello @chhajedji

I performed your steps but, in the Linker section, I obtain these errors from the libbtdm_app.a:

undefined reference to `ke_task_env'
undefined reference to `ke_handler_search'
undefined reference to `ld_pscan_frm_cbk'

How can I solve them? Regards,

Gianluca.

chhajedji commented 4 years ago

Can you try doing a git fetch and git submodule update --init --recursive? Since you are using the current master branch and it is updated frequently, you will have to do a git checkout 0289d1cc81c210b719f28c65f113c45f9afd2c7b, as I created the given patch on this commit.

Also note that you will have to first update submodules (git submodule update --init --recursive) then replace given library. And similarly git fetch and git checkout 0289d1cc81c210b719f28c65f113c45f9afd2c7b and then apply given patch.

GianlucaLoi commented 4 years ago

Hello @chhajedji

I'm still doing your test because I have to adapt some functions to work with your repository. In the meantime I tested ESP-IDF v4.1 and I see this problem with that version too.

One more piece of information that may help to understand the problem: because I need an active scan, every 5 seconds I re-set the scan params. When this phase is complete I restart the scanning.

static esp_ble_scan_params_t ble_scan_params = {
    .scan_type              = BLE_SCAN_TYPE_ACTIVE,
    .own_addr_type          = BLE_ADDR_TYPE_PUBLIC,
    .scan_filter_policy     = BLE_SCAN_FILTER_ALLOW_ALL,
    .scan_interval          = 0x50,
    .scan_window            = 0x30
};

...
case ESP_GAP_BLE_SCAN_PARAM_SET_COMPLETE_EVT:
    if(param->scan_param_cmpl.status == ESP_BT_STATUS_SUCCESS)
    {
        ESP_LOGI(BLE_LOG_TAG,"SCAN PARAM SET COMPLETE");
        esp_ble_gap_start_scanning(5);
    }
    else
    {
        ESP_LOGE(BLE_LOG_TAG,"SCAN PARAM SET NOT COMPLETE");
    }
    break;

...
case ESP_GAP_SEARCH_INQ_CMPL_EVT:
    ...
    esp_ble_gap_set_scan_params(&ble_scan_params);
    break;
...

Regards,

Gianluca.

chhajedji commented 4 years ago

Hi @GianlucaLoi

I am also testing with these parameters to see if I can reproduce it. In case you get the crash, please share the logs.

GianlucaLoi commented 4 years ago

Hello @chhajedji and @0neblock

I haven't tried your suggestion yet but it seems I resolved with these modifications in the SDK:

CONFIG_BT_BTC_TASK_STACK_SIZE=4096
CONFIG_BT_BTU_TASK_STACK_SIZE=4096
CONFIG_BT_ALLOCATION_FROM_SPIRAM_FIRST=y
CONFIG_BT_BLE_DYNAMIC_ENV_MEMORY=y
CONFIG_BT_BLE_HOST_QUEUE_CONG_CHECK=y

In your opinion, is this setting a valid solution for this problem? Regards, Gianluca

vbvchauthmal commented 3 years ago

We are facing a similar issue on production devices. For now I have added a call to esp_restart() for when the BLE scan stops, which happens at random intervals without any error being raised. I am using ESP-IDF version v3.3.1.

chhajedji commented 3 years ago

Hi @vbvchauthmal Your ESP-IDF version looks a bit old. The fix for this issue has been backported to the v3.3 release in commit 86de4055. We have released v3.3.5, which also contains the fix. Please try with this version.

vbvchauthmal commented 3 years ago

Hi @chhajedji, I have updated the ESP-IDF version to v3.3.5 and updated the submodules using the command git submodule update --init --recursive. After the IDF version upgrade, my firmware build fails with the following logs:

Python requirements from /home/yantrr/git-repos/OMNY_rel2_3/src/esp-idf/requirements.txt are satisfied.
Building partitions from /home/yantrr/git-repos/OMNY_rel2_3/src/HUB-FW/partitions.csv...
usage: espsecure sign_data [-h] --version {1,2} --keyfile KEYFILE
                           [KEYFILE ...] [--output OUTPUT]
                           datafile
espsecure sign_data: error: argument --version/-v is required
/home/yantrr/git-repos/OMNY_rel2_3/src/esp-idf/components/partition_table/Makefile.projbuild:53: recipe for target /home/yantrr/git-repos/OMNY_rel2_3/src/HUB-FW/build/partitions.bin' failed
make[1]: *** [/home/yantrr/git-repos/OMNY_rel2_3/src/HUB-FW/build/partitions.bin] Error 2
Makefile:49: recipe for target 'firmware' failed
make: *** [firmware] Error 2

chhajedji commented 3 years ago

Hi @vbvchauthmal

This looks like a bug in esp-idf. Can you apply this patch and try to build. This is not an exact fix, but for your case this should work.

build_fix.txt

vbvchauthmal commented 3 years ago

Hi @chhajedji I used a patch from the esptool repo, which resolved my firmware build and flashing issue. Patch link for fixing the sign_data argument error: 5b8c2f1b02d0e4b9bac0756b3ead66e8beea428a.

I flashed my firmware with the IDF version update to v3.3.5 as the only change, and I still observe the BLE scan stopping at random intervals. I flashed this update on multiple ESP32-based hubs for testing, and all of them show this issue.

vbvchauthmal commented 3 years ago

Hi @chhajedji, Do you have any update on this?

I have added calls to the BT controller, VHCI and Bluedroid status APIs in my code, and I observe the following logs when the BLE scan has stopped or is not responding, before I execute esp_restart() to reboot:

yantrr@yantrr-ws2:rel2-test$ egrep "scan_evt timeout|ble_timer_scan_fail_count|Get BLE TX power for PWR_TYPE_SCAN|BT controller status|BT controller status|vhci host check for sending packet to controller|BT_HCI|Version:|IDF" usb2_2021-05-14-22.22.02.log
I (1057) cpu_start: ESP-IDF:          v3.3.5-dirty
I (6090) [APP-MAIN]: Version: v2.3.2-beta8-ohirel2_3   MacID: 24:62:ab:ef:b2:88
scan_evt timeout
E (3005880) BT_HCI: command_timed_out hci layer timeout waiting for response to a command. opcode: 0x200c
I (3033940) [BLE-MASTER]: ble_timer_scan_fail_count : 1
I (3033940) [BLE-MASTER]: ble_timer_scan_fail_count failed, esp_restart()
I (3033940) [BLE-MASTER]: Get BLE TX power for PWR_TYPE_SCAN : 5
I (3033950) [BLE-MASTER]: BT controller status : 2
I (3033950) [BLE-MASTER]: vhci host check for sending packet to controller, status : 1
I (3033950) [BLE-MASTER]: Bluedroid Status : 2
I (1055) cpu_start: ESP-IDF:          v3.3.5-dirty
I (6104) [APP-MAIN]: Version: v2.3.2-beta8-ohirel2_3   MacID: 24:62:ab:ef:b2:88
scan_evt timeout
scan_evt timeout
E (6569124) BT_HCI: command_timed_out hci layer timeout waiting for response to a command. opcode: 0x200c
I (6616944) [BLE-MASTER]: ble_timer_scan_fail_count : 1
I (6616944) [BLE-MASTER]: ble_timer_scan_fail_count failed, esp_restart()
I (6616944) [BLE-MASTER]: Get BLE TX power for PWR_TYPE_SCAN : 5
I (6616954) [BLE-MASTER]: BT controller status : 2
I (6616954) [BLE-MASTER]: vhci host check for sending packet to controller, status : 1
I (6616954) [BLE-MASTER]: Bluedroid Status : 2
I (1055) cpu_start: ESP-IDF:          v3.3.5-dirty
I (6104) [APP-MAIN]: Version: v2.3.2-beta8-ohirel2_3   MacID: 24:62:ab:ef:b2:88
scan_evt timeout
E (4419764) BT_HCI: command_timed_out hci layer timeout waiting for response to a command. opcode: 0x200c
I (4538624) [BLE-MASTER]: ble_timer_scan_fail_count : 1
I (4538624) [BLE-MASTER]: ble_timer_scan_fail_count failed, esp_restart()
I (4538624) [BLE-MASTER]: Get BLE TX power for PWR_TYPE_SCAN : 5
I (4538634) [BLE-MASTER]: BT controller status : 2
I (4538634) [BLE-MASTER]: vhci host check for sending packet to controller, status : 1
I (4538634) [BLE-MASTER]: Bluedroid Status : 2
I (1055) cpu_start: ESP-IDF:          v3.3.5-dirty
I (6104) [APP-MAIN]: Version: v2.3.2-beta8-ohirel2_3   MacID: 24:62:ab:ef:b2:88

I have applied the sdkconfig parameters suggested by @GianlucaLoi, but that didn't help resolve this issue.
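
To help read the numeric status values in the log above, here is a hedged sketch of how they might map to names. It assumes the values follow the declaration order of esp_bt_controller_status_t (esp_bt.h) and esp_bluedroid_status_t (esp_bt_main.h); the local constants and function names below are illustrative mirrors, not ESP-IDF identifiers, and should be checked against the headers of the IDF version in use.

```c
#include <stdbool.h>

/* ASSUMPTION: local mirrors of the ESP-IDF status enums, so that this
 * sketch builds off-target. Verify the values against your IDF headers. */
enum { CTRL_IDLE = 0, CTRL_INITED = 1, CTRL_ENABLED = 2 };
enum { BLUEDROID_UNINITIALIZED = 0, BLUEDROID_INITIALIZED = 1, BLUEDROID_ENABLED = 2 };

/* Name for a value as returned by esp_bt_controller_get_status(). */
static inline const char *ctrl_status_name(int status)
{
    switch (status) {
    case CTRL_IDLE:    return "IDLE";
    case CTRL_INITED:  return "INITED";
    case CTRL_ENABLED: return "ENABLED";
    default:           return "?";
    }
}

/* Name for a value as returned by esp_bluedroid_get_status(). */
static inline const char *bluedroid_status_name(int status)
{
    switch (status) {
    case BLUEDROID_UNINITIALIZED: return "UNINITIALIZED";
    case BLUEDROID_INITIALIZED:   return "INITIALIZED";
    case BLUEDROID_ENABLED:       return "ENABLED";
    default:                      return "?";
    }
}

/* True when both layers claim to be enabled. */
static inline bool bt_stack_reports_enabled(int ctrl_status, int bluedroid_status)
{
    return ctrl_status == CTRL_ENABLED && bluedroid_status == BLUEDROID_ENABLED;
}
```

Under that assumption, "BT controller status : 2" and "Bluedroid Status : 2" both read as enabled, which would mean these status calls alone do not reveal the hang.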

chrisomatic commented 3 years ago

We've noticed this same issue on multiple versions of the IDF (v3.3.3, v4.1, v4.2, v4.2.1). BLE scanning and BLE server advertising will stop at intermittent times and require a power cycle to fix. Sometimes this happens after an hour, sometimes after a few days. We've implemented a daily reboot and have attempted to implement a system where a reboot will happen if the scan event hasn't fired for a while. We've tried many different configurations, but nothing has worked to resolve the issue. It seems to happen more frequently when a lot of BLE devices are around with scanning. This is a major concern for us.
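
A time-based check of the kind mentioned above ("reboot if the scan event hasn't fired for a while") might look like the following sketch. The type and function names are hypothetical, and the code is plain C: the caller supplies a millisecond timestamp from whatever clock the application already uses.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t last_result_ms;  /* time of the most recent scan result */
    uint32_t timeout_ms;      /* silence longer than this counts as a hang */
} scan_watchdog_t;

static inline void scan_watchdog_init(scan_watchdog_t *w, uint32_t now_ms,
                                      uint32_t timeout_ms)
{
    assert(w != NULL);
    w->last_result_ms = now_ms;
    w->timeout_ms = timeout_ms;
}

/* Call whenever a scan result event arrives. */
static inline void scan_watchdog_feed(scan_watchdog_t *w, uint32_t now_ms)
{
    assert(w != NULL);
    w->last_result_ms = now_ms;
}

/* Call periodically from another task; true means no result was seen
 * for longer than timeout_ms. Unsigned subtraction keeps the comparison
 * correct when the millisecond counter wraps around. */
static inline bool scan_watchdog_expired(const scan_watchdog_t *w, uint32_t now_ms)
{
    assert(w != NULL);
    return (uint32_t)(now_ms - w->last_result_ms) > w->timeout_ms;
}
```

In an application, the feed call would sit in the GAP callback's scan result branch and the expiry check in a periodic task. The timeout has to be longer than the longest quiet period that is normal for the deployment, otherwise an empty room looks like a hang.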

pschlan commented 3 years ago

Same here, even switched to entirely different SoC for a large-volume product because of this.

chhajedji commented 3 years ago

Hi @vbvchauthmal,

I will try to recreate the issue. I tried the same earlier for @GianlucaLoi, but before I could recreate it, changing some parameters helped in their case. Please share some more details about your failing scenario so that I can reproduce it.

Which commit id are you using for your application?
What exactly are you doing in your application (scanning/advertising/both, anything else also) and what are the parameters for the same (scan params, adv params)?
Which idf example will most closely resemble your application and what are the changes in it to emulate your scenario. Or if you can share your application, that would be better.
How many devices are there in the vicinity and what are they doing (how many advertisers and scanners nearby)?
How long does this issue take to occur for your case and does this time vary?

Also please provide any other information you feel which could be helpful to recreate or solve this issue.

vbvchauthmal commented 3 years ago

Hi @chhajedji

> Which commit id are you using for your application?

I am using ESP-IDF version v3.3.5 (commit id: 03810c4a065a1ecdd24a803b2c9dc4e834c7dab5) following your suggestion; earlier I was using v3.3.1 (commit id: 143d26aa49df524e10fb8e41a71d12e731b9b71d).

> What exactly are you doing in your application (scanning/advertising/both, anything else also) and what are the parameters for the same (scan params, adv params)?

I am doing BLE scanning, and below are the scan parameters set in my source code:

ble_scan_params = {
    .scan_type              = BLE_SCAN_TYPE_ACTIVE,
    .own_addr_type          = BLE_ADDR_TYPE_PUBLIC,
    .scan_filter_policy     = BLE_SCAN_FILTER_ALLOW_ALL,
    .scan_interval          = 0xF0,                         // Interval between the start of two consecutive scan windows. Dec(0xF0) = 240 x 0.625 = 150ms
    .scan_window            = 0xF0                          // The duration in which the Link Layer scans on one channel. Dec(0xF0) = 240 x 0.625 = 150ms
};

> Which idf example will most closely resemble your application and what are the changes in it to emulate your scenario. Or if you can share your application, that would be better.

The idf example that most closely resembles my application is gattc_multi_connect. My application extends it by setting BLE GAP security parameters and supporting five BLE peripheral devices. Only one BLE peripheral is allowed to connect at a time, when its broadcast is captured, to get sensor readings through BLE notifications/indications.

> How many devices are there in the vicinity and what are they doing (how many advertisers and scanners nearby)?

So far we have deployed 6000 of our ESP32-based platforms with this firmware, and each has a different number of BLE devices in the vicinity, which can be advertisers or scanners. Most of the deployed units show this issue.

> How long does this issue take to occur for your case and does this time vary?

It occurs at random: sometimes after a week, sometimes after a few minutes or hours.

> Also please provide any other information you feel which could be helpful to recreate or solve this issue.

Query :

Is there any API (for checking the Bluetooth radio / BLE scanning status) that I can call to find out whether the BLE scan has stopped or hung? I have observed that this issue sometimes reproduces without any BT_HCI errors in the log, so I want to get the status of the underlying BLE scan so that I can reset Bluetooth instead of calling esp_restart().
What is the proper sequence to reset bluetooth ?

Rokachy commented 3 years ago

We are facing the same issue. We are working with the 4.2.1 release; we also tried the v4.3-beta3 tag and the v4.4-dev tag, and the issue is there as well. When this issue occurs, the BT radio status looks good and no error is reported, so software cannot detect it. We tried many options to reset BT only, but scanning is not operational after that. Only esp_restart() recovers it, but we cannot rely on that, since our app cannot lose the BT radio for more than 1 minute (it is not feasible for us to run esp_restart() every minute!). This issue will kill our project.

TianaESP commented 3 years ago

@Rokachy Can you please try with the latest v4.3 release? We did a test and did not reproduce the issue. We are still testing with mass devices on the same.

Rokachy commented 3 years ago

Yes, I will, and I will let you know the results. It is a random issue; it can take from a few hours to a few days to appear. Is there anything that I can read or any status I can get from the device to help debug it? Do you want me to enable anything other than logs?

Yehuda

TianaESP commented 3 years ago

@Rokachy Yes, it is a random issue. I'm afraid there is no need to enable anything at the moment; the best would be a packet capture. Please try with the latest v4.3 first. Thanks.

Rokachy commented 3 years ago

It has been running with the v4.3 release for a few days with no issue so far :-) We will continue to run it for a few more days. What has been fixed in SDK v4.3?

0neblock commented 3 years ago

@TianaESP Any chances of this (possible) fix being backported to v3.3?

TianaESP commented 3 years ago

@0neblock The fixes were backported to v3.3, v4.0, v4.1, v4.2. Please try the latest v3.3. Thanks.

TianaESP commented 3 years ago

@Rokachy We fixed bugs in modem sleep that we suspected were contributing to the problem. Please let us know if the issue happens again. Thanks.

Rokachy commented 3 years ago

Thank you for the support, we are using release 4.3 and we didn't see the issue for the last 2 weeks. I hope it will be kept like this 😀.

Thank you for the good work. Yehuda

Rokachy commented 3 years ago

> The fixes were backported to v3.3, v4.0, v4.1, v4.2. Please try the latest v3.3.

is the fix backported to 4.0.3 version too?

Alvin1Zhang commented 3 years ago

Thanks for sharing updates. The fix has already been backported to release/4.0: https://github.com/espressif/esp-idf/commit/b89b1ec0223d4f021ce401a8f85e751dd67c8358. Thanks.

MartinTJDK commented 3 years ago

I have observed an issue which might relate to this. At least the result is the same: the BLE stack stops working properly...

In my situation, I am scanning for BLE advertisements, and at some point in time, the scan stops. It typically happens when ESP32 is busy (e.g. writing a lot of information to Debug Console).

I enabled "CONFIG_BLE_HOST_QUEUE_CONGESTION_CHECK", which helps a lot, and actually shows, that BTU queue "often" has congestion.

But I also observed (easy to reproduce by changing "BT_QUEUE_CONGEST_SIZE" to 20 in file "bt_common.h") that when congestion occurs, it actually locks the BLE stack completely. Sometimes it does not recover, and I believe this is caused by the "hciH4T" task having higher priority and a lot of events to process.

If I allow "hciH4T", "btuT" and "BTC_TASK" tasks to use same priority, I do not see this lockup. Perhaps someone from Espressif (@TianaESP) could confirm the issue?

juanaviladev commented 2 years ago

Hi! Any update about this? I noticed this same issue when using BLE + Classic (v4.4-beta1). Thanks!

edit: v4.3.1 is also affected
