Open 0neblock opened 5 years ago
One thing to note which I just thought of, I am compiling with Arduino-ESP32 as a component. I am not sure if that would affect anything as I am using esp-idf for all the bluetooth-related functionality, as i found the BLE library in Arduino a bit too resource-intensive.
Hey @igrr (I don't know who else to @), would someone be able to have a look at this? This issue is still ongoing for me.
Hello, sorry for the late reply. Have you added a lot of printing in esp_gap_cb? I recommend turning off any unnecessary printing, such as printing the device name and address while scanning. This is not the final solution, but please try it and see whether the issue is resolved. Thank you.
Hi @gengyuchao, thanks for your reply. I do no printing in esp_gap_cb, and i also keep my processing time to a minimum, by adding most events to a queue via xQueueSendFromISR and processing them in another thread.
According to your description, I have not been able to reproduce the problem. Can you give me a sample code of your problem? So I can try to track this problem, thank you.
Hi @gengyuchao, The problem can take anywhere from 20 minutes to 3 days to happen, and my app is not able to be shared in its current state, however i can share the elf to analyse the coredump if you like.
I will try to build a smaller program that can reproduce the issue.
It seems that the pre-compiled bt lib is the component that is crashing. Is there any way I can diagnose the current state of the esp32 bt lib when I detect an error?
Seems to be related to #4196
Hi @prasad-alatkar is there any update on this or Issue #4196 ? We are about to enter production and this is still an ongoing issue for us, we are having to call esp_restart every time the BT Controller fails, which is not ideal.
We are working on a BT controller firmware fix for this issue. In the BLE scan scenario, a couple of issues have been observed where the BT controller either reboots with a controller-level malfunction error code or simply stops responding without any known error. The issue is related to the handling of scan reports in the BT controller when a large number of scan reports arrive in a short time frame. We will release further details and an updated bt lib as soon as possible. Thanks.
Thanks for your prompt response, good to know you have identified the issue source. Look forward to applying a fix!
Hi @csushantk I noticed that a few commits were pushed to the Github repo recently, can you please confirm if your fix for this issue was included?
I am tracking release/v3.3 for my IDF toolchain. I am testing this latest release now. Thanks.
Observing the same issue with v4.0-beta2. Any news here @csushantk ? This is really urgent for us.
The same issue occurs in v3.3.1. As an ugly workaround, we restart the chip if scan_result->scan_rst.num_resps (in ESP_GAP_SEARCH_INQ_CMPL_EVT) has not changed over five scans.
It is strange that scan_result->scan_rst.num_resps is not reset between scans.
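The workaround described above can be sketched as a small counter that is updated once per scan completion. This is only an illustration in plain C: `scan_stall_detector_t`, `scan_stall_init`, `scan_stall_update` and `SCAN_STALL_LIMIT` are hypothetical names, not ESP-IDF APIs, and the threshold of five is taken from the comment above.

```c
#include <stdbool.h>

/* Hypothetical helper (not part of ESP-IDF): tracks num_resps across scans. */
#define SCAN_STALL_LIMIT 5u

typedef struct {
    int      last_num_resps;  /* value seen at the previous scan completion */
    unsigned unchanged_scans; /* consecutive completions with the same value */
    bool     primed;          /* false until the first completion is recorded */
} scan_stall_detector_t;

static void scan_stall_init(scan_stall_detector_t *d)
{
    d->last_num_resps = 0;
    d->unchanged_scans = 0;
    d->primed = false;
}

/* Call once per ESP_GAP_SEARCH_INQ_CMPL_EVT with scan_rst.num_resps.
 * Returns true once the value has stayed the same for SCAN_STALL_LIMIT
 * consecutive scans after the first recorded one. */
static bool scan_stall_update(scan_stall_detector_t *d, int num_resps)
{
    if (d->primed && num_resps == d->last_num_resps) {
        d->unchanged_scans++;
    } else {
        d->unchanged_scans = 0;
        d->primed = true;
    }
    d->last_num_resps = num_resps;
    return d->unchanged_scans >= SCAN_STALL_LIMIT;
}
```

In an application, a `true` return value would be the point at which the chip is restarted (for example with esp_restart(), as other commenters in this thread do). Keeping the detector free of ESP-IDF types makes the counting logic easy to test on a host machine.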
Any updates from anyone on this bug? Still an issue with the latest release/v3.3 branch
I updated our firmware to v4.1 (this error did not reproduce in v4.1).
Hello @Sushant-Espressif ,
I'm having the same issue of @0neblock with the same "fw environment". I'm using ESP-IDF v4.3-dev-907-g6c17e3a64. Any update about this problem?
Regards,
Gianluca.
@GianlucaLoi @0neblock In our local setup, with Bluedroid Host, we are not able to reproduce the issue of "BLE stops scanning randomly" (tested for one week continuously). Can you please provide more details to reproduce this issue?
- Are there excessive prints in your application?
- Is application task set to higher priority and hogging the CPU?
- Is it possible to share any other details about the application so that we can quickly reproduce the issue?
Hello @Sushant-Espressif ,
Thanks for the response.
> 1. Are there excessive prints in your application?

I have very few prints while my firmware is running. Point 3 shows an example.

> 2. Is application task set to higher priority and hogging the CPU?

Could you be more specific?

> 3. Is it possible to share any other details about the application so that we can quickly reproduce the issue?

In my application I have WiFi (STA mode), MQTT (no SSL) and BLE. What my FW does is:
I (1231732) TASK1: [APP] Free memory: 4220620 bytes
I (1231746) BLE: Scan started
I (1231833) MQTT: MQTT_EVENT_DATA
I (1236748) BLE: Scan restarting...
I (1236748) TASK1: [APP] Free memory: 4222208 bytes
I (1236751) BLE: SCAN PARAM SET COMPLETE
I (1236763) BLE: Scan started
I (1241765) BLE: Scan restarting...
I (1241768) BLE: SCAN PARAM SET COMPLETE
I (1241773) MQTT: sent publish successful, msg_id=0
I (1241773) TASK1: [APP] Free memory: 4220496 bytes
I (1241815) MQTT: MQTT_EVENT_DATA
E (1249770) BT_HCI: command_timed_out hci layer timeout waiting for response to a command. opcode: 0x200c
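The opcode in the timeout message can be decoded by hand. In the Bluetooth Core Specification, a 16-bit HCI command opcode packs a 6-bit OGF (group) in the upper bits and a 10-bit OCF (command) in the lower bits. The helper names below are hypothetical and exist only for this sketch.

```c
#include <stdint.h>

/* Hypothetical helpers: split a 16-bit HCI opcode into its two fields. */
static inline uint8_t hci_opcode_ogf(uint16_t opcode)
{
    return (uint8_t)(opcode >> 10);      /* upper 6 bits: command group */
}

static inline uint16_t hci_opcode_ocf(uint16_t opcode)
{
    return (uint16_t)(opcode & 0x03FFu); /* lower 10 bits: command within group */
}
```

For 0x200c this gives OGF 0x08 (LE Controller commands) and OCF 0x000C, which the Core Specification lists as LE Set Scan Enable. That matches the log: the command that never gets a response is the one issued when the scan is restarted.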
If you need more information, I will be glad to give them.
Regards,
Gianluca.
Hello @Sushant-Espressif ,
Do you have any update about this problem?
EDIT: @igrr do you have any solution for this problem? I pulled ESP-IDF v4.3-dev-1197-g8bc19ba89, which contains a lot of Bluetooth bugfixes, but I still have this problem.
Regards,
Gianluca.
Hi @GianlucaLoi
Can you tell us whether you are using the advertising report flow control and scan duplicate filtering options?
These options can be found in sdkconfig under the names CONFIG_BTDM_BLE_ADV_REPORT_FLOW_CTRL_SUPP and CONFIG_BTDM_BLE_SCAN_DUPL.
Hello @chhajedji They are both set as YES. Regards, Gianluca.
Hi @GianlucaLoi We tried to reproduce your issue, but without success. Can you follow the steps below and share the logs with us?
Replace $IDF_PATH/components/bt/controller/lib/libbtdm_app.a by the lib in the tarball.
Hello @chhajedji
I performed your steps but, in the Linker section, I obtain these errors from the libbtdm_app.a:
How can I solve them? Regards,
Gianluca.
Can you try doing a git fetch and git submodule update --init --recursive.
Since you are using the current master branch and it is updated frequently, you will have to do a git checkout 0289d1cc81c210b719f28c65f113c45f9afd2c7b, as I created the given patch on this commit.
Also note that you will have to first update submodules (git submodule update --init --recursive) and then replace the given library. And similarly git fetch and git checkout 0289d1cc81c210b719f28c65f113c45f9afd2c7b and then apply the given patch.
Hello @chhajedji
I'm still running your test because I have to adapt some functions to work with your repository. In the meantime I tested ESP-IDF v4.1 and I see this problem with that version too.
One more piece of information that may help to understand the problem: because I need an active scan, I re-set the scan params every 5 seconds. When this phase is complete I restart the scanning.
```c
static esp_ble_scan_params_t ble_scan_params = {
    .scan_type = BLE_SCAN_TYPE_ACTIVE,
    .own_addr_type = BLE_ADDR_TYPE_PUBLIC,
    .scan_filter_policy = BLE_SCAN_FILTER_ALLOW_ALL,
    .scan_interval = 0x50,
    .scan_window = 0x30
};
...
case ESP_GAP_BLE_SCAN_PARAM_SET_COMPLETE_EVT:
    if (param->scan_param_cmpl.status == ESP_BT_STATUS_SUCCESS)
    {
        ESP_LOGI(BLE_LOG_TAG, "SCAN PARAM SET COMPLETE");
        esp_ble_gap_start_scanning(5);
    }
    else
    {
        ESP_LOGE(BLE_LOG_TAG, "SCAN PARAM SET NOT COMPLETE");
    }
    break;
...
case ESP_GAP_SEARCH_INQ_CMPL_EVT:
    ...
    esp_ble_gap_set_scan_params(&ble_scan_params);
    break;
...
```
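For readers comparing scan settings across comments: BLE scan interval and scan window are expressed in units of 0.625 ms. The sketch below converts the raw values and computes the scan duty cycle; the function names are hypothetical and are not ESP-IDF APIs.

```c
#include <stdint.h>

/* Hypothetical helper: convert a scan interval/window value to microseconds.
 * One unit is 0.625 ms, i.e. 625 us. */
static uint32_t ble_scan_units_to_us(uint16_t units)
{
    return (uint32_t)units * 625u;
}

/* Hypothetical helper: share of each interval spent listening, in percent
 * (rounded down). Returns 0 for an invalid zero interval. */
static unsigned ble_scan_duty_percent(uint16_t interval, uint16_t window)
{
    if (interval == 0) {
        return 0;
    }
    return (unsigned)(((uint32_t)window * 100u) / interval);
}
```

With the values above, 0x50 is 80 units (50 ms) and 0x30 is 48 units (30 ms), so the radio listens for about 60% of each interval. A window equal to the interval corresponds to continuous scanning.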
Regards,
Gianluca.
Hi @GianlucaLoi
I am also testing with these parameters to see if I can reproduce it. In case you get the crash, please share the logs.
Hello @chhajedji and @0neblock
I haven't tried your suggestion yet but it seems I resolved with these modifications in the SDK:
In your opinion, is this setting a valid solution for this problem? Regards, Gianluca
I am facing a similar issue on production devices. For now I call esp_restart() when the BLE scan stops at a random interval without raising any error. I am using ESP-IDF version v3.3.1.
Hi @vbvchauthmal Your ESP-IDF version looks a bit old. The fix for this issue has been backported in commit 86de4055 for the v3.3 release. We have released v3.3.5, which also contains the fix. Please try with this version.
Hi @chhajedji,
I have updated the ESP-IDF version to v3.3.5 and updated the submodules using the command git submodule update --init --recursive.
After the IDF version upgrade, my firmware build fails with the following logs:
Python requirements from /home/yantrr/git-repos/OMNY_rel2_3/src/esp-idf/requirements.txt are satisfied.
Building partitions from /home/yantrr/git-repos/OMNY_rel2_3/src/HUB-FW/partitions.csv...
usage: espsecure sign_data [-h] --version {1,2} --keyfile KEYFILE
[KEYFILE ...] [--output OUTPUT]
datafile
espsecure sign_data: error: argument --version/-v is required
/home/yantrr/git-repos/OMNY_rel2_3/src/esp-idf/components/partition_table/Makefile.projbuild:53: recipe for target /home/yantrr/git-repos/OMNY_rel2_3/src/HUB-FW/build/partitions.bin' failed
make[1]: *** [/home/yantrr/git-repos/OMNY_rel2_3/src/HUB-FW/build/partitions.bin] Error 2
Makefile:49: recipe for target 'firmware' failed
make: *** [firmware] Error 2
Hi @vbvchauthmal
This looks like a bug in esp-idf. Can you apply this patch and try to build. This is not an exact fix, but for your case this should work.
Hi @chhajedji I used a patch from the esptool repo, which resolved my firmware build and flashing issue. Patch link for fixing the sign_data argument error: 5b8c2f1b02d0e4b9bac0756b3ead66e8beea428a.
I flashed my firmware with only the IDF version updated to v3.3.5 and I still observe BLE scanning stopping at random intervals. I flashed this update on multiple ESP32-based hubs for testing and all of them show the issue.
Hi @chhajedji, Do you have any update on this?
I have added calls to the BT controller, VHCI and Bluedroid get-status APIs in my code. I observe the following error logs when the BLE scan has stopped or is not responding, just before I execute esp_restart() to reboot:
yantrr@yantrr-ws2:rel2-test$ egrep "scan_evt timeout|ble_timer_scan_fail_count|Get BLE TX power for PWR_TYPE_SCAN|BT controller status|BT controller status|vhci host check for sending packet to controller|BT_HCI|Version:|IDF" usb2_2021-05-14-22.22.02.log
I (1057) cpu_start: ESP-IDF: v3.3.5-dirty
I (6090) [APP-MAIN]: Version: v2.3.2-beta8-ohirel2_3 MacID: 24:62:ab:ef:b2:88
scan_evt timeout
E (3005880) BT_HCI: command_timed_out hci layer timeout waiting for response to a command. opcode: 0x200c
I (3033940) [BLE-MASTER]: ble_timer_scan_fail_count : 1
I (3033940) [BLE-MASTER]: ble_timer_scan_fail_count failed, esp_restart()
I (3033940) [BLE-MASTER]: Get BLE TX power for PWR_TYPE_SCAN : 5
I (3033950) [BLE-MASTER]: BT controller status : 2
I (3033950) [BLE-MASTER]: vhci host check for sending packet to controller, status : 1
I (3033950) [BLE-MASTER]: Bluedroid Status : 2
I (1055) cpu_start: ESP-IDF: v3.3.5-dirty
I (6104) [APP-MAIN]: Version: v2.3.2-beta8-ohirel2_3 MacID: 24:62:ab:ef:b2:88
scan_evt timeout
scan_evt timeout
E (6569124) BT_HCI: command_timed_out hci layer timeout waiting for response to a command. opcode: 0x200c
I (6616944) [BLE-MASTER]: ble_timer_scan_fail_count : 1
I (6616944) [BLE-MASTER]: ble_timer_scan_fail_count failed, esp_restart()
I (6616944) [BLE-MASTER]: Get BLE TX power for PWR_TYPE_SCAN : 5
I (6616954) [BLE-MASTER]: BT controller status : 2
I (6616954) [BLE-MASTER]: vhci host check for sending packet to controller, status : 1
I (6616954) [BLE-MASTER]: Bluedroid Status : 2
I (1055) cpu_start: ESP-IDF: v3.3.5-dirty
I (6104) [APP-MAIN]: Version: v2.3.2-beta8-ohirel2_3 MacID: 24:62:ab:ef:b2:88
scan_evt timeout
E (4419764) BT_HCI: command_timed_out hci layer timeout waiting for response to a command. opcode: 0x200c
I (4538624) [BLE-MASTER]: ble_timer_scan_fail_count : 1
I (4538624) [BLE-MASTER]: ble_timer_scan_fail_count failed, esp_restart()
I (4538624) [BLE-MASTER]: Get BLE TX power for PWR_TYPE_SCAN : 5
I (4538634) [BLE-MASTER]: BT controller status : 2
I (4538634) [BLE-MASTER]: vhci host check for sending packet to controller, status : 1
I (4538634) [BLE-MASTER]: Bluedroid Status : 2
I (1055) cpu_start: ESP-IDF: v3.3.5-dirty
I (6104) [APP-MAIN]: Version: v2.3.2-beta8-ohirel2_3 MacID: 24:62:ab:ef:b2:88
I have included the sdkconfig parameters suggested by @GianlucaLoi, but they didn't help to resolve this issue.
We've noticed this same issue on multiple versions of the IDF (v3.3.3, v4.1, v4.2, v4.2.1). BLE scanning and BLE server advertising will stop at intermittent times and require a power cycle to fix. Sometimes this happens after an hour, sometimes after a few days. We've implemented a daily reboot and have attempted to implement a system where a reboot will happen if the scan event hasn't fired for awhile. We've tried many different configurations, but nothing has worked to resolve the issue. It seems to happen more frequently when a lot of BLE devices are around with scanning. This is a major concern for us.
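The "reboot if the scan event hasn't fired for a while" idea can be sketched as a software watchdog that is fed from the scan result handler and polled from a periodic task. This is a plain C model under stated assumptions: the type and function names are hypothetical, and timestamps are assumed to be monotonic microseconds (on the ESP32 they could come from esp_timer_get_time()).

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical scan watchdog (not part of ESP-IDF). */
typedef struct {
    int64_t last_event_us;  /* time of the most recent scan result */
    int64_t max_silence_us; /* longer silence is treated as a stalled scan */
} scan_watchdog_t;

static void scan_watchdog_init(scan_watchdog_t *w, int64_t now_us,
                               int64_t max_silence_us)
{
    w->last_event_us = now_us;
    w->max_silence_us = max_silence_us;
}

/* Call whenever a scan result (e.g. ESP_GAP_SEARCH_INQ_RES_EVT) arrives. */
static void scan_watchdog_feed(scan_watchdog_t *w, int64_t now_us)
{
    w->last_event_us = now_us;
}

/* Poll periodically; true means no scan result within the allowed window. */
static bool scan_watchdog_expired(const scan_watchdog_t *w, int64_t now_us)
{
    return (now_us - w->last_event_us) > w->max_silence_us;
}
```

This approach only suits deployments where advertisements are expected regularly. In a quiet radio environment, a long silence is normal and the watchdog would trigger falsely, so the threshold has to be chosen per installation.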
Same here, even switched to entirely different SoC for a large-volume product because of this.
Hi @vbvchauthmal,
I will try to recreate the issue. I tried the same earlier for @GianlucaLoi, but before I could recreate it, changing some parameters helped them. Please share some more details about your failing scenario so that I can reproduce it.
Also please provide any other information you feel which could be helpful to recreate or solve this issue.
Hi @chhajedji
* Which commit id are you using for your application?
I am using ESP-IDF version v3.3.5 (commit id : 03810c4a065a1ecdd24a803b2c9dc4e834c7dab5) after your suggestion, earlier I was using v3.3.1 (commit-id : 143d26aa49df524e10fb8e41a71d12e731b9b71d )
* What exactly are you doing in your application (scanning/advertising/both, anything else also) and what are the parameters for the same (scan params, adv params)?
I am doing BLE scanning and below are the scan parameters set in my source code :
ble_scan_params = {
    .scan_type = BLE_SCAN_TYPE_ACTIVE,
    .own_addr_type = BLE_ADDR_TYPE_PUBLIC,
    .scan_filter_policy = BLE_SCAN_FILTER_ALLOW_ALL,
    .scan_interval = 0xF0, // Interval between the start of two consecutive scan windows. Dec(0xF0) = 240 x 0.625 = 150ms
    .scan_window = 0xF0    // The duration in which the Link Layer scans on one channel. Dec(0xF0) = 240 x 0.625 = 150ms
};
* Which idf example will most closely resemble your application and what are the changes in it to emulate your scenario. Or if you can share your application, that would be better.
The idf example that most closely resembles my application is gattc_multi_connect. My application extends it by setting BLE GAP security parameters and supporting five BLE peripheral devices. Only one BLE peripheral is allowed to connect at a time, when its broadcast is captured, to get sensor readings through BLE notifications/indications.
* How many devices are there in the vicinity and what are they doing (how many advertisers and scanners nearby)?
So far we have deployed 6000 units of our ESP32-based platform with this firmware. Each will have a different number of BLE devices in its vicinity, which can be advertisers or scanners. Most of the deployed units show this issue.
* How long does this issue take to occur for your case and does this time vary?
It occurs at random: sometimes it appears after a week, and sometimes it takes only a few minutes or hours.
Also please provide any other information you feel which could be helpful to recreate or solve this issue.
Query :
We are facing the same issue. We are working with the 4.2.1 release; we also tried the v4.3-beta3 tag and the v4.4-dev tag, and the issue is there as well. When the issue occurs, the BT radio status looks good and no error is reported, so software cannot detect it. We tried many options to reset BT only, but scanning is not operational after that. Only esp_restart() recovers it, but we cannot use that because our app cannot lose the BT radio for more than 1 minute (running esp_restart() every minute is not an option for us!). This issue will kill our project.
@Rokachy Can you please try with the latest v4.3 release? We did a test and did not reproduce the issue. We are still testing with mass devices on the same.
Yes, I will, and I will let you know the results. It is a random issue; it can take anywhere from a few hours to a few days to appear. Is there anything I can read or any status I can get from the device to help debug it? Do you want me to enable anything other than logs?
Yehuda
@Rokachy Yes, it is a random issue. I'm afraid there is no need to enable anything at the moment; the best would be a packet capture. Please try with the latest v4.3 first. Thanks.
It has been running with the v4.3 release for a few days with no issue so far :-) We will continue to run it for a few more days. What was fixed in SDK v4.3?
@TianaESP Any chances of this (possible) fix being backported to v3.3?
@0neblock The fixes were backported to v3.3, v4.0, v4.1, v4.2. Please try the latest v3.3. Thanks.
@Rokachy We fixed bugs in modem sleep that we suspected were contributing to the problem. Please let us know if the issue happens again. Thanks.
Thank you for the support, we are using release 4.3 and we didn't see the issue for the last 2 weeks. I hope it stays like this.
Thank you for the good work. Yehuda
> The fixes were backported to v3.3, v4.0, v4.1, v4.2. Please try the latest v3.3.

Is the fix backported to version 4.0.3 too?
Thanks for sharing updates. The fix has already been backported to release/4.0 https://github.com/espressif/esp-idf/commit/b89b1ec0223d4f021ce401a8f85e751dd67c8358, thanks.
I have observed an issue, which might relate to this. At least the result is the same: BLE stack stops to work properly...
In my situation, I am scanning for BLE advertisements, and at some point in time, the scan stops. It typically happens when ESP32 is busy (e.g. writing a lot of information to Debug Console).
I enabled "CONFIG_BLE_HOST_QUEUE_CONGESTION_CHECK", which helps a lot, and actually shows, that BTU queue "often" has congestion.
But I also observed (easy to reproduce by changing "BT_QUEUE_CONGEST_SIZE" to 20 in file "bt_common.h") that when congestion occurs, it actually locks the BLE stack completely. Sometimes it does not recover, and I believe this is caused by the "hciH4T" task having higher priority and a lot of events to process.
If I allow "hciH4T", "btuT" and "BTC_TASK" tasks to use same priority, I do not see this lockup. Perhaps someone from Espressif (@TianaESP) could confirm the issue?
Hi! Any update about this? I noticed this same issue when using BLE + Classic (v4.4-beta1). Thanks!
edit: v4.3.1 is also affected
Brief
I have been having a problem with the Bluedroid BT Controller Scanning function for a few weeks now, and after trying many different things, I am stuck and am not sure what else I can try. The crux of the issue is that the BLE Scanning feature will work for a large amount of time - up to 3 days, then just fail silently, with the whole BT controller seemingly shutting down.
Problem Description
I have a BLE Scanning app that is working well for the most part. It spends most of its time performing an active scan for other BLE devices that are advertising a service UUID and some custom manufacturer data. It receives an advertisement from a sensor around every 1 second, but I can have anywhere from 1-10 sensors within range at any one time.
After a completely random period of time, sometimes 20 minutes, sometimes 3 days, the app will stop receiving ESP_GAP_SEARCH_INQ_RES_EVT events from the bt layer, even though it should still be receiving advertisements from multiple devices, with no indication from any underlying BT Controller debugging that anything has happened. This happens no matter how many sensors I have advertising within range of the ESP. It even happens when I have no sensors advertising and the general BLE background advertisements are relatively low.
The free heap memory of the app stays the same (~140kB free memory at any one time), so I can rule out a memory leak on the app side, and the rest of the application keeps running normally, albeit with more computation time from the RTOS (indicated by a loop counter that increases when this error happens). So clearly some of the BT tasks have stopped running.
When the error happens, I can also see that the ESP itself DOES STOP performing active scanning, as the sensors I use flash an LED whenever they receive a SCAN_REQUEST from the ESP32 hardware MAC address, and this stops happening as soon as the error starts.
If I try to recover from the error by issuing a command such as esp_ble_gap_start_scanning(), which returns ESP_OK, I get an HCI timeout error printed: BT_HCI: command_timed_out hci layer timeout waiting for response to a command. opcode: 0x200c. At the moment, trying to perform a BT command after the error and getting this response is the only indication available to the application that something has gone wrong.
I am not using any WiFi functions, so to reduce memory footprint and file size, I have changed the linker script to only include the following libraries in the component.mk of esp32:
core rtc phy
instead of the usual:
core rtc net80211 pp wpa smartconfig coexist wps wpa2 espnow phy mesh
Coredump
coredump
This is a coredump taken about 20 minutes after the error occurred. I forced this core dump by deliberately throwing an IntegerDivideByZero exception in another task. My hope is that it saved the state of the BT tasks, which your team can analyse internally. If you require my app ELF, I can provide it by email.
Debug Log
This is a log showing the lack of errors I receive when the error happens. As you can see, the application was running for 2.5 days before the error occurred. The 'BMS' TAG is my application, and the 'Scanning started' and 'scanning stopped' logs are when my app receives the ESP_GAP_SEARCH_INQ_CMPL_EVT and ESP_GAP_BLE_SCAN_START_COMPLETE_EVT events respectively. In this application, I start an esp_ble_gap_start_scanning operation of 30 seconds, and when I receive an ESP_GAP_SEARCH_INQ_CMPL_EVT event, I set a flag to restart esp_ble_gap_start_scanning for 30 seconds, in a cycle. As discussed later, I have tried changing this interval to anywhere from 30 seconds to 5 minutes, and I have also tried setting it to 0 for unlimited, so that I only call the start_scan once. In this instance, my application received the ESP_GAP_SEARCH_INQ_CMPL_EVT event and set a flag internally to call esp_ble_gap_start_scanning(30) again, which returned ESP_OK, but I never received the ESP_GAP_BLE_SCAN_START_COMPLETE_EVT, and about 8 seconds later I see a command timeout error log.
sdkconfig
sdkconfig
Scanning Configuration Used
These are the configurations currently in use, but as you'll see below I have tried many different
Changes Attempted
Below is a list of sdkconfig changes or application setup/operation changes that I have tried with no success; the same thing occurs.
If there is anything else I should try, please let me know.
Apologies for the large Github issue, this error has been troubling me for some time and I would like to know what I can try next. Thank you.
Environment