Nimble ble_prph unstable after disconnection. (IDFGH-9670)

embedded-sm commented 1 year ago

Answers checklist.

[X] I have read the documentation ESP-IDF Programming Guide and the issue is not addressed there.
[X] I have updated my IDF branch (master or release) to the latest version and checked that the issue is present there.
[X] I have searched the issue tracker for a similar issue and not found a similar issue.

General issue report

IDF version: git describe --tags v5.0.1

DevKit ble_prph: ESP32-Ethernet-Kit DevKit ble_cent: ESP32-C3-DevKit

source and sdkconfig of the 2 project to test [edit version of the example from eps-idf]: ble_cent.zip ble_phpr.zip

How to reproduce: move the Kit away and then move near each other a few times, until the connection is lost, after some time ble_prph start to lose messages, and the only way to recover is to reboot it. (reboot ble_cent NOT solve the issue)

to easy simulate the missing packet / disconnection due to RF on ble_cent the RF power has been reduced.

Log when the problem is present:

ble_cent:

I (3348) NimBLE_BLE_CENT: ble_timer_callback[2]
I (3348) NimBLE: GATT procedure initiated: write; 
I (3348) NimBLE: att_handle=27 len=508
I (3848) NimBLE_BLE_CENT: ble_timer_callback[3]
I (3848) NimBLE: GATT procedure initiated: write; 
I (3848) NimBLE: att_handle=27 len=508
I (4348) NimBLE_BLE_CENT: ble_timer_callback[4]
I (4348) NimBLE: GATT procedure initiated: write; 
I (4348) NimBLE: att_handle=27 len=508
I (4848) NimBLE_BLE_CENT: ble_timer_callback[5]
I (4848) NimBLE: GATT procedure initiated: write; 
I (4848) NimBLE: att_handle=27 len=508
I (5348) NimBLE_BLE_CENT: ble_timer_callback[6]
E (5348) NimBLE_BLE_CENT: ble_gattc_write_flat[6]
I (5848) NimBLE_BLE_CENT: ble_timer_callback[7]
E (5848) NimBLE_BLE_CENT: ble_gattc_write_flat[6]
I (6348) NimBLE_BLE_CENT: ble_timer_callback[8]
E (6348) NimBLE_BLE_CENT: ble_gattc_write_flat[6]
I (6848) NimBLE_BLE_CENT: ble_timer_callback[9]
I (6848) NimBLE: GATT procedure initiated: write; 
I (6848) NimBLE: att_handle=27 len=508
I (7348) NimBLE_BLE_CENT: ble_timer_callback[10]
E (7348) NimBLE_BLE_CENT: ble_gattc_write_flat[6]
I (7848) NimBLE_BLE_CENT: ble_timer_callback[11]
E (7848) NimBLE_BLE_CENT: ble_gattc_write_flat[6]
I (8348) NimBLE_BLE_CENT: ble_timer_callback[12]
E (8348) NimBLE_BLE_CENT: ble_gattc_write_flat[6]
I (8848) NimBLE_BLE_CENT: ble_timer_callback[13]
I (8848) NimBLE: GATT procedure initiated: write; 
I (8848) NimBLE: att_handle=27 len=508
I (9348) NimBLE_BLE_CENT: ble_timer_callback[14]
E (9348) NimBLE_BLE_CENT: ble_gattc_write_flat[6]
I (9848) NimBLE_BLE_CENT: ble_timer_callback[15]
E (9848) NimBLE_BLE_CENT: ble_gattc_write_flat[6]
I (10348) NimBLE_BLE_CENT: ble_timer_callback[16]
E (10348) NimBLE_BLE_CENT: ble_gattc_write_flat[6]
I (10848) NimBLE_BLE_CENT: ble_timer_callback[17]
I (10848) NimBLE: GATT procedure initiated: write; 
I (10848) NimBLE: att_handle=27 len=508
I (11348) NimBLE_BLE_CENT: ble_timer_callback[18]
E (11348) NimBLE_BLE_CENT: ble_gattc_write_flat[6]
I (11848) NimBLE_BLE_CENT: ble_timer_callback[19]
E (11848) NimBLE_BLE_CENT: ble_gattc_write_flat[6]
I (12348) NimBLE_BLE_CENT: ble_timer_callback[20]
E (12348) NimBLE_BLE_CENT: ble_gattc_write_flat[6]
I (12848) NimBLE_BLE_CENT: ble_timer_callback[21]
I (12848) NimBLE: GATT procedure initiated: write; 
I (12848) NimBLE: att_handle=27 len=508
I (13348) NimBLE_BLE_CENT: ble_timer_callback[22]
E (13348) NimBLE_BLE_CENT: ble_gattc_write_flat[6]
I (13848) NimBLE_BLE_CENT: ble_timer_callback[23]
E (13848) NimBLE_BLE_CENT: ble_gattc_write_flat[6]
I (14348) NimBLE_BLE_CENT: ble_timer_callback[24]
E (14348) NimBLE_BLE_CENT: ble_gattc_write_flat[6]
I (14848) NimBLE_BLE_CENT: ble_timer_callback[25]
I (14848) NimBLE: GATT procedure initiated: write; 
I (14848) NimBLE: att_handle=27 len=508

ble_prph:

I (907079) NimBLE: gatt_svr_chr_access_sec_test
I (907079) NimBLE: gatt_svr_chr_write[508], rc=0, data=0
I (909109) NimBLE: gatt_svr_chr_access_sec_test
I (909109) NimBLE: gatt_svr_chr_write[508], rc=0, data=1
I (911079) NimBLE: gatt_svr_chr_access_sec_test
I (911079) NimBLE: gatt_svr_chr_write[508], rc=0, data=2
I (913079) NimBLE: gatt_svr_chr_access_sec_test
I (913079) NimBLE: gatt_svr_chr_write[508], rc=0, data=3
I (915149) NimBLE: gatt_svr_chr_access_sec_test
I (915149) NimBLE: gatt_svr_chr_write[508], rc=0, data=4
I (917109) NimBLE: gatt_svr_chr_access_sec_test
I (917109) NimBLE: gatt_svr_chr_write[508], rc=0, data=5
I (919109) NimBLE: gatt_svr_chr_access_sec_test
I (919109) NimBLE: gatt_svr_chr_write[508], rc=0, data=9
I (921129) NimBLE: gatt_svr_chr_access_sec_test
I (921129) NimBLE: gatt_svr_chr_write[508], rc=0, data=13
I (923129) NimBLE: gatt_svr_chr_access_sec_test
I (923129) NimBLE: gatt_svr_chr_write[508], rc=0, data=17
I (925109) NimBLE: gatt_svr_chr_access_sec_test
I (925109) NimBLE: gatt_svr_chr_write[508], rc=0, data=21

rahult-github commented 1 year ago

Hi @embedded-sm ,

The bleprph / blecent examples are based on upstream nimble examples which demonstrate API usage. Since this is more related to performance testing, i think bleprph may not be the correct example in its current form. May i request you to please share OTA log , we will try to analyze it.

embedded-sm commented 1 year ago

i start from bleprph / blecent examples because is more similar to our project and easy to reproduce the same behaviours, i added a timer task to write periodically to a characteristic.

the timer is set to 500ms so the issue is not strict related to the performance with high data bandwidth, as you can see from my log it lose messages for about 1-2seconds, ( 3 message in a row )

what you mean exactly with OTA log? i can do other test and collect more data, but in preparation of testing your solution, can you also test the problem at your side?

embedded-sm commented 1 year ago

ACL_Prints.patch

extra log with the above patch: ble_cent.log ble_prph.log

after ble_prph reboot, the communication back to normal: ble_cent_OK.log ble_prph_OK.log

as proof that the packet is not lost for the weak signal

embedded-sm commented 1 year ago

it's quite obvious, but the ble_cent become unstable also with only random reset of ble_prph

SumeetSingh19 commented 1 year ago

Hi @embedded-sm , As seen in the logs on the central side, after writing a few packets, the central observers an error during write operation, it seems to be a memory issue. (See last line)

After the central host receives response from the peripheral, it can send new data to controller, but the central is sending data before the peripheral is responding. (See Errors after 7, 8, and 9, but no error after receiving ACL response from peripheral)

Since your use case involves sending lots of data continuously, can you please increasing the buffer count and msys count to the maximum your system can support and try? As seen in logs, data does get freed and new data is being sent, but packets are getting lost due to non-availability of buffer. So if we queue buffers, it will assist to send the data once buffer is available?

embedded-sm commented 1 year ago

I try right now, i let you know soon, but from how you write, it seems that this solution has not been tested by you? meanwhile I remind you that if I restart ble_cent, the problem remains, only way to recover from this state is to reboot the ble_prph. furthermore a single message every 500ms it not "sending lots of data continuously" as you said

embedded-sm commented 1 year ago

increasing these on ble_cent not solve the issue, i notice the exact same behaviour (did you test it?):

CONFIG_BT_NIMBLE_MSYS_1_BLOCK_COUNT
CONFIG_BT_NIMBLE_MSYS_2_BLOCK_COUNT
CONFIG_BT_NIMBLE_ACL_BUF_COUNT

instead increasing the same buffer on ble_prph look that the frequency to reproduce the issue is reduced, but happen anyhow (especially in our application) the probability that the problem occurs is inversely proportional to the size of the buffers count in ble_prph

the data traffic is very low, so it is strange that is required so many elements in the buffer, and it is strange that the problem occurs only after disconnection ( RF weak or ble_cent reset ) , for my opinion there is some memory leak or something hangs after disconnect

embedded-sm commented 1 year ago

other clue: did you notice the weird time between ble_cent and ble_prph? the first messages on ble_cent are sent successfully with correct period( 500mS ) but ble_prph receive it with 2s delay between each others

SumeetSingh19 commented 1 year ago

Hi @embedded-sm, Thanks for sharing your observations. We too debugged this more and observed that the problem occurs due to the fact that host on central is pumping data at a rate faster than the speed at which buffers are getting freed.

To figure out the cause of the delay, we put timestamps and checked more. The issue is that during the first connection instance, packet is getting allocated on central and response is coming back from peripheral within 0.3 to 0.4 secs. This is enough, as central is sending packets every 0.5 secs.

But the problem happens after the first disconnection. In this case, packets take close to 2 secs for freeing up. The remote host actually processes the packet within 0.2 secs , but the total time for freeing up at central is close to 2 seconds.

We are trying to debug where this delay is coming for. Solutions like increasing threshold value to free up buffers quickly , bumping the priority of thread processing and changing the pool for buffer allocation helped, but this is not completely resolving the issue.

We will continue to root cause the reason for the large ( close to 2 sec) delay and that we think should help us identify the exact problem. Will keep the findings posted on this thread.

Btw, one query, which can ensure our testing sequence is same as yours, when you start observing the issue in your testing setup, are the devices still being moved around? OR once the disconnection happens, then the devices are brought back within range and kept stationary and still issue is observed in your original test setup?

embedded-sm commented 1 year ago

Btw, one query, which can ensure our testing sequence is same as yours, when you start observing the issue in your testing setup, are the devices still being moved around? OR once the disconnection happens, then the devices are brought back within range and kept stationary and still issue is observed in your original test setup?

confirm, when the ble_prph start to have problem, both device will keep nearby and stationary. when the problem occurs none activity on ble_cent bring ble_prph to normal behaviours, the only way is to reboot the ble_prph

as i wrote in previous comment: you can replicate the problem only with a reset of the ble_cent (press RST button at random time) and the ble_prph start to have the same issue as moving it out of range , you can easy replicate the issue from the desk, these are a clear clues that something wrong happening with unexcepted stop of the communication

SumeetSingh19 commented 1 year ago

We debugged this issue more and took OTA logs. Below is the observation and debugging so far.

We captured OTA logs and in logs after the first disconnection, we indeed were able to see the delay. However, in our testing we are observing that ESP32C3 keeps retransmitting the packet for a longer duration, indicating that it did not receive the response from ESP32.
For testing, we replaced ESP32 with esp32c3 (i.e. esp32c3 on both side) and in our test, for same application, observed no such issue. Hence, this is may be tied up with ESP32.
We took timestamps at host, and observe that host is processing and sending data within permissible time ( 0.2 - 0.3 secs ). So this doesn't seem to be issue from host.

We are looping in concerned team members to check on this issue on priority. Attached are our logs for reference. Please note that the logs contain extra prints necessary for debugging. The ACL print corresponds to the packets received by the host, and the At: print corresponds to the time since boot that the data was received. These are the files:

11011_Cent.log 11011_Prph.log 11011_OTA_Ellisys.zip

Note: You will need the Ellisys packet analyser to view the OTA logs.

May we request you to please share your OTA logs (if there is no BPA or ellisys, even wireshark logs at your end are ok) for the test? This will help us match our logs with yours and ensure we are on the same page.

SumeetSingh19 commented 1 year ago

Hi @embedded-sm, Can you please confirm if disabling BT_NIMBLE_HS_FLOW_CTRL (menuconfig->Component Config->Bluetooth->NimBLE Options->Enable Host Flow Control) on ESP32 (peripheral) side helps resolve the issue at your end, since it resolves the issue at our end.

embedded-sm commented 1 year ago

on the example and with devkit look that the workaround to remove the flow control prevent the issue. now we are testing in our board and project.

will be helpful if you provide some deep documentation about the flow control on nimble, so we can evaluate the behaviour of having it disabled.

SumeetSingh19 commented 1 year ago

Hi @embedded-sm,

will be helpful if you provide some deep documentation about the flow control on nimble, so we can evaluate the behaviour of having it disabled.

Our change is to disable host to controller flow control. However, please note that controller to host flow control is still operational and it will ensure transfer of data happens smoothly. Please refer to BLUETOOTH CORE SPECIFICATION Version 5.3 | Vol 4, Part E, 4. HCI FLOW CONTROL for in-depth documentation.

Regarding the issue pointed by you where the reconnect is taking upto 2 seconds, we haven't seen this issue, but we will continue to reproduce. Can you please provide console and wireshark logs for assistance.

SumeetSingh19 commented 1 year ago

Hi @embedded-sm, Sincere apologies, the fact that flow control from one side is enabled is true, however, the one we proposed to disable is controller to host flow control and not host to controller flow control. Hence, host to controller is still enabled and will ensure that data transfer happens smoothly.

Also, from your sdkconfig, it can be seen that you have enabled flow control in both esp32 (peripheral) and esp32c3 (central), but by default we keep flow control on esp32c3 disabled. Hence I request you to disable flow control on esp32c3 as well.

rahult-github commented 8 months ago

Closing issue. Feel free to reopen for further queries.

espressif / esp-idf

Nimble ble_prph unstable after disconnection. (IDFGH-9670) #11011

Answers checklist.

General issue report