Closed by embedded-sm 8 months ago.
Hi @embedded-sm ,
The bleprph / blecent examples are based on the upstream NimBLE examples, which demonstrate API usage. Since this issue is more related to performance testing, I think bleprph may not be the right example in its current form. May I request you to please share the OTA log; we will try to analyze it.
I started from the bleprph / blecent examples because they are similar to our project and make it easy to reproduce the same behaviour; I added a timer task that writes periodically to a characteristic.
The timer is set to 500 ms, so the issue is not strictly related to performance under high data bandwidth. As you can see from my log, it loses messages for about 1-2 seconds (3 messages in a row).
What do you mean exactly by OTA log? I can run other tests and collect more data, but before testing your solution, can you also reproduce the problem on your side?
Extra logs with the above patch: ble_cent.log ble_prph.log
After a ble_prph reboot, the communication goes back to normal: ble_cent_OK.log ble_prph_OK.log
This is proof that the packets are not lost due to a weak signal.
It's quite obvious, but ble_cent also becomes unstable with only random resets of ble_prph.
Hi @embedded-sm , As seen in the logs on the central side, after writing a few packets the central observes an error during the write operation; it seems to be a memory issue. (See the last line.)
After the central host receives a response from the peripheral, it can send new data to the controller, but the central is sending data before the peripheral responds. (See the errors after 7, 8, and 9, but no error after receiving the ACL response from the peripheral.)
Since your use case involves sending lots of data continuously, could you please increase the buffer count and msys count to the maximum your system can support and try? As seen in the logs, data does get freed and new data is being sent, but packets are getting lost because no buffer is available. So if we queue buffers, it should help send the data once a buffer becomes available.
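For reference, these counts are adjusted in sdkconfig / menuconfig. The fragment below is only an illustration: the exact Kconfig symbol names and sensible maximum values vary between IDF versions, so verify them under Component config -> Bluetooth -> NimBLE Options in your own menuconfig before applying.

```
# Illustrative sdkconfig fragment -- symbol names and values are assumptions,
# check menuconfig (Component config -> Bluetooth -> NimBLE Options) for your
# IDF version before copying.
CONFIG_BT_NIMBLE_ACL_BUF_COUNT=24
CONFIG_BT_NIMBLE_MSYS1_BLOCK_COUNT=24
```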
I'll try right now and let you know soon, but from the way you write, it seems this solution has not been tested by you? Meanwhile, I remind you that if I restart ble_cent the problem remains; the only way to recover from this state is to reboot ble_prph. Furthermore, a single message every 500 ms is not "sending lots of data continuously", as you said.
Increasing these on ble_cent does not solve the issue; I notice the exact same behaviour (did you test it?).
Increasing the same buffers on ble_prph seems to reduce how often the issue reproduces, but it still happens (especially in our application); the probability that the problem occurs is inversely proportional to the buffer count on ble_prph.
The data traffic is very low, so it is strange that so many buffer elements are required, and it is strange that the problem occurs only after a disconnection (weak RF or ble_cent reset). In my opinion there is a memory leak, or something hangs after the disconnect.
Another clue: did you notice the weird timing between ble_cent and ble_prph? The first messages on ble_cent are sent successfully with the correct period (500 ms), but ble_prph receives them with a 2 s delay between each other.
Hi @embedded-sm, Thanks for sharing your observations. We debugged this further too and observed that the problem occurs because the host on the central is pumping data at a rate faster than the rate at which buffers are freed.
To figure out the cause of the delay, we added timestamps and checked further. During the first connection, a packet is allocated on the central and the response comes back from the peripheral within 0.3 to 0.4 s. This is enough, since the central sends packets every 0.5 s.
But the problem appears after the first disconnection. In this case, packets take close to 2 s to be freed. The remote host actually processes a packet within 0.2 s, but the total time to free it on the central is close to 2 seconds.
We are trying to debug where this delay comes from. Measures such as tuning the threshold value to free up buffers more quickly, bumping the priority of the processing thread, and changing the pool used for buffer allocation helped, but did not completely resolve the issue.
We will continue to root-cause the large (close to 2 s) delay, which we think should identify the exact problem. We will keep posting our findings on this thread.
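The arithmetic above can be sketched with a toy simulation (this is NOT NimBLE code; the function name, pool size, and timings are all illustrative assumptions): with one write every 500 ms, a fixed-size buffer pool only keeps up while the free delay stays below pool_size × period, which is why a ~0.4 s completion time works and a ~2 s one drops packets.

```c
#include <stdbool.h>

/* Toy simulation (not NimBLE code): a write is attempted every
 * write_period_ms; each sent buffer is only freed free_delay_ms after it
 * was handed to the stack. Returns how many writes failed because no
 * buffer was available. pool_size must be <= 64. */
int simulate(int pool_size, int write_period_ms, int free_delay_ms,
             int duration_ms) {
    int free_at[64];          /* pending release times of in-flight buffers */
    int in_flight = 0, failed = 0;

    for (int t = 0; t <= duration_ms; t += write_period_ms) {
        /* release every buffer whose completion has arrived by time t */
        int kept = 0;
        for (int i = 0; i < in_flight; i++)
            if (free_at[i] > t)
                free_at[kept++] = free_at[i];
        in_flight = kept;

        if (in_flight < pool_size)
            free_at[in_flight++] = t + free_delay_ms;  /* write succeeds */
        else
            failed++;                                  /* pool exhausted */
    }
    return failed;
}
```

With the timings reported in this thread, `simulate(3, 500, 400, 10000)` loses nothing, while `simulate(3, 500, 2000, 10000)` drops writes periodically, matching the "3 messages in a row" losses seen earlier.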
By the way, one query, to make sure our testing sequence matches yours: when you start observing the issue in your setup, are the devices still being moved around? Or, once the disconnection happens, are the devices brought back within range and kept stationary, and the issue is still observed in your original test setup?
Confirmed: when ble_prph starts to have the problem, both devices are kept nearby and stationary. When the problem occurs, no activity on ble_cent brings ble_prph back to normal behaviour; the only way is to reboot ble_prph.
As I wrote in a previous comment, you can replicate the problem simply by resetting ble_cent (press the RST button at random times): ble_prph then starts to show the same issue as when moving it out of range. You can easily replicate the issue from the desk. These are clear clues that something goes wrong on an unexpected stop of the communication.
We debugged this issue further and took OTA logs. Below are our observations and debugging so far.
We are looping in the concerned team members to check this issue on priority. Attached are our logs for reference. Please note that the logs contain extra prints needed for debugging: the ACL print corresponds to the packets received by the host, and the At: print corresponds to the time since boot at which the data was received.
These are the files:
11011_Cent.log 11011_Prph.log 11011_OTA_Ellisys.zip
Note: You will need the Ellisys packet analyser to view the OTA logs.
May we request you to please share your OTA logs for the test (if you have no BPA or Ellisys, even Wireshark logs at your end are OK)? This will help us match our logs with yours and ensure we are on the same page.
Hi @embedded-sm, Can you please confirm whether disabling BT_NIMBLE_HS_FLOW_CTRL (menuconfig -> Component Config -> Bluetooth -> NimBLE Options -> Enable Host Flow Control) on the ESP32 (peripheral) side resolves the issue at your end? It resolves the issue at ours.
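For anyone following along, the same change can be applied directly in sdkconfig; the symbol name below is taken from the menuconfig path above, applied on the peripheral's project:

```
# Peripheral sdkconfig: disable NimBLE host flow control
# CONFIG_BT_NIMBLE_HS_FLOW_CTRL is not set
```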
On the example, with the devkits, the workaround of removing flow control seems to prevent the issue. We are now testing on our own board and project.
It would be helpful if you could provide some in-depth documentation about flow control in NimBLE, so we can evaluate the implications of having it disabled.
Hi @embedded-sm,
It would be helpful if you could provide some in-depth documentation about flow control in NimBLE, so we can evaluate the implications of having it disabled.
Our change is to disable host-to-controller flow control. However, please note that controller-to-host flow control is still operational and will ensure that data transfer happens smoothly. Please refer to the Bluetooth Core Specification Version 5.3, Vol 4, Part E, Section 4 ("HCI Flow Control") for in-depth documentation.
Regarding the issue you pointed out, where the reconnect takes up to 2 seconds: we haven't seen this issue, but we will keep trying to reproduce it. Can you please provide console and Wireshark logs to assist?
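As a rough illustration of what that spec section describes: HCI flow control is credit-based. The sending side may only hand over ACL packets while the receiving side reports free buffers, and credits are returned via the Number Of Completed Packets event. The sketch below is a toy model under those spec semantics; the type and function names are invented for illustration and are not NimBLE APIs.

```c
#include <stdbool.h>

/* Toy model of HCI credit-based flow control (Core Spec Vol 4, Part E,
 * Section 4). Names are illustrative, not NimBLE APIs. */
typedef struct {
    int credits;  /* free ACL buffers reported by the receiving side */
} hci_link_t;

/* Try to hand one ACL packet over; the sender must hold a credit. */
bool hci_try_send(hci_link_t *link) {
    if (link->credits == 0)
        return false;     /* no buffer free: packet must be queued, not sent */
    link->credits--;      /* one peer buffer is now occupied */
    return true;
}

/* Number Of Completed Packets event: the peer freed n buffers. */
void hci_num_completed_packets(hci_link_t *link, int n) {
    link->credits += n;   /* credits returned; queued packets may go out */
}
```

When flow control is disabled on one side, this back-pressure disappears in that direction, so the sender relies entirely on the other direction's flow control and its own buffer pool, which connects back to the buffer-exhaustion symptoms above.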
Hi @embedded-sm, Sincere apologies: it is true that flow control on one side is enabled; however, the one we proposed to disable is controller-to-host flow control, not host-to-controller flow control. Hence, host-to-controller flow control is still enabled and will ensure that data transfer happens smoothly.
Also, from your sdkconfig it can be seen that you have enabled flow control on both the ESP32 (peripheral) and the ESP32-C3 (central), but by default we keep flow control on the ESP32-C3 disabled. Hence I request you to disable flow control on the ESP32-C3 as well.
Closing issue. Feel free to reopen for further queries.
Answers checklist.
General issue report
IDF version (git describe --tags): v5.0.1
DevKit ble_prph: ESP32-Ethernet-Kit
DevKit ble_cent: ESP32-C3-DevKit
Source and sdkconfig of the two projects to test (edited versions of the esp-idf examples): ble_cent.zip ble_phpr.zip
How to reproduce: move the kits away from each other and then back near each other a few times until the connection is lost. After some time, ble_prph starts to lose messages, and the only way to recover is to reboot it (rebooting ble_cent does NOT solve the issue).
To easily simulate missing packets / disconnection due to RF, the RF power on ble_cent has been reduced.
Log when the problem is present:
ble_cent:
ble_prph: