espressif / esp-idf

Espressif IoT Development Framework. Official development framework for Espressif SoCs.
Apache License 2.0
13.81k stars 7.32k forks source link

ble_gattc_disc_all_svcs() fails with BLE_HS_ETIMEOUT (IDFGH-13785) #14644

Open malachib opened 1 month ago

malachib commented 1 month ago

Answers checklist.

IDF version.

v5.3.1

Espressif SoC revision.

Chip is ESP32-C6FH4 (QFN32) (revision v0.1)

Operating System used.

Linux

How did you build your project?

Command line with idf.py

If you are using Windows, please specify command line type.

None

Development Kit.

Seeed ESP32C6

Power Supply used.

USB

What is the expected behavior?

I expect ble_gattc_disc_all_svcs to return services of the remote peripheral to which we are connecting.

What is the actual behavior?

After successfully connecting to remote peripheral, ble_gattc_disc_all_svcs delays for a while and ultimately gives error BLE_HS_ETIMEOUT

Steps to reproduce.

Full project located here: https://bitbucket.org/malachib/playground.esp/src/master/src/PGESP-65/

Steps are initialize NimBLE as usual, then initiate connection via ble_gap_connect to remote peripheral. On gap event callback connection success, then call ble_gattc_disc_all_svcs

Peripheral is specifically via a laptop running bluez on Debian

Debug Logs.

No response

More Information.

rahult-github commented 1 month ago

Hi @malachib ,

Looking at description, looks like either the command to retrieve services didn't get sent to remote OR remote has not sent any response. Can you please share:

  1. Share btmon log on bluez side when issue seen ?
  2. Any console log for nimble side ? preferably with debug enabled to check what commands are being sent ?
  3. sdkconfig used at your end.
malachib commented 1 month ago

Thank you for your suggestions. Looking at the logs, I found something peculiar:

> ACL Data RX: Handle 64 flags 0x02 dlen 11                                    #7 [hci0] 10.594181
      ATT: Read By Group Type Request (0x10) len 6
        Handle range: 0x0001-0xffff
        Attribute group type: Primary Service (0x2800)
> HCI Event: LE Meta Event (0x3e) plen 19                                      #8 [hci0] 10.596082
      LE Connection Complete (0x01)
        Status: Success (0x00)
        Handle: 64
        Role: Peripheral (0x01)
        Peer address type: Public (0x00)
        Peer address: 54:32:04:21:55:D6 (Espressif Inc.)
        Connection interval: 50.00 msec (0x0028)
        Connection latency: 0 (0x0000)
        Supervision timeout: 2560 msec (0x0100)
        Central clock accuracy: 0x04

Seems to be requesting services before the connection event could be problematic. What do you think? Complete logs are in repo and their descriptions are in README.md, unless you prefer them posted inline here (they are big)

malachib commented 1 month ago

It appears that was the trouble. ble_gattc_disc_all_svcs cannot be called while servicing BLE_GAP_EVENT_CONNECT

Two workarounds I've seen are:

  1. Inject an artificial delay before calling ble_gattc_disc_all_svcs
  2. Mandate an MTU negotiation and call ble_gattc_disc_all_svcs during BLE_GAP_EVENT_MTU

I've even seen some people use vTaskDelay while servicing BLE_GAP_EVENT_CONNECT (the horror)

Both of these are kludgey because:

  1. Injecting arbitrary delays is just a bad practice
  2. We might not really need an MTU negotiation, so we are forcing extra wire overhead sometimes

What is the recommended way to use ble_gattc_disc_all_svcs in this context?

rahult-github commented 1 month ago

What is the recommended way to use ble_gattc_disc_all_svcs in this context?

You can refer to example here , which does invoke discovery after link has been established. Please try the same to see if it works in your case.

malachib commented 1 month ago

Fascinating, it uses BLE_GAP_EVENT_LINK_ESTAB a very new addition to the Espressif specific NimBLE fork. I'll give this a try and provide feedback. Thank you

malachib commented 1 month ago

Side note: the v5.3.1 version doesn't use or have access to BLE_GAP_EVENT_LINK_ESTAB. It initiates peer_disc_all directly on the BLE_GAP_EVENT_CONNECT event:

https://github.com/espressif/esp-idf/blob/v5.3.1/examples/bluetooth/nimble/blecent/main/main.c

malachib commented 1 month ago

@rahult-github BLE_GAP_EVENT_LINK_ESTAB approach tested and produces same ill effects as BLE_GAP_EVENT_CONNECT. Namely, it looks like discovery command still precedes connection command.

rahult-github commented 1 month ago

Hi @malachib ,

  1. The reason to point in code the way IDF examples do service discovery was to give a reference about working example for the mentioned operation.

  2. For your query about Nimble sending discovery BEFORE connection is not true. The reason is as below:

For central device to initiate connection , first host (say device a ) , sends command to controller . LE Connection complete event is posted by controller of both peripheral and central to its respective hosts.

image

Once connection complete event is received, the nimble stack generates a BLE_GAP_EVENT_CONNECT kind of event to indicate application that ACL link is completed. The procedure to start discovery happens only after this event has been received.

Another point to consider is that the hci command will be sent / received on a connection handle and the handle is created only after connection is established.

I used blecent application (with some minor changes ) and bluez on remote side, and in btmon log i do see first connection being created and after that ATT Read by Group type request seen. Attaching logs for reference.

blecent_log.txt btmon_log.txt

Given above explanation and observation, it is safe to say that nimble is not sending any hci command before acl link is created. Need to debug on remote side most probably to confirm why remote side is showing this behavoiur of sequence of packets .

malachib commented 1 month ago

Your logic tracks, though it's wild to consider that the BLE Linux stack malfunctions on that level. I'll run blecent on my side and report the results.

Please try my PGESP-65 project linked above to see if you can recreate the problem with my code base.

If the BT stack on my Linux machine really is at fault, then it sounds like you are suggesting I diagnose the BTMON tool itself. Any suggestions for how I might do that?

malachib commented 1 month ago

I'm hoping this issue has not been cast aside. I still await recommendations for diagnosing BTMON tool and your findings running PGESP-65 yourself. I'd be curious to know what your test environment looks like also.

My testing on this side so far:

My access to BT capable Linux devices is limited, working on getting an Android app running to test things that way.

rahult-github commented 1 month ago

I'm hoping this issue has not been cast aside.

No . Sorry for delay from our side, as we are taken up with some pre-committed tasks. Thanks for testing at your end and since you observe issue with even blecent application, we will try to check it again at our end and update. Please give us some time.

rahult-github commented 1 month ago

Hi @malachib ,

I spent some time to look at your application and i didn't find anything visually incorrect. I ran the example on C6 against release/v5.3 branch and it worked.

I updated your application to initiate disconnect and redo connection and keep doing this. Am attaching a log for btmon and console, where i do see discovery happening at all times.

Can you please have a look at the application file and ensure that i am using the same settings as yours ( i.e. there is no setup or incorrect application code issue ) , since looking at this log, i don't see any timeout error.

nimble_console.txt btmon_log.txt ble_cpp.txt [ have renamed it just for uploading purpose ]

malachib commented 1 month ago

Will do I will have a look at these and get back to you. Thank you

UPDATE 01NOV24

Still working on getting my testing up over here