espressif / esp-idf

Espressif IoT Development Framework. Official development framework for Espressif SoCs.
Apache License 2.0
13.75k stars 7.3k forks source link

HCI: Create Connection request is not properly terminated which blocks BlueZ from reconnecting (IDFGH-6082) #7763

Open danergo opened 3 years ago

danergo commented 3 years ago

Environment

Development Kit: NodeMCU-ESP-32S Module or chip used: ESP-WROOM-32 IDF version (run git describe --tags to find it): v4.4-dev-2533-g83956ebbae Build System: idf.py Compiler version (run xtensa-esp32-elf-gcc --version to find it): xtensa-esp32-elf-gcc (crosstool-NG esp-2021r1) 8.4.0 Operating System: Linux Using an IDE?: No Power Supply: external 5V

Problem Description

I'm using ESP32 as a Bluetooth controller with the ESP-IDF provided hciuart sample: https://github.com/espressif/esp-idf/tree/master/examples/bluetooth/hci/controller_hci_uart_esp32, and BlueZ as host.

Whenever my application tries to establish a connection by sending a Create Connection, and I abort it early by sending a Create Connection Cancel HCI request, ESP32 won't respond to the Create Connection, which blocks BlueZ from sending out any other new Create Connection (to the same peer).

Only way to escape from this is to completely reset the HCI device.

Expected Behavior && Actual Behavior && Steps to reproduce

I have collected logs from a correctly working (Cypress) device to make it easier for you to fix this. I'm using here a fake Bluetooth peer address to make 100% sure it is not pingable and for also making sure this problem doesn't depend at all on any particular peer device.

1st image demonstrates the Page Timeout in case I don't early-abort the connection trial. Left side is from Cypress, right side is ESP32: image

2nd image demonstrates the early-aborted connection trial after the 1st Page Timeout, without resetting the HCI device. Still left side is Cypress, right is ESP32. Problem starts manifesting here: while Cypress responds to the Create Connection Cancel request with also sending "Connect Complete", ESP32 does not completes the connection which confuses BlueZ (will think ESP is still trying to connect): image

3rd image demonstrates the complete mess, because you haven't closed the earlier connection by responding Connect Complete, BlueZ doesn't send the new Create Connection request at all (It still "thinks" ESP is trying to fulfill the first request): image

Due to this, whenever you try connecting to a device which accidentally or intentionally having its Bluetooth disabled, you will not be able to connect to it even after its Bluetooth gets re-enabled.

danergo commented 3 years ago

Note: this is the root-cause of https://github.com/espressif/esp-idf/issues/7723.

danergo commented 3 years ago

Anybody? Please help me guys. This bug is in your closed software part!

xiewenxiang commented 3 years ago

Hi @danergo

Thank you very much for providing such detailed analysis. We will further analyze this scenario to confirm whether it is a problem and give you a reply as soon as possible.

danergo commented 3 years ago

Hi @xiewenxiang: could you check this one, by any chance yet?

Thank you!

BetterJincheng commented 3 years ago

Hi @danergo, a fix for this issue is done and waiting for approval.

danergo commented 3 years ago

Hi @BetterJincheng: do you see by any chance how much time will this take? Or how much time an approval takes usually?

Thank you

BetterJincheng commented 3 years ago

@danergo, it is hard to evaluate... If needed, I can provide you the fixed lib in advance. Let me know

danergo commented 3 years ago

@BetterJincheng: that sounds perfect, please do that and I can continue the testing. Thank you! You can also send it to danergo@gmail.com if that suits better.

BetterJincheng commented 3 years ago

libbtdm_app.zip @danergo

  1. Replace esp-idf/component/bt/controller/lib_esp32/esp32/libbtdm_app.a with the attachment.
  2. idf.py fullclean (or just rm -rf ./build)

Let me know if it works.

danergo commented 3 years ago

@BetterJincheng:

Sorry for delays, I just had time to validate your fix now: it seems working, I'll run my usual testing method for a couple of more days and report back.

Thanks a lot!

danergo commented 2 years ago

Bad news:

After several days of constant stress-test (HCI LE Advertisement scanning, and some GATT commands, and also some raw classic BT stuff) and ESP32 crashed:

[2021-11-19 16:19:05] Guru Meditation Error: Core  0 panic'ed (LoadProhibited). Exception was unhandled.
[2021-11-19 16:19:05]
[2021-11-19 16:19:05] Core  0 register dump:
[2021-11-19 16:19:05] PC      : 0x400def1d  PS      : 0x00060130  A0      : 0x80049b9c  A1      : 0x3ffcc750
[2021-11-19 16:19:05] A2      : 0x07ffffff  A3      : 0x02fa9bd0  A4      : 0x0000000b  A5      : 0x00000001
[2021-11-19 16:19:05] A6      : 0x3ffcc7b0  A7      : 0x00000000  A8      : 0x3ffc9324  A9      : 0x3ffcc730
[2021-11-19 16:19:05] A10     : 0x0030bb00  A11     : 0x00000001  A12     : 0x00000000  A13     : 0x3ffafd68
[2021-11-19 16:19:05] A14     : 0x00000000  A15     : 0x3ffb8360  SAR     : 0x00000016  EXCCAUSE: 0x0000001c
[2021-11-19 16:19:05] EXCVADDR: 0x00000008  LBEG    : 0x4000c2e0  LEND    : 0x4000c2f6  LCOUNT  : 0x00000000
[2021-11-19 16:19:05]
[2021-11-19 16:19:05]
[2021-11-19 16:19:05] Backtrace:0x400def1a:0x3ffcc750 0x40049b99:0x3ffcc7b0 0x400457cd:0x3ffcc7e0 0x4008483c:0x3ffcc820 0x40019d11:0x3ffcc850 0x40055b4d:0x3ffcc870 0x400d5f98:0x3ffcc890 0x400d65b8:0x3ffcc8b0 0x40092951:0x3ffcc8e0

I decoded the backtrace but it is not too verbose:

0x400def1a:  r_lld_evt_update_create at ??:?
0x40049b99: ?? ??:0
0x400457cd: ?? ??:0
0x4008483c: ke_task_schedule at ke_task.c:?
0x40019d11: ?? ??:0
0x40055b4d: ?? ??:0
0x400d5f98: r_rw_schedule at ??:?
0x400d65b8: btdm_controller_task at ??:?
0x40092951: vPortTaskWrapper at /home/user/esp/esp-idf/components/freertos/port/xtensa/port.c:130

That line in port.c: pxCode(pvParameters);

Update After a short while another crash arrived:

[2021-11-19 16:39:25] ASSERT_ERR(lm_env.local_name), in lm.c at line 79
[2021-11-19 16:39:25] Guru Meditation Error: Core  0 panic'ed (IllegalInstruction). Exception was unhandled.
[2021-11-19 16:39:25] Memory dump at 0x4008ee00: f01d020c 00004136 f01d0000
[2021-11-19 16:39:25] Core  0 register dump:
[2021-11-19 16:39:25] PC      : 0x4008ee07  PS      : 0x00060130  A0      : 0x80083725  A1      : 0x3ffcc760
[2021-11-19 16:39:25] A2      : 0x3f40491c  A3      : 0x00000000  A4      : 0x00000000  A5      : 0x3f404831
[2021-11-19 16:39:25] A6      : 0x0000004f  A7      : 0xfffffffd  A8      : 0x8000814b  A9      : 0x3ffcc6d0
[2021-11-19 16:39:25] A10     : 0x00000000  A11     : 0x3ffcc6f2  A12     : 0x3ffcc69f  A13     : 0x00000039
[2021-11-19 16:39:25] A14     : 0x00000000  A15     : 0x3ffcc6a3  SAR     : 0x00000004  EXCCAUSE: 0x00000000
[2021-11-19 16:39:25] EXCVADDR: 0x00000000  LBEG    : 0x40083699  LEND    : 0x400836a1  LCOUNT  : 0x00000000
[2021-11-19 16:39:25]
[2021-11-19 16:39:25]
[2021-11-19 16:39:25] Backtrace:0x4008ee04:0x3ffcc760 0x40083722:0x3ffcc780 0x400e556e:0x3ffcc7a0 0x4002509e:0x3ffcc7c0 0x400da89a:0x3ffcc800 0x4008483c:0x3ffcc820 0x40019d11:0x3ffcc850 0x40055b4d:0x3ffcc870 0x400d5f98:0x3ffcc890 0x400d65b8:0x3ffcc8b0 0x40092951:0x3ffcc8e0

Decoded:

0x4008ee04: r_assert at /home/user/esp/esp-idf/components/bt/controller/esp32/bt.c:1991
0x40083722: r_assert_err at ??:?
0x400e556e: r_LM_GetLocalNameSeg at ??:?
0x4002509e: ?? ??:0
0x400da89a: lc_lmp_rx_handler at ??:?
0x4008483c: ke_task_schedule at ke_task.c:?
0x40019d11: ?? ??:0
0x40055b4d: ?? ??:0
0x400d5f98: r_rw_schedule at ??:?
0x400d65b8: btdm_controller_task at ??:?
0x40092951: vPortTaskWrapper at /home/user/esp/esp-idf/components/freertos/port/xtensa/port.c:130

Rest is similar to previous.

Questions: 1.) Why are almost all lines not having line numbers? 2.) Can I change anything to have more info from this backtrace? (Or I can share my binary for you in case it helps).

Thank you!

danergo commented 2 years ago

Can you have a look at it? It's very annoying and I have to do hciconfig hci0 reset once every 2-3 days constantly.

Thank you.

danergo commented 2 years ago

Now ESP32 printed "scan_evt timeout", and I also did HAVE TO do hciconfig hci0 reset in order to use it. It's very unstable.

BetterJincheng commented 2 years ago

Bad news:

After several days of constant stress-test (HCI LE Advertisement scanning, and some GATT commands, and also some raw classic BT stuff) and ESP32 crashed:

[2021-11-19 16:19:05] Guru Meditation Error: Core  0 panic'ed (LoadProhibited). Exception was unhandled.
[2021-11-19 16:19:05]
[2021-11-19 16:19:05] Core  0 register dump:
[2021-11-19 16:19:05] PC      : 0x400def1d  PS      : 0x00060130  A0      : 0x80049b9c  A1      : 0x3ffcc750
[2021-11-19 16:19:05] A2      : 0x07ffffff  A3      : 0x02fa9bd0  A4      : 0x0000000b  A5      : 0x00000001
[2021-11-19 16:19:05] A6      : 0x3ffcc7b0  A7      : 0x00000000  A8      : 0x3ffc9324  A9      : 0x3ffcc730
[2021-11-19 16:19:05] A10     : 0x0030bb00  A11     : 0x00000001  A12     : 0x00000000  A13     : 0x3ffafd68
[2021-11-19 16:19:05] A14     : 0x00000000  A15     : 0x3ffb8360  SAR     : 0x00000016  EXCCAUSE: 0x0000001c
[2021-11-19 16:19:05] EXCVADDR: 0x00000008  LBEG    : 0x4000c2e0  LEND    : 0x4000c2f6  LCOUNT  : 0x00000000
[2021-11-19 16:19:05]
[2021-11-19 16:19:05]
[2021-11-19 16:19:05] Backtrace:0x400def1a:0x3ffcc750 0x40049b99:0x3ffcc7b0 0x400457cd:0x3ffcc7e0 0x4008483c:0x3ffcc820 0x40019d11:0x3ffcc850 0x40055b4d:0x3ffcc870 0x400d5f98:0x3ffcc890 0x400d65b8:0x3ffcc8b0 0x40092951:0x3ffcc8e0

I decoded the backtrace but it is not too verbose:

0x400def1a:  r_lld_evt_update_create at ??:?
0x40049b99: ?? ??:0
0x400457cd: ?? ??:0
0x4008483c: ke_task_schedule at ke_task.c:?
0x40019d11: ?? ??:0
0x40055b4d: ?? ??:0
0x400d5f98: r_rw_schedule at ??:?
0x400d65b8: btdm_controller_task at ??:?
0x40092951: vPortTaskWrapper at /home/user/esp/esp-idf/components/freertos/port/xtensa/port.c:130

That line in port.c: pxCode(pvParameters);

Update After a short while another crash arrived:

[2021-11-19 16:39:25] ASSERT_ERR(lm_env.local_name), in lm.c at line 79
[2021-11-19 16:39:25] Guru Meditation Error: Core  0 panic'ed (IllegalInstruction). Exception was unhandled.
[2021-11-19 16:39:25] Memory dump at 0x4008ee00: f01d020c 00004136 f01d0000
[2021-11-19 16:39:25] Core  0 register dump:
[2021-11-19 16:39:25] PC      : 0x4008ee07  PS      : 0x00060130  A0      : 0x80083725  A1      : 0x3ffcc760
[2021-11-19 16:39:25] A2      : 0x3f40491c  A3      : 0x00000000  A4      : 0x00000000  A5      : 0x3f404831
[2021-11-19 16:39:25] A6      : 0x0000004f  A7      : 0xfffffffd  A8      : 0x8000814b  A9      : 0x3ffcc6d0
[2021-11-19 16:39:25] A10     : 0x00000000  A11     : 0x3ffcc6f2  A12     : 0x3ffcc69f  A13     : 0x00000039
[2021-11-19 16:39:25] A14     : 0x00000000  A15     : 0x3ffcc6a3  SAR     : 0x00000004  EXCCAUSE: 0x00000000
[2021-11-19 16:39:25] EXCVADDR: 0x00000000  LBEG    : 0x40083699  LEND    : 0x400836a1  LCOUNT  : 0x00000000
[2021-11-19 16:39:25]
[2021-11-19 16:39:25]
[2021-11-19 16:39:25] Backtrace:0x4008ee04:0x3ffcc760 0x40083722:0x3ffcc780 0x400e556e:0x3ffcc7a0 0x4002509e:0x3ffcc7c0 0x400da89a:0x3ffcc800 0x4008483c:0x3ffcc820 0x40019d11:0x3ffcc850 0x40055b4d:0x3ffcc870 0x400d5f98:0x3ffcc890 0x400d65b8:0x3ffcc8b0 0x40092951:0x3ffcc8e0

Decoded:

0x4008ee04: r_assert at /home/user/esp/esp-idf/components/bt/controller/esp32/bt.c:1991
0x40083722: r_assert_err at ??:?
0x400e556e: r_LM_GetLocalNameSeg at ??:?
0x4002509e: ?? ??:0
0x400da89a: lc_lmp_rx_handler at ??:?
0x4008483c: ke_task_schedule at ke_task.c:?
0x40019d11: ?? ??:0
0x40055b4d: ?? ??:0
0x400d5f98: r_rw_schedule at ??:?
0x400d65b8: btdm_controller_task at ??:?
0x40092951: vPortTaskWrapper at /home/user/esp/esp-idf/components/freertos/port/xtensa/port.c:130

Rest is similar to previous.

Questions: 1.) Why are almost all lines not having line numbers? 2.) Can I change anything to have more info from this backtrace? (Or I can share my binary for you in case it helps).

Thank you!

The elf file will do help. please attach here. Thanks

danergo commented 2 years ago

@BetterJincheng:

Here is it (controller_hci_uart.elf): controller_hci_uart.zip

Thank you.

BetterJincheng commented 2 years ago

libbtdm_app.zip @danergo Please try this lib to verify whether it fixed the second problem. Thanks

danergo commented 2 years ago

@BetterJincheng: just to be on the same page, which is the second for you?

The one with "IllegalInstruction"?

Thank you!

danergo commented 2 years ago

@BetterJincheng: I have not yet updated the libbtdm_app, but I get another crash, now with backtrace corrupted.

This is the only path which give some lights:

ASSERT_PARAM(2048 0), in rwbt.c at line 381

I hope you can take a look into this also.

danergo commented 2 years ago

@BetterJincheng:

With your latest version, I'm getting a lot of error messages in dmesg:

[ 9392.795165] Bluetooth: hci0: command 0x200e tx timeout
[ 9414.395439] Bluetooth: hci0: command 0x200d tx timeout
[ 9420.251516] Bluetooth: hci0: command 0x1003 tx timeout
[ 9422.267543] Bluetooth: hci0: command 0x1001 tx timeout
[ 9422.386796] Bluetooth: hci0: unexpected event for opcode 0x200d
[ 9422.387385] Bluetooth: hci0: unexpected event for opcode 0x0c03
[ 9422.430117] Bluetooth: hci0: hardware error 0x00
[ 9422.432685] Bluetooth: hci0: unexpected event for opcode 0x0c0d
[ 9429.435631] Bluetooth: hci0: command 0x1003 tx timeout
[ 9431.451666] Bluetooth: hci0: command 0x1001 tx timeout
[ 9433.467680] Bluetooth: hci0: command 0x1009 tx timeout
[ 9442.555808] Bluetooth: hci0: command 0x1003 tx timeout
[ 9444.571824] Bluetooth: hci0: command 0x1001 tx timeout
[ 9446.587851] Bluetooth: hci0: command 0x1009 tx timeout
[ 9455.612041] Bluetooth: hci0: command 0x1003 tx timeout
[ 9457.627993] Bluetooth: hci0: command 0x1001 tx timeout
[ 9459.644091] Bluetooth: hci0: command 0x1009 tx timeout
[ 9468.668156] Bluetooth: hci0: command 0x1003 tx timeout
[ 9470.684187] Bluetooth: hci0: command 0x1001 tx timeout
[ 9472.700199] Bluetooth: hci0: command 0x1009 tx timeout
[ 9481.724306] Bluetooth: hci0: command 0x1003 tx timeout
[ 9483.740321] Bluetooth: hci0: command 0x1001 tx timeout
[ 9485.756347] Bluetooth: hci0: command 0x1009 tx timeout
[ 9502.108567] Bluetooth: hci0: command 0x1003 tx timeout
[ 9504.124582] Bluetooth: hci0: command 0x1001 tx timeout
[ 9506.140609] Bluetooth: hci0: command 0x1009 tx timeout
[ 9517.116769] Bluetooth: hci0: command 0x1003 tx timeout
[ 9519.132773] Bluetooth: hci0: command 0x1001 tx timeout
[ 9521.148808] Bluetooth: hci0: command 0x1009 tx timeout
danergo commented 2 years ago

Can you please look into it soon? I'm waiting for this fix since last year. Thank you.

danergo commented 2 years ago

Ping :)

danergo commented 2 years ago

I have more logs now:

With your latest version of library you sent, I got this:

< HCI Command: Create Connection (0x01|0x0005) plen 13                     #4450 [hci0] 11:21:47.231569
        Address: AA:BB:CC:DD:EE:FF (Liteon Technology Corporation)
        Packet type: 0xcc18
          DM1 may be used
          DH1 may be used
          DM3 may be used
          DH3 may be used
          DM5 may be used
          DH5 may be used
        Page scan repetition mode: R2 (0x02)
        Page scan mode: Mandatory (0x00)
        Clock offset: 0x0000
        Role switch: Allow slave (0x01)
> HCI Event: Command Status (0x0f) plen 4                                  #4451 [hci0] 11:21:47.234813
      Create Connection (0x01|0x0005) ncmd 5
        Status: Success (0x00)
< HCI Command: Create Connection Cancel (0x01|0x0008) plen 6               #4452 [hci0] 11:21:50.236441
        Address: AA:BB:CC:DD:EE:FF (Liteon Technology Corporation)
> HCI Event: Command Complete (0x0e) plen 10                               #4453 [hci0] 11:21:50.238166
      Create Connection Cancel (0x01|0x0008) ncmd 5
        Status: ACL Connection Already Exists (0x0b)
        Address: AA:BB:CC:DD:EE:FF (Liteon Technology Corporation)

Now, the problem with this is when I want to cancel the connection in step #4452 (Create Connection Cancel), I got the reply in #4453 (Command Complete), that ACL Connection Already Exists.

This breaks Bluetooth's specification, as if "ACL Connection" is already existing, why haven't you sent any Connect Complete HCI event?

danergo commented 2 years ago

It would be great if I can get any feedback or reply to this thread.

danergo commented 2 years ago

ping

danergo commented 2 years ago
< HCI Command: LE Create Connection (0x08|0x000d) plen 25                       #202072 [hci0] 13:57:10.737769
        Scan interval: 60.000 msec (0x0060)
        Scan window: 60.000 msec (0x0060)
        Filter policy: White list is not used (0x00)
        Peer address type: Public (0x00)
        Peer address: AA:BB:CC:DD:EE:FF (OUI AA-BB-CC)
        Own address type: Public (0x00)
        Min connection interval: 25.00 msec (0x0014)
        Max connection interval: 25.00 msec (0x0014)
        Connection latency: 0 (0x0000)
        Supervision timeout: 5000 msec (0x01f4)
        Min connection length: 0.000 msec (0x0000)
        Max connection length: 0.000 msec (0x0000)
< HCI Command: LE Create Connection Cancel (0x08|0x000e) plen 0                 #202073 [hci0] 13:57:14.604338
@ MGMT Event: Connect Failed (0x000d) plen 8                                   {0x0002} [hci0] 13:57:20.738171
        LE Address: AA:BB:CC:DD:EE:FF (OUI AA-BB-CC)
        Status: Not Connected (0x02)

Another violation of Bluetooth standard by your closed source stuff. No response to my "LE Create Connection Cancel".

And only after exactly 10 seconds comes the management event that connection is failed.

1. How can I change this 10 second time?

2. Please attach me a new build with debug log.

danergo commented 2 years ago

@BetterJincheng: can you please look into this?

Thank you!

danergo commented 2 years ago
< HCI Command: LE Create Connection (0x08|0x000d) plen 25                        #42191 [hci0] 06:35:04.599612
        Scan interval: 60.000 msec (0x0060)
        Scan window: 60.000 msec (0x0060)
        Filter policy: White list is not used (0x00)
        Peer address type: Public (0x00)
        Peer address: AA:BB:CC:DD:EE:FF (OUI AA-BB-CC)
        Own address type: Public (0x00)
        Min connection interval: 25.00 msec (0x0014)
        Max connection interval: 25.00 msec (0x0014)
        Connection latency: 0 (0x0000)
        Supervision timeout: 5000 msec (0x01f4)
        Min connection length: 0.000 msec (0x0000)
        Max connection length: 0.000 msec (0x0000)
< HCI Command: LE Create Connection Cancel (0x08|0x000e) plen 0                  #42192 [hci0] 06:35:07.272523
< HCI Command: LE Create Connection Cancel (0x08|0x000e) plen 0                  #42193 [hci0] 06:35:12.277259
> HCI Event: Command Status (0x0f) plen 4                                        #42194 [hci0] 06:35:14.598817
      LE Create Connection (0x08|0x000d) ncmd 3
        Status: Success (0x00)
> HCI Event: Command Complete (0x0e) plen 4                                      #42195 [hci0] 06:35:14.599115
      LE Create Connection Cancel (0x08|0x000e) ncmd 4
        Status: Success (0x00)

How can I cancel a(n) (LE) Create Connection before receiving its "Command Complete" event, 10 seconds later? @BetterJincheng: could you please help me?

danergo commented 2 years ago

@xiewenxiang: I have provided many more logs for your modified closed-source library.

I haven't received any feedback or support from your side since last November which is now 3 months ago.

This is really annoying and unfair.

danergo commented 2 years ago

ping

danergo commented 2 years ago

ping

danergo commented 2 years ago

ping

BetterJincheng commented 2 years ago

I have more logs now:

With your latest version of library you sent, I got this:

< HCI Command: Create Connection (0x01|0x0005) plen 13                     #4450 [hci0] 11:21:47.231569
        Address: AA:BB:CC:DD:EE:FF (Liteon Technology Corporation)
        Packet type: 0xcc18
          DM1 may be used
          DH1 may be used
          DM3 may be used
          DH3 may be used
          DM5 may be used
          DH5 may be used
        Page scan repetition mode: R2 (0x02)
        Page scan mode: Mandatory (0x00)
        Clock offset: 0x0000
        Role switch: Allow slave (0x01)
> HCI Event: Command Status (0x0f) plen 4                                  #4451 [hci0] 11:21:47.234813
      Create Connection (0x01|0x0005) ncmd 5
        Status: Success (0x00)
< HCI Command: Create Connection Cancel (0x01|0x0008) plen 6               #4452 [hci0] 11:21:50.236441
        Address: AA:BB:CC:DD:EE:FF (Liteon Technology Corporation)
> HCI Event: Command Complete (0x0e) plen 10                               #4453 [hci0] 11:21:50.238166
      Create Connection Cancel (0x01|0x0008) ncmd 5
        Status: ACL Connection Already Exists (0x0b)
        Address: AA:BB:CC:DD:EE:FF (Liteon Technology Corporation)

Now, the problem with this is when I want to cancel the connection in step #4452 (Create Connection Cancel), I got the reply in #4453 (Command Complete), that ACL Connection Already Exists.

This breaks Bluetooth's specification, as if "ACL Connection" is already existing, why haven't you sent any Connect Complete HCI event?

Sorry for delay. And I will now try my best to help you solve the problem. A question -> From 11:21:47.234813 to 11:21:50.236441, why hasn't the connection been established for a full three seconds? I do not know if there is something wrong with your application.

danergo commented 2 years ago

Hi, @BetterJincheng:

Connection establishment event should be coming from your library. Thats exactly the problem that this event is not coming out from your closed source library.

My application can do nothing with this. I'm still using your hci_uart_sample: it does only forwarding commands from my Linux box to your closed source library.

So if connection established event is no coming back, it means your closed source library is not handling it properly, but I can't say any more as I have no access to that library.

This way, it seems it breaks the Bluetooth Core Specification.

What we can try is to use some true verbose logging, which can help detection of this problem.

BetterJincheng commented 2 years ago

Hi @danergo,

But I really can not reproduce your issue. Here is my log. Wait for your reply.

< HCI Command: Create Connection (0x01|0x0005) plen 13                                                         #2663 [hci0] 1670.775888
        Address: AA:BB:CC:DD:EE:FF (OUI AA-BB-CC)
        Packet type: 0xcc18
          DM1 may be used
          DH1 may be used
          DM3 may be used
          DH3 may be used
          DM5 may be used
          DH5 may be used
        Page scan repetition mode: R2 (0x02)
        Page scan mode: Mandatory (0x00)
        Clock offset: 0x0000
        Role switch: Allow slave (0x01)
> HCI Event: Command Status (0x0f) plen 4                                                                      #2664 [hci0] 1670.779305
      Create Connection (0x01|0x0005) ncmd 5
        Status: Success (0x00)
< HCI Command: Create Connection Cancel (0x01|0x0008) plen 6                                                   #2665 [hci0] 1673.204458
        Address: AA:BB:CC:DD:EE:FF (OUI AA-BB-CC)
> HCI Event: Command Complete (0x0e) plen 10                                                                   #2666 [hci0] 1673.207398
      Create Connection Cancel (0x01|0x0008) ncmd 5
        Status: Success (0x00)
        Address: AA:BB:CC:DD:EE:FF (OUI AA-BB-CC)
> HCI Event: Connect Complete (0x03) plen 11                                                                   #2667 [hci0] 1673.207400
        Status: Unknown Connection Identifier (0x02)
        Handle: 65535
        Address: AA:BB:CC:DD:EE:FF (OUI AA-BB-CC)
        Link type: ACL (0x01)
        Encryption: Disabled (0x00)
danergo commented 2 years ago

Yes, this is the normal case, how it should work. I also have this 9/10 times (or 95/100 times).

However in some cases I got the Status: ACL Connection Already Exists (0x0b) for my Create Connection Cancel command.

danergo commented 2 years ago

Anyway, I'm trying right now with a different firmware (from Zephyr).

With the new libbtdm_app library, I got bunch of undefined symbols, can you help me resolving these please:

These are all undefined references, during link time:

coex_ble_adv_priority_high_set
ld_acl_evt_canceled_cbk
ld_acl_evt_start_cbk
ld_acl_evt_stop_cbk
lc_pwr_max_ind_handler
lc_pwr_incr_ind_handler
lc_pwr_decr_ind_handler
lc_acl_disc_ind_handler

Thank you

BetterJincheng commented 2 years ago

Please do not use the lib I provided on Nov 29, 2021 as it was just a tester. Use the original lib on branch v4.4-dev-2533-g83956ebbae.

BetterJincheng commented 2 years ago

I used a simple script to test these two command and did not get Status: ACL Connection Already Exists (0x0b). Any suggestions?

#!/bin/bash

for i in $(seq 1 100)
do
    # Create Connection
    sudo hcitool -i hci0 cmd 0x01 0x0005 0xff 0xee 0xdd 0xcc 0xbb 0xaa 0x18 0xcc 0x02 0x00 0x00 0x00 0x01;
    sleep 2s;
    # Create Connection Cancel
    sudo hcitool -i hci0 cmd 0x01 0x0008 0xff 0xee 0xdd 0xcc 0xbb 0xaa;
    sleep 1s;
done
danergo commented 2 years ago

Any suggestions?

You can try running this script parallelly, 2-3 instances, and let's see. It's kind of a "load-test", but still, if connection cancel reports that ACL connection exists, we should get notified about that. I'll also try reproducing it.

Use the original lib on branch v4.4-dev-2533-g83956ebbae

Ah, okay! And does this one contain your patch too?

Thank you!

danergo commented 2 years ago

Ping

danergo commented 2 years ago

@BetterJincheng: can you tell me now please which branch contains these or this patch(es)? Release is already at version 4.4, so it is containing this, or no? Thank you.

BetterJincheng commented 2 years ago

@danergo patch for what?

danergo commented 2 years ago

@BetterJincheng: you have sent me a "test" library on this post:

libbtdm_app.zip @danergo Please try this lib to verify whether it fixed the second problem. Thanks

The question is: which release branch is this merged to?

Thank you!

BetterJincheng commented 2 years ago

@danergo That lib is just for test. You had reported many more issues with the lib. We have not merged it into any release branch.

Due to the covid-19 pneumonia, we are working in our home now. We can not access the lib code until we get back to office.

danergo commented 2 years ago

@BetterJincheng: ah I see. Thanks for confirming.

So currently, the best to use this library is to clone the release version4.4 and replace the library inside?

danergo commented 2 years ago

@BetterJincheng ping

danergo commented 2 years ago

Which is the EARLIEST branch contains these fixes? Please let me know asap. Thank you. It doesnt have to be release branch.

danergo commented 2 years ago

Do you have some better support channel? Like discord? Im getting sick on this issue since last October. Thank you.

BetterJincheng commented 2 years ago

As I said, that lib was just a test. Since you reported more problems based on that lib, we will not merge that lib. Now, all release branches have fixed the original problem of 'Create Connection Cancel'. Since the lib I provided you for testing purposes has caused more problems, I suggest you never use that lib again. As for the remaining problems (0x0b), we will solve them as soon as we return to the office. Now, what we can do is very limited.

danergo commented 2 years ago

Ah, okay, I got your point now. Indeed checking out a 4.4.1 release branch checks out a submodule of libesp32, which seems to have your fix for the initial problem of 'Create Connection Cancel'.

I will retest this now, as there was a major issue on the serial port which ESP32 used - to be fair, it could cause package loss. Now, the serial port issue has been fixed, and release branch has the fix for the 'Create Connection Cancel' problem.

Thank you very much, I'll feedback here once I have some news, I'm having high hopes now on closing this issue soon.

Best wishes for you!