ble_hs_hci_cmd causes LoadStoreError

h2zero commented 4 years ago

Haven't had time to debug further but I have traced to the line at which NimBLE crashes when connected to multiple peripherals as client.

This happens frequently when connecting with low connection intervals (7.5ms).

When connecting to a 3rd peripheral and calling for an mtu exchange from the connect event I get a LoadStoreError on CPU 0 that seems to happen at this line.

When I get more time I'll try to get more information / test code.

prasad-alatkar commented 4 years ago

Hi @h2zero Thank you for raising the issue. Can you please share console logs (backtrace) when crash occurs ? It would be really helpful if you could share debug logs as well (menuconfig--> Component config --> Log output --> Default log verbosity to Debug). Apart from this please attach sdkconfig as well.

h2zero commented 4 years ago

Hi @prasad-alatkar, I meant to get those details posted sooner but haven't had time. Hopefully I can have something later on today.

h2zero commented 4 years ago

Had some spare time at work and managed to log the error.

GATT procedure initiated: exchange mtu
host tx hci data; handle=1 length=7
ble_hs_hci_acl_tx(): 0x01 0x00 0x07 0x00 0x03 0x00 0x04 0x00 0x02 0x00 0x01
txed att command: mtu req; conn=1 mtu=256
ble_hs_hci_cmd_send: ogf=0x08 ocf=0x0016 len=2
0x16 0x20 0x02 0x01 0x00
Command Status: status=0 cmd_pkts=5 ocf=0x16 ogf=0x8
Guru Meditation Error: Core  0 panic'ed (LoadStoreError). Exception was unhandled.
Core 0 register dump:
PC      : 0x4001742d  PS      : 0x00060533  A0      : 0x8004a7b4  A1      : 0x3ffb5d00
A2      : 0x3ffb3038  A3      : 0x40003096  A4      : 0x3ffb48f0  A5      : 0x3ffb2f84
A6      : 0x3ffb8360  A7      : 0x00000000  A8      : 0x8008b830  A9      : 0x3ffb5ce0
A10     : 0xffffffff  A11     : 0x00000001  A12     : 0x00060720  A13     : 0x00000021
A14     : 0x000001e0  A15     : 0x3ffb8360  SAR     : 0x0000001d  EXCCAUSE: 0x00000003
EXCVADDR: 0x40003096  LBEG    : 0x4000c46c  LEND    : 0x4000c477  LCOUNT  : 0xffffffff

ELF file SHA256: c4d5ecb8bad20e4ab42058c854ab08f96c972430066b87d5401e1a86fd2b8510

Backtrace: 0x4001742d:0x3ffb5d00 0x4004a7b1:0x3ffb5d20 0x400f8169:0x3ffb5d40 0x400f5f2b:0x3ffb5d60 0x400469b6:0x3ffb5db0 0x40019d11:0x3ffb5de0 0x40055b4d:0x
3ffb5e00 0x400f0adb:0x3ffb5e20 0x400f0fd1:0x3ffb5e40 0x4008f839:0x3ffb5e70
0x400f8169: lld_pdu_get_tx_flush_nb at ??:?

0x400f5f2b: r_lld_evt_elt_delete at ??:?

0x400f0adb: r_rw_schedule at ??:?

0x400f0fd1: btdm_controller_task at ??:?

0x4008f839: vPortTaskWrapper at C:/msys32/home/Ryan/esp/esp-idf/components/freertos/port.c:403

CPU halted.

This happens when connecting to a third device most of the time and mainly with low connection intervals (7.5ms guarantees the failure) as you can see in the log. Seems to happen most often when the connection parameters change as a similar log is printed when accepting an L2CAP_UPDATE_REQ as well.

My code flow is: scan -> find device -> connect (parameters specified in the call to ble_gap_connect()) -> exchange MTU -> read characteristics & subscribe for notifications -> scan for next device and repeat.

Note that this is also using IDF v3.2 as I have backported this repo, as well as the latest NimBLE related code in the IDF repo to use with Arduino. So possibly this is just an issue with that IDF version, I'll try with the latest IDF master branch later on.

h2zero commented 4 years ago

I believe I have resolved this issue in my code. It seems calling ble_gattc_exchange_mtu() in the connection event callback can cause this. When returning from the callback NimBLE calls ble_gap_rd_rem_sup_feat_tx() and this is when the error occurs.

I'm not sure why it happens, however returning from the callback then calling ble_gattc_exchange_mtu() after seems to prevent the error.

I will close this for now as it seems to be a user error 😃

h2zero commented 4 years ago

Closed this too soon it seems.

Just tested with IDF v4.0 and encountered the same problem.

I NimBLEScan: "NEW DEVICE FOUND: 30:ae:a4:7c:3c:a6"
I NimBLEScan: "UPDATING PREVIOUSLY FOUND DEVICE: 30:ae:a4:7c:3c:a6"
Advertised Device found: Name: , Address: 30:ae:a4:7c:3c:a6, advType: Undirected - Connectable / Scannable
I NimBLEScan: "NEW DEVICE FOUND: 30:ae:a4:cc:24:26"
I NimBLEScan: "UPDATING PREVIOUSLY FOUND DEVICE: 30:ae:a4:cc:24:26"
Advertised Device found: Name: NimBLE-Arduino, Address: 30:ae:a4:cc:24:26, serviceUUID: 0xdead, txPower: 3, advType: Undirected - Connectable / Scannable
I NimBLEAdvertisedDevice: "Comparing UUIDS: 0xdead 0xdead"
Found Our Service
Scan Ended
New client created
GAP procedure initiated: connect; peer_addr_type=0 peer_addr=30:ae:a4:cc:24:26 scan_itvl=16 scan_window=16 itvl_min=6 itvl_max=6 latency=0 supervision_timeout=54 min_ce_len=16 max_ce_len=768 own_addr_type=0
GATT procedure initiated: exchange mtu
Guru Meditation Error: Core  0 panic'ed (LoadStoreError). Exception was unhandled.
Core 0 register dump:
PC      : 0x4001742d  PS      : 0x00060133  A0      : 0x8004a7b4  A1      : 0x3ffb5e20
A2      : 0x3ffb3110  A3      : 0x40017c36  A4      : 0x3ffb49f4  A5      : 0x3ffb308c
A6      : 0x3ffb8360  A7      : 0x00000000  A8      : 0x8008c80c  A9      : 0x3ffb5e00
A10     : 0xffffffff  A11     : 0x00000001  A12     : 0x00060320  A13     : 0x00000021
A14     : 0x000001f4  A15     : 0x3ffb8360  SAR     : 0x0000001d  EXCCAUSE: 0x00000003
EXCVADDR: 0x40017c36  LBEG    : 0x4000c46c  LEND    : 0x4000c477  LCOUNT  : 0xffffffff

ELF file SHA256: 897112a24fd081f2f5896170305cf9af4df44915f01015d207e7d5fd1551cec3

Backtrace: 0x4001742a:0x3ffb5e20 |<-CORRUPTED

CPU halted.

After investigating the issue I found when it returns from here is when I get the error. Which seems to go back to the binary library.

I NimBLEScan: "UPDATING PREVIOUSLY FOUND DEVICE: 30:ae:a4:cc:24:26"
Advertised Device found: Name: NimBLE-Arduino, Address: 30:ae:a4:cc:24:26, serviceUUID: 0xdead, txPower: 3, advType: Undirected - Connectable / Scannable
I NimBLEAdvertisedDevice: "Comparing UUIDS: 0xdead 0xdead"
Found Our Service
E (8788) NimBLE: ***returning: host_rcv_pkt***
Scan Ended
New client created
GAP procedure initiated: connect; peer_addr_type=0 peer_addr=30:ae:a4:cc:24:26 scan_itvl=16 scan_window=16 itvl_min=6 itvl_max=6 latency=0 supervision_timeout=54 min_ce_len=16 max_ce_len=768 own_addr_type=0
E (8818) NimBLE: ***returning: host_rcv_pkt***
E (9038) NimBLE: ***returning: host_rcv_pkt***
GATT procedure initiated: exchange mtu
E (9038) NimBLE: ***returning: host_rcv_pkt***
Guru Meditation Error: Core  0 panic'ed (LoadStoreError). Exception was unhandled.
Core 0 register dump:
PC      : 0x4001742d  PS      : 0x00060133  A0      : 0x8004a7b4  A1      : 0x3ffb5e20
A2      : 0x3ffb3110  A3      : 0x4002863e  A4      : 0x3ffb49f4  A5      : 0x3ffb301c
A6      : 0x3ffb8360  A7      : 0x00000000  A8      : 0x8008c80c  A9      : 0x3ffb5e00
A10     : 0xffffffff  A11     : 0x00000001  A12     : 0x00060320  A13     : 0x00000021
A14     : 0x000001f0  A15     : 0x3ffb8360  SAR     : 0x0000001d  EXCCAUSE: 0x00000003
EXCVADDR: 0x4002863e  LBEG    : 0x4000c46c  LEND    : 0x4000c477  LCOUNT  : 0xffffffff

ELF file SHA256: 44b73b484bfe7894c8a8bc14cbf383afce00d5d4445d5420b93c12d7b1ac5eb3

Backtrace: 0x4001742a:0x3ffb5e20 |<-CORRUPTED

This IDF issue may also be related.

h2zero commented 4 years ago

Well after a bit of experimenting today I found a solution to this problem. Changing:

#define BLE_GAP_INITIAL_CONN_MIN_CE_LEN     0x0010
#define BLE_GAP_INITIAL_CONN_MAX_CE_LEN     0x0300

To:

#define BLE_GAP_INITIAL_CONN_MIN_CE_LEN     0x0000
#define BLE_GAP_INITIAL_CONN_MAX_CE_LEN     0x0000

Seems to have completely eliminated the issue.

I'm thinking that even though NimBLE does not use these parameters they are still being sent to the esp-32 controller, at which point it may be using them? The error seems to occur when responding to an BLE_GAP_EVENT_L2CAP_UPDATE_REQ, especially when the parameters get accepted. This seems to happen near the same time as the MTU exchange call my code made so it was masking it.

More specifically, when responding with acceptance to the peer parameters, NimBLE uses those defines for the min_ce_len and max_ce_len which of course get sent to the controller.

Further information is also in this PR expressing a related problem.

@prasad-alatkar Do you know if the controller uses these values, or am I just imagining things here?

prasad-alatkar commented 4 years ago

Hi @h2zero , as per the spec,

The Minimum_CE_Length and Maximum_CE_Length are information parameters providing the Controller with a hint about the expected minimum and maximum length of the connection events. The Minimum_CE_Length shall be less than or equal to the Maximum_CE_Length.

So, these are optional parameters and are indicative of expected connection event duration.

connect (parameters specified in the call to ble_gap_connect())

Here, min_ce_len & max_ce_len can be specified without changing macros, because there is no verification/validation of these parameters in ble_gap_connect(). Apart from that in ble_gap_rx_param_req, peer_params.min_ce_len & peer_params.max_ce_len are initialized to 0. This convinces me that the default values i.e. 16*0.625 = 10ms & 768*0.625 = 480 ms are not meant to be lower and upper boundaries for CE_length, but some initial values maybe. Please let me know if you find anything wrong.

I will have internal discussion on how controller uses these values and let you know.

h2zero commented 4 years ago

Thanks @prasad-alatkar, I realize those parameters can be changed in ble_gap_connect() and thanks for pointing out that they are initialized to 0 in ble_gap_rx_param_req(). That gives me a better idea why it works when I changed the macros.

I've done more testing on this and the issue is due to ble_l2cap_sig_update_req_rx() when connecting to a peripheral using L2CAP to request parameters. At this line you can see the macros are being used to initialize the min\max_ce_len.

So when I connect to 3 peripherals each requesting, via L2CAP, 20ms interval times and if I accept them, the response will also set the connection event length minimum to 10ms. It seems the esp32 controller then would be trying to allocate 3*10ms connection events inside 20ms connection intervals which produces the error, while if I only connect to 2 peripherals there is no issue.

On a side note, if I keep those parameters at 0 the performance, reliability and speed are improved drastically. Far less connection failed to be established errors and the speed of finding->(re)connecting is incredible.

Perhaps ble_l2cap_sig_update_req_rx() should initialize the parameters to 0, the same as ble_gap_rx_param_req()?

Then again those macros don't really serve a purpose elsewhere in NimBLE except in the case of calling ble_gap_connect() with NULL for parameters. As their controller does not use them as stated here.

prasad-alatkar commented 4 years ago

Hi @h2zero Thank you for your time and efforts, I am yet to establish what could be wrong with ble_l2cap_sig_update_req_rx even with the macros. Maybe @Sushant-Espressif can help us understand the controller's role when handling min_ce_len & max_ce_len.

h2zero commented 4 years ago

Sounds good, thanks. Sorry if I’m being a pain 😄. Too much time on my hands now that I’ve had to isolate myself for a week. I’d just like to see this library be the best it can.

prasad-alatkar commented 4 years ago

So when I connect to 3 peripherals each requesting, via L2CAP, 20ms interval times and if I accept them, the response will also set the connection event length minimum to 10ms. It seems the esp32 controller then would be trying to allocate 3*10ms connection events inside 20ms connection intervals which produces the error, while if I only connect to 2 peripherals there is no issue.

Makes sense !! I was able to reproduce the issue and could see improvement with changing min_ce_len & max_ce_len. I am comfortable with changing macros (at least minimum) to 0, as anyway they are optional. I will track the related discussion from your upstream NimBLE issue, accordingly we can take it forward.

h2zero commented 4 years ago

Glad you could reproduce it on your end. I'll bet you saw a nice performance boost too 😃.

h2zero commented 4 years ago

Closing this as upstream merged PR #808.

espressif / esp-nimble

ble_hs_hci_cmd causes LoadStoreError #4