Scottapotamas commented 6 months ago

Answers checklist.

[X] I have read the documentation ESP-IDF Programming Guide and the issue is not addressed there.
[X] I have updated my IDF branch (master or release) to the latest version and checked that the issue is present there.
[X] I have searched the issue tracker for a similar issue and not found a similar issue.

General issue report

I've been working on a series of latency benchmarks for different wireless radios/stacks, and measured some odd behaviours from NimBLE when compared to Bluedroid for GATT/SPP style transfers.

These measurements are for one-way transfer latency - no ack/response behaviour is implemented or measured.

esp32-spp-results

esp32-ble-results

esp32-nimble-results

Is this behaviour reasonable/expected for the NimBLE + ESP32 stack? Any suggestions?

Comparing server notify and client WriteNoResp is also inconsistent between Bluedroid and NimBLE:

esp32-ble-server-client-directionality

esp32-nimble-server-client-directionality

Reproduction Notes

Software

Code for esp32-spp (classic), esp32-ble, esp32-nimble is on GitHub. These vary between minor changes from Espressif examples to heavier modifications to achieve feature parity.

The biggest difference from IDF examples: I've removed UART bridge behaviour, test payloads are handled directly on device.

Where a test payload exceeds the MTU, I manually split it into MTU-sized packets. The benchmark task waits for a library-level event (e.g. BLE_GAP_EVENT_NOTIFY_TX) before sending the next packet, similar to the approach used in the Espressif throughput example.
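A minimal sketch of that chunking step, in plain C with no NimBLE dependency. The names `chunker_t`, `chunker_init` and `chunker_next` are hypothetical and not taken from the benchmark code; in real firmware the call to `chunker_next` would presumably be driven by the stack's TX event rather than a loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: walks a payload in chunks of at most chunk_size bytes. */
typedef struct {
    const uint8_t *data;   /* full test payload */
    size_t len;            /* total payload length in bytes */
    size_t offset;         /* bytes already handed out */
    size_t chunk_size;     /* maximum bytes per packet */
} chunker_t;

void chunker_init(chunker_t *c, const uint8_t *data, size_t len, size_t chunk_size)
{
    assert(c != NULL && chunk_size > 0);
    c->data = data;
    c->len = len;
    c->offset = 0;
    c->chunk_size = chunk_size;
}

/* Points *out at the next chunk and returns its length,
 * or returns 0 once the whole payload has been consumed. */
size_t chunker_next(chunker_t *c, const uint8_t **out)
{
    size_t remaining = c->len - c->offset;
    size_t n = remaining < c->chunk_size ? remaining : c->chunk_size;
    *out = c->data + c->offset;
    c->offset += n;
    return n;
}
```

With a 1024-byte payload and 200-byte chunks this yields six packets: five of 200 bytes and one of 24 bytes.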

I originally ran these tests with IDF 5.1.1 but can reproduce them with latest v5.3-dev-892-g692c1fcc52 which is ~3 days old.

docker run -i --privileged --rm -v $PWD:/project -w /project -it espressif/idf:latest

Other relevant changes with menuconfig:

Change NimBLE internal logging to "Warning" to suppress extra terminal output (which affects results).
Change compiler options to build for performance -O2

Test Setup

Two ESP32-WROOM-32 devkit boards, positioned on table 1m apart.
Signal generator provides a 3.3V 50 μs square pulse at configurable interval (250 ms for all tests) as stimulus signal. Connected to IO19, handled via GPIO ISR.
When the trigger is flagged and a BT connection is already established, the sending board transmits the test payload.
When the rx event occurs on the receiving board, the payload data is sent to the user's benchmark task via a FreeRTOS queue.
The benchmark task checks the payload for length and CRC; if valid, it drives IO18 high to signal a valid transmission.
Saleae Logic 8 captures trigger and complete signals at 100 MS/s (10 ns resolution).
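The length-and-CRC check on the receiving side could look roughly like the sketch below. The CRC variant and frame layout are not stated in this thread, so CRC-16/CCITT-FALSE appended big-endian is purely an assumption, and `payload_is_valid` is a hypothetical name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* CRC-16/CCITT-FALSE: polynomial 0x1021, initial value 0xFFFF, no reflection.
 * Assumed variant; the benchmark's real CRC may differ. */
uint16_t crc16_ccitt(const uint8_t *data, size_t len)
{
    assert(data != NULL || len == 0);
    uint16_t crc = 0xFFFF;
    for (size_t i = 0; i < len; i++) {
        crc ^= (uint16_t)((uint16_t)data[i] << 8);
        for (int bit = 0; bit < 8; bit++) {
            if (crc & 0x8000) {
                crc = (uint16_t)((crc << 1) ^ 0x1021);
            } else {
                crc = (uint16_t)(crc << 1);
            }
        }
    }
    return crc;
}

/* Assumed frame layout: payload bytes followed by a big-endian CRC-16.
 * Returns true only if the length matches and the CRC agrees. */
bool payload_is_valid(const uint8_t *frame, size_t frame_len, size_t expected_len)
{
    if (frame == NULL || frame_len != expected_len || frame_len < 3) {
        return false;
    }
    size_t body_len = frame_len - 2;
    uint16_t received = (uint16_t)((frame[body_len] << 8) | frame[body_len + 1]);
    return crc16_ccitt(frame, body_len) == received;
}
```

In the setup described above, a `true` result would be the point at which IO18 is driven high.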

I've measured trigger-to-output overhead at ~4.11 μs when tested in a loopback configuration.
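For reference, turning a pair of logic analyser sample indices into a latency figure is simple arithmetic. The sketch below assumes the fixed trigger-to-output overhead is subtracted from each measurement, which the thread does not confirm; `latency_us` is a hypothetical helper.

```c
#include <assert.h>
#include <stdint.h>

/* Converts the sample indices of the trigger edge and the "valid" edge
 * into a latency in microseconds, minus a fixed measurement overhead. */
double latency_us(uint64_t trigger_sample, uint64_t valid_sample,
                  double sample_rate_hz, double overhead_us)
{
    assert(valid_sample >= trigger_sample);
    assert(sample_rate_hz > 0.0);
    double raw_us = (double)(valid_sample - trigger_sample) * 1e6 / sample_rate_hz;
    return raw_us - overhead_us;
}
```

At 100 MS/s each sample is 10 ns, so a gap of 750000 samples is 7500 μs before the overhead is removed.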

All firmware variants support both server notify and client writes, so swapping the trigger/valid signal connections allows testing client-server direction as needed.

Scottapotamas commented 6 months ago

After a bit more testing I worked out that manually specifying connection parameters is needed to allow NimBLE to match Bluedroid's defaults.

    // Sets the client's BLE connection behaviours 
    // https://mynewt.apache.org/latest/network/ble_hs/ble_gap.html#c.ble_gap_update_params
    // ITVL uses 1.25 ms units
    // Timeout uses 10 ms units
    // CE LEN uses 0.625 ms units
    // BLE specifies a minimum connection interval of 7.5 ms
    struct ble_gap_upd_params conn_parameters = { 0 };
    conn_parameters.itvl_min = 6;   // 7.5 ms
    conn_parameters.itvl_max = 24;  // 30 ms
    conn_parameters.latency = 0;
    conn_parameters.supervision_timeout = 20;  // 200 ms
    // https://github.com/apache/mynewt-nimble/issues/793#issuecomment-616022898
    conn_parameters.min_ce_len = 0x00;
    conn_parameters.max_ce_len = 0x00;

    ble_gap_update_params(peer->conn_handle, &conn_parameters);

This improves results substantially.

esp32-nimble-notify-write-override-connparams

esp32-nimble-override-connparams
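The raw values in the connection parameter snippet earlier in this comment are easy to get wrong because each field uses a different unit. A small sketch of the conversions follows, using hypothetical helper names that are not part of NimBLE. The sanity check encodes the Bluetooth rule that the supervision timeout must be larger than (1 + latency) × interval × 2.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Connection interval fields use 1.25 ms units; work in microseconds
 * so everything stays in integer arithmetic. */
uint32_t itvl_units_to_us(uint16_t units)
{
    return (uint32_t)units * 1250u;
}

uint16_t itvl_units_from_us(uint32_t us)
{
    assert(us % 1250u == 0);   /* only exact multiples are representable */
    return (uint16_t)(us / 1250u);
}

/* Supervision timeout uses 10 ms units. */
uint32_t timeout_units_to_ms(uint16_t units)
{
    return (uint32_t)units * 10u;
}

/* True if the timeout is strictly larger than (1 + latency) * itvl_max * 2. */
bool supervision_timeout_ok(uint16_t itvl_max_units, uint16_t latency,
                            uint16_t timeout_units)
{
    uint32_t timeout_us = (uint32_t)timeout_units * 10000u;
    uint32_t min_us = (1u + latency) * itvl_units_to_us(itvl_max_units) * 2u;
    return timeout_us > min_us;
}
```

For the values used above, `itvl_units_to_us(6)` is 7500 μs, `itvl_units_to_us(24)` is 30000 μs, and a supervision timeout of 20 units (200 ms) passes the check.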

Remaining comments for Espressif:

While the lower bound latencies for both Notify and Writes are now in line with the BLE 7.5 ms interval minimum, NimBLE is still typically slower and has a wider spread of latencies.
The 1KiB test (sent in smaller chunked packets due to MTU=200) is still far slower than Bluedroid.
I found the only mention of this in the blecent_throughput example.

xyzzy42 commented 6 months ago

Have you verified the actual LL packet MTU is increased to 200?

I've found that when using NimBLE on ESP32-S3 (other ESP32s untested), the LL MTU is not increased. Calling ble_att_set_preferred_mtu() and/or ble_gattc_exchange_mtu() only changes the ATT layer MTU. The larger ATT packets are still fragmented via L2CAP into 27-byte LL packets, which is of course quite disastrous for performance.
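To put a number on that fragmentation, here is a rough sketch that counts the LL data packets needed for one ATT notification. It assumes the standard 4-byte L2CAP header and 3-byte ATT notification header, and ignores encryption overhead; `ll_packets_for_notification` is a hypothetical helper.

```c
#include <assert.h>
#include <stdint.h>

#define L2CAP_HEADER_LEN      4u  /* 2-byte length + 2-byte channel ID */
#define ATT_NOTIFY_HEADER_LEN 3u  /* 1-byte opcode + 2-byte attribute handle */

/* Number of LL data packets needed to carry one notification of
 * value_len bytes when each LL packet holds at most ll_payload_max bytes. */
uint32_t ll_packets_for_notification(uint32_t value_len, uint32_t ll_payload_max)
{
    assert(ll_payload_max > 0);
    uint32_t l2cap_frame_len = value_len + ATT_NOTIFY_HEADER_LEN + L2CAP_HEADER_LEN;
    /* Ceiling division: a partial final fragment still costs a packet. */
    return (l2cap_frame_len + ll_payload_max - 1u) / ll_payload_max;
}
```

Under these assumptions a 128-byte notification value needs five LL packets at the default 27-byte payload, and one packet once the limit is raised to 200 bytes.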

To get larger LL packets, it's necessary to send an HCI command to increase the controller's connInitialMaxTxOctets (I tested this) or the connection's connMaxTxOctets (probably, I haven't tested) value.

Other BLE stacks I've used haven't required this. I think the flaw is in Espressif's controller implementation. I think it's expected to increase connMaxTxOctets in response to receiving a LL_LENGTH_REQ PDU. It doesn't do this. The NimBLE controller (not used on ESP32) does this. The BT core spec (Ver 5.3, Vol 6, Part B, §5.1.9 "Data Length Update procedure") seems to imply the controller should do this.

KaeLL commented 5 months ago

@rahult-github thoughts?

Scottapotamas commented 5 months ago

@xyzzy42 Thanks for chiming in, your hint led me down the right path.

Sniffing the transfers, I see the 200 byte MTU update packets during connection, but for the larger transfers i.e. 128B, I still saw fragmented 26B LL transfers as you described:

I was able to resolve this issue by calling

    #define LL_PACKET_TIME (2120)
    #define LL_PACKET_LENGTH (200)
    // ...
    ble_hs_hci_util_set_data_len( event->connect.conn_handle, LL_PACKET_LENGTH, LL_PACKET_TIME );

inside BLE_GAP_EVENT_CONNECT on successful connection.
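For context on the 2120 value: on the LE 1M PHY each octet takes 8 μs on air, and an LL data packet carries 14 octets of overhead (preamble, access address, header, MIC and CRC) besides its payload. The sketch below is just that arithmetic; `ll_tx_time_us_1m` is a hypothetical helper, not a NimBLE function.

```c
#include <assert.h>
#include <stdint.h>

/* Preamble (1) + access address (4) + PDU header (2) + MIC (4) + CRC (3). */
#define LL_OVERHEAD_OCTETS 14u

/* Air time in microseconds for one LL data packet on the LE 1M PHY,
 * where every octet takes 8 us to transmit. */
uint16_t ll_tx_time_us_1m(uint16_t payload_octets)
{
    assert(payload_octets >= 27 && payload_octets <= 251);
    return (uint16_t)((payload_octets + LL_OVERHEAD_OCTETS) * 8u);
}
```

This gives 328 μs for the default 27-octet payload and 2120 μs for the 251-octet maximum, so 2120 is the upper bound for the 1M PHY; a 200-octet packet itself works out to 1712 μs.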

This was needed on both the server and client boards. Wireshark trace shows a single packet for the 128B test (pictured below) and correctly used 6x single notification packets which matches the expected application-side 1024B chunking behaviour.

Here's a comparison between these different changes (HCI data also includes conn parameter changes)

esp32-nimble-fixes

With the HCI length call, the 1024B test is twice as quick (halved latency), with minimal improvements to the smaller one-packet sized tests. This brings it roughly in line with Bluedroid's default performance.

Updated comments for Espressif:

There is little/no documentation for much of NimBLE which makes finding these kinds of API calls hard.
I'd probably consider the loss of MTU length across ESP-NimBLE VHCI to be a bug.
Mentioning default behaviour and performance would save others from needing to troubleshoot poor default performance, I'd suggest a section similar to the IDF Speed Optimization Docs?

xyzzy42 commented 5 months ago

You can also call ble_gap_write_sugg_def_data_len() before creating any connections to set the initial MTU length to be longer.

These two functions are totally missing from any documentation. Even things like Espressif's "How do I increase the MTU?" FAQ do not mention them.

In other BLE stacks I've used, this isn't necessary on both peers in the connection, as it is with ESP32+NimBLE. Only one, the GATT client, needs to send the length request. Then the other peer will increase the MTU in response to that.

I'm not entirely sure if it should be the NimBLE host or the Espressif controller which should be doing this. I think it's the controller. But I'm pretty sure a direct HCI call via a barely known function in application code is not the correct way.

Scottapotamas commented 5 months ago

Thanks for the suggestion.

Yeah I agree that calling against HCI functions from application space is a bad idea™.

The idiomatic NimBLE approach (mynewt-nimble source):

ble_gap_set_data_len() wraps the ble_hs_hci_util_set_data_len() function I hacked with above. I tested it and got the same behaviour.
ble_gap_write_sugg_def_data_len() doesn't seem to impact my test code and/or ESP-WROOM-32 at all.

This was only a detail I found while testing NimBLE as a subset of other benchmarks, so I'm content with the 'fixed' results and can move on to other things.

I'd like to see an official statement/explanation and some improvements for future users though.

xyzzy42 commented 5 months ago

ble_gap_write_sugg_def_data_len() doesn't seem to impact my test code and/or ESP-WROOM-32 at all.

Do you mean that it makes no difference compared with using ble_gap_set_data_len(), or that it doesn't cause the MTU to increase?

I tested this on ESP32-S3 and ble_gap_write_sugg_def_data_len() did work to increase the MTU for new connections. If I read the HCI spec correctly, it must be called before the connection is established.

Scottapotamas commented 5 months ago

I tested ble_gap_write_sugg_def_data_len() in a few places (applied to both boards):

fairly early in setup, prior to nimble_port_run()
after setup, i.e. advertise/scan has started, prior to connection
at time of connection

and I still saw LL fragmentation in Wireshark captures. Might be a subtlety I'm missing there.

xyzzy42 commented 5 months ago

I'm calling after esp_nimble_hci_init() and nimble_port_run() and before ble_gap_adv_start(). Also I'm using ESP32-S3.

I wonder if this is a difference between the controller for the ESP32 vs the ESP32-S3? I also wonder, if the ESP32 doesn't support this HCI command, if it returns an error code? Since these commands never appear in any Espressif documentation, I doubt any difference in controller support between chips is documented either.

xyzzy42 commented 1 week ago

I found a new problem, which might explain some of the differences seen. Calling ble_gap_write_sugg_def_data_len() after nimble_port_run(), but before advertising starts, worked to get a larger LL MTU sometimes, but not always.

1. Android phone, unbonded. Works.
2. Android phone, bonded. Works.
3. iPhone, unbonded. Works.
4. iPhone, bonded. Fails!

I.e., it doesn't work when the client is an iPhone that is already bonded to the server (ESP32).

An examination of the packet capture shows that in the working cases, the LL_LENGTH_REQ packet is sent from the client to the server while the connection is still unencrypted. In case 2, the connection is encrypted after this request, and in cases 1 and 3 the connection remains unencrypted.

But in case 4, the connection is encrypted first, and then the LL_LENGTH_REQ packet is sent. It's the first packet sent after the encryption handshake finishes. Something in the Espressif controller does not like this, and responds to the request with LL_REJECT_EXT_IND, LMP PDU Not Allowed.

I don't know whether the encryption is actually related to this problem or not. This is happening inside the binary-only controller code, so I can't debug it further. But encryption is the only obvious difference between the accepted MTU requests and the rejected one.

I then tried using ble_gap_set_data_len() on the connection, after it's set up and the client-initiated (iPhone) request has already failed. This generates an LL_LENGTH_REQ from the server, and the phone accepts it and increases the LL MTU.

espressif / esp-idf

Unexpected NimBLE GATT performance compared to Bluedroid (IDFGH-11677) #12789
