espressif / esp-idf

Espressif IoT Development Framework. Official development framework for Espressif SoCs.
Apache License 2.0
13.89k stars 7.33k forks source link

HCI: after MTU change, gatt client's answer is not pushed to HCI (IDFGH-13790) #14648

Closed danergo closed 18 hours ago

danergo commented 2 months ago

Answers checklist.

IDF version.

Latest

Espressif SoC revision.

NodeMCU-ESP-32S

Operating System used.

Linux

How did you build your project?

Command line with idf.py

If you are using Windows, please specify command line type.

None

Development Kit.

NodeMCU-ESP-32S

Power Supply used.

External 5V

What is the expected behavior?

Stable operation

What is the actual behavior?

Manual power recycle is needed in every 12hrs.

hciconfig hci0 reset is timing out. btattach also times out.

Watchdog enabled but it is not triggering a reset. Coredump enabled but no coredump is being written. Verbose logging also enabled but only few log items are shown.

Steps to reproduce.

Ble ad scanning with hardware filtering (based on device mac and ad) at least 8 devices.

In every 5mins, try connecting to a standard (not ble) devices (which is out of range) - so connection will have to fail always.

Occasionally connect to a ble device (which is in range and shall be succeed).

Every 12 hours (roughly) we have to manually reset the esp. Otherwise hci0 will eventually go down.

Before hci0 going down, we can still try connecting to a ble device but we can't receive longer data from it.

(Ble device asks us an MTU increase, and we accept it, but then we can't receive data: but this happens ONLY after 10-12 hours of constant stressing esp with the above advertising scaninngs and 5mins inactive device connect trials).

I guess some buffer is overfilling but I couldnt enable any practical logging in menuconfig.

What do you suggest?

Debug Logs.

No response

More Information.

No response

esp-zhp commented 1 month ago

Regarding the issue "ESP32 did not send the Sent Write Requests to Handle: 0x0092," I will add more logging information. Please allow me some time to do so.

Additionally, could you provide a demo that reproduces the problem on your end? It would be even better if the other device is also an ESP.I want to reproduce the issue locally.

danergo commented 1 month ago

Demo is very simple: listen to LE Ads from 8 temperature sensing devices (passive scanning without any filtering). And occasionally connect to this peer.

Slight addition to sniffer vs hci log: sniffer log was started a little bit later, so it misses 2 connection trials to this peer.

Based on the provided HCI log: There were a total of 37 Sent Write Requests to Handle: 0x0092, but only 25 Received Write Responses. Therefore, 37 - 25 = 12 Write Responses were not received.

The problem here is the same IMHO: 12 Write Requests haven't sent out for some unknown reason. This is 12 only, because we stopped the instance (anyway it could be 12000 also). So:

From the sniffer log analysis: There were 23 Sent Write Requests to Handle: 0x0092, and 21 Received Write Responses for Handle: 0x0092. This suggests that the peer device may have failed to respond to 23-21 = 2 Write Requests

We wanted to send 37 Write Requests (based on HCI log), and sniff log contains only 23 Write Request. To compensate the initial delayed start, 37 - 23 - 2 = 12. This is the amount that doesn't show up in the sniffer log, making high probability that it's not even sent out.

Anyway, 1-2 response failure can be accepted: our app will retry connecting once we don't receive response within a certain time-frame. 1-2 can be caused by radio interference too, so I wouldn't take it as serious.

packet loss can occur with your sniffer device

That's unlikely, sniffer is actually very close (3m with clear line-of-sight) to ESP.

I will add more logging information. Please allow me some time to do so.

Sure, thank you very much.

esp-zhp commented 1 month ago

Could you provide the example? If possible, I will use your example to try and reproduce the issue.thanks.

danergo commented 1 month ago

It will be very hard to reproduce it without the exact peer, unfortunately.

I think it will be faster to validate on our side. It might be some location preference too.

So even if you would try to reproduce it on your end, there shall be 'fake' peer(s) which might result different results.

If you agree, I wish to keep the test as it is, because it is 100% reproducible (although a bit slow).

danergo commented 1 month ago

@esp-zhp:

I will add more logging information. Please allow me some time to do so.

Do you have any idea about the timing of this?

Thank you!

esp-zhp commented 1 month ago

Could you run a test where, 1-after establishing a connection, the connection parameters are updated? Once the update is completed, perform a write operation.

2-If the "Sent Write Request, Handle: 0x0092" is not successfully sent, the application layer should retry after 5 seconds instead of immediately disconnecting. If the second attempt also fails, disconnect after another 5 seconds.


Additionally, I’ll provide you with a library that will print some HCI data for debugging purposes. libbtdm_app.zip


I’m still adding more logs, and I’ll send you an updated version of the library later(about one or two day later).

danergo commented 1 month ago

Thank you, we are started the test with this new libbtdm_app. Let's see, will provide you the logs, once it fails.

Filenames to share (myself-2-remember):

minicom_memleak_2412862179.cap
btmon_memleak_2412862179.pcap
sniff_memleak_2412862179.pcap
esp-zhp commented 1 month ago

@danergo ok, I will provide you with the updated library later.

esp-zhp commented 1 month ago

@danergo Do you have the test results on your side?

danergo commented 1 month ago

@esp-zhp: not yet!

It is running smoothly since [15/10/2024 10:13:55.481] (first logline's date from our app).

It logs constantly to debug port (your raw data IMHO). This might cause some timing delay in normal code execution which prevents this bug to appear. I'm not sure, just guessing, although we have never seen this module reach this far.

In case we won't have issues until end of this week, we will add more devices (both advertising, and classic bt) to see the performance.

Or, in case you have any suggestions, we are more than open to hear those.

Thank you

esp-zhp commented 1 month ago

From the sniffer log and HCI log you provided, it seems that there is an issue with the ESP32.

Additionally, I noticed that the "Sent Write Request, Handle: 0x0092" was not omitted but rather delayed, as it was only sent when you disconnected.

Moreover, every time an issue occurs, there is a connection parameter update present. So, I asked you to perform the following test changes:

1-After establishing a connection, update the connection parameters. Once the update is completed, perform a write operation.

2-If the "Sent Write Request, Handle: 0x0092" is not successfully sent, the application layer should retry after 5 seconds instead of immediately disconnecting. If the second attempt also fails, disconnect after another 5 seconds.

You conducted tests for a long time, and the issue did not reappear for the following two reasons:

1-The changes I requested altered the logic of when the problem occurred. 2-The logs I added affected some timing relationships, causing the issue to no longer occur. I tend to favor the first reason because I believe the problem may be related to the connection parameter updates. You could try modifying your code logic without using my library to see if the issue still exists.

Did you change your code logic (as I say) in your most recent test? ps:sniffer log image

image

danergo commented 1 month ago

@esp-zhp:

Did you change your code logic (as I say) in your most recent test?

We didn't change anything, since realizing that long GATT Write Requests are not being sent out.

Now, the code is still running without any single missed Write Request (or Indication).

We have never seen reaching this far with this configuration.

1-The changes I requested altered the logic of when the problem occurred.

So you did change the logic which correlates how these GATT Writes & Connection Updates behave?

Can you please provide me a new library with this change added and verbose logging disabled (or even better: make it configurable with API)?

Thank you!

danergo commented 1 month ago

@esp-zhp: still running without any single failure (all gatt write requests are sent out, and all indications are received).

Can you please provide me a new library with this change added and verbose logging disabled (or even better: make it configurable with API)?

danergo commented 1 month ago

@esp-zhp: still up, with 0 issues! :)

danergo commented 1 month ago

@esp-zhp:

Can you please provide me a new library with this change added and verbose logging disabled (or even better: make it configurable with API)?

I believe this now should be considered as fixed. Although we have to make extra sure that this fix is not caused by the verbose logging which adds load on cpu and (might) slightly modifies the scheduler.

esp-zhp commented 1 month ago

2024-10-25_102830.zip void esp_log_enable_hcievt_log(bool enable); void esp_log_enable_hcicmd_log(bool enable);

danergo commented 1 month ago

Hi, @esp-zhp:

I replaced the lib with your version:

~/esp-idf$ ll components/bt/controller/lib_esp32/esp32/
total 2284
drwxrwxr-x 2 daniel daniel   4096 okt   26 08:45 ./
drwxrwxr-x 3 daniel daniel   4096 okt    2 17:23 ../
-rw------- 1 daniel daniel 776612 okt   25 04:28 libbtdm_app.a
-rw------- 1 daniel daniel 774888 okt   14 15:02 libbtdm_app.a.logging
-rw-rw-r-- 1 daniel daniel 773808 okt    2 17:23 libbtdm_app.a.old

Then I tried to recompile, and it fails at the last step:

controller_hci_uart_esp32/main/controller_hci_uart_demo.c:63:5: error: implicit declaration of function 'esp_log_enable_hcievt_log' [-Wimplicit-function-declaration]
   63 |     esp_log_enable_hcievt_log(false);
      |     ^~~~~~~~~~~~~~~~~~~~~~~~~

This is the addition in our example compared to the stock "controller_hci_uart_demo" in this repo:

    esp_log_enable_hcievt_log(false);
    esp_log_enable_hcicmd_log(false);

    while (1) {
        extern uint16_t ke_get_heap_free_size(void);
        uint16_t free_size;
        free_size = ke_get_heap_free_size();
        ESP_LOGW(tag, "Free heap size=%d bytes", free_size);
        vTaskDelay(10000/portTICK_PERIOD_MS);
    }

Compilation fails with the "esp_log_enable..." function.

Can you please double check? Maybe we need to import something else?

esp-zhp commented 1 month ago

I test it,it works. extern void esp_log_enable_hcievt_log(bool enable); extern void esp_log_enable_hcicmd_log(bool enable); esp_log_enable_hcievt_log(0); esp_log_enable_hcicmd_log(0); log: image

danergo commented 1 month ago

Thank you, compiled, and works.

1 day passed without any issue after retesting began.

danergo commented 1 month ago

@esp-zhp, it failed again, (with logs disabled). So I believe this is indeed some timing issue.

Will share you the logs soon.

danergo commented 1 month ago

@esp-zhp: logs shared.

esp-zhp commented 1 month ago

@danergo I think this might be related to the connection parameter update. Could you run a test where, 1-after establishing a connection, update the connection parameters. wait the update is completed,and then perform a write operation.

2-If the "Sent Write Request, Handle: 0x0092" is not successfully sent, the application layer should retry after 5 seconds instead of immediately disconnecting. If the second attempt also fails, disconnect after another 5 seconds.

3-don‘t use my lib, origin lib is ok image

thanks.

danergo commented 1 month ago

@esp-zhp:

My code has no influence or knowledge of L2CAP messages.

It just:

  1. Connects, then AFTER it succeeded:
  2. Sends out a characteristic-write-req to 0x0093 with 0200 data, then AFTER confirmation has been received:
  3. Sends out a long characteristic-write-req.

Moreover, in case I try manually (with gatttool): this means im typing manually, no script involved here: 1.) Connect, then wait until connected 2.) Mtu 247 (gatttool denies client's mtu increase request, but this command sets it correctly), wait for confirm 3.) Char-write-req 0x0093 0200, wait for confirm 4.) Char-write-req long

This WORKS for 10-12hrs, but then suddenly long writes are not sent out anymore, UNTIL esp gets reset. And as you can see gatttool has no info on l2cap.

Now I'm more than sure gatttool is not changing anything under the hood after 10-12 hrs, moreover I don't have to restart the Linux, only the esp to get back to normal.

Moreover in case we enable the verbose log in your library, it hides this problem.

Again, this is something we can't control, and now it's more than likely that esp behaves different after running 10-12 hrs.

We can solve this by either running it wit verbose log, or restart it every hour. Neither is too effective.

What do you think?

esp-zhp commented 1 month ago

you may not have understood what I meant. 1-I've noticed the issue: your write request wasn't sent as expected. 2-I suspect this is related to the connection parameter update. so I'd like you to wait until the connection parameter update completes before performing the write operation, rather than writing immediately.-If the "Sent Write Request, Handle: 0x0092" is not successfully sent, the application layer should retry after 5 seconds instead of immediately disconnecting. If the second attempt also fails, disconnect after another 5 seconds.

-> I believe this approach might solve your issue.

danergo commented 1 month ago

Dear @esp-zhp:

I understand your points, let me rephrase my concerns here, reacting on yours:

1-I've noticed the issue: your write request wasn't sent as expected.

Please see these two below screenshots, taking into consideration my labels and explanations:

(For others: each screenshot consist of two logs: upper is recorded with a sniffer, and bottom is logged directly on the RPi which ESP32 is attached to)

Good commincation trial steps: # What? Source Destination SniffID DirectID
1 Exchange MTU Request from Slave to Master 2442 5228
2 Exchange MTU Response from Master to Slave 2446 5230
3 Write Request (0x0093) from Master to Slave 2448 5231
4 Connection Parameter update request from Slave to Master 2449 5234
5 Write Response (for 0x0093) from Slave to Master 2452 5241(!!!) Look at this!
6 Connection Parameter update response from Master! to Slave 2453 5237(!!!)
7 Write Request (0x0092) from Master to Slave 2457 5242

Key takeaways for good trial:

  1. slave sends Connection Parameter update request and does NOT wait for master to reply, but it also sends Write Response. We can not control this behavior, as client is an active product on the market.
  2. Master first sends out the Connection Parameter update response, then it sends out the Write Request
  3. Communication succeeds, 0x0092 Write request is being sent out (please see 7 in above table and in both sniffer and local log below!)

goodtrial

Okay, let's move on to the stage when ESP starts behaving wrong:

Failed commincation trial steps: # What? Source Destination SniffID DirectID
1 Exchange MTU Request from Slave to Master 106498 202392
2 Exchange MTU Response from Master to Slave 106503 202394
3 Write Request (0x0093) from Master to Slave 106505 202395
4 Connection Parameter update request from Slave to Master 106506 202399
5 Write Response (for 0x0093) from Slave to Master 106508 202406(!!!) Look at this!
6 Connection Parameter update response from Master! to Slave 106509 202402(!!!)
7 Write Request (0x0092) from Master to Slave !!MISSING!! 202407

Key takeaways for failed trial:

  1. Sequence is exactly the same in both good and failed trials! Only difference is that step7 is missing in snifflog, meaning ESP failed to transfer it out. Please see it in log below: local log shows, but sniffle log doesn't.
  2. Please also, refer back on the good trial: we don't wait there for "LE Connection Update Complete" message before sending out the long GATT Write Request, meaning ESP Controller handles the same sequences differently in these 2 scenarios.

failedtrial

Additional notes, regarding your recommendation of

If the "Sent Write Request, Handle: 0x0092" is not successfully sent, the application layer should retry after 5 seconds instead of immediately disconnecting.

  1. It's not the application layer, but the Client which is automatically disconnecting after a short period in case not hearing from us.
  2. No matter how many times we try, long writes are NOT being sent out at all until we reboot ESP.

For 2.) please see this new log (btmon_finalproof.cap, shared with your email too):

I followed these steps now:

  1. Connect to Client
  2. Set MTU (as gatttool by default denies the request from Client)
  3. Send sample data with length of 63 bytes 5 times: ALL GOOD
  4. Send sample data with length of 64 bytes controller doesn't ACK it (see ID#53)!

image

Key takeaway: This is the root cause of the problem: after 10-12 hours, ESP's controller does NOT respect MTU changes anymore, leaving the RX buffer on default 63 bytes therefore ESP is NOT able to process any longer messages until it's reset.

I hope you accept this as proof. To me this looks straightforward, but please let me know in case you need any further information/clarification. I really appreciate your help and efforts!

Thank you very much

danergo commented 1 month ago

leaving the RX buffer on default 63 bytes

I meant HCI rx buffer here.

danergo commented 1 month ago

@esp-zhp: did you have time for checking my last details here?

esp-zhp commented 1 month ago

@danergo sorry for later reply,I'm still analyzing the cause. After adding the HCI log, the issue stopped occurring, which complicates debugging. I'll add more debug logs to investigate further.

esp-zhp commented 1 month ago

sniffer: It is evident that the write request was not sent on time. However, it was still sent. image image

If there is no response within 3 seconds, could you resend the write request?

danergo commented 1 month ago

Unfortunately I can't resend, as the remote Client is disconnecting after about 1sec in case it doesn't hear from us. You only see this terminate command, because my app doesn't receive any notification about client's disconnect, so we implemented a (max)5 second timeout: after 5seconds, it is 100% impossible that we are still connected so it's safe to drop and close the connection.

Client is on battery so it has limited time for communication.

What is strange for me is that 63byte threshold: it really seems some default threshold. I can send anything below 64 length, and nothing from 64bytes length. That should be a great way to start finding some spots :)

danergo commented 3 weeks ago

Hi, @esp-zhp: do you have any news?

danergo commented 3 weeks ago

@esp-zhp: i have a new idea:

Creating a firmware in which we can turn on/off logs dynamically via debug uart (i.e.: not hardcode log setting into firmware). In this case we can start with hci logging turned off, then once issue occurs, we can turn on hci logging over debug port, without rebooting esp.

What do you think? Can this help? How shall we read from debug uart without disturbing the btdm controller?

esp-zhp commented 3 weeks ago

@danergo Sorry for the late reply again.

Based on your description, there is a version with HCI logging enabled. If HCI logging is always on, it works fine, but power consumption will be higher. For now, you can use the library with HCI logging enabled to continue your development.

You mentioned: “Creating a firmware that allows us to dynamically turn logging on and off via the debug UART (i.e., without hardcoding log settings into the firmware). In this case, we can start with HCI logging turned off, and then, if an issue occurs, enable HCI logging through the debug port without rebooting the ESP.”

I’m not sure if this approach will work, as I haven’t yet identified the root cause of the issue. I will update the library and plan to add debug information for suspected areas. When an issue occurs, you can dump this debug information (I’ll provide a new API for this), and I’ll use the data to help resolve the problem.

I’m still working on the debug library, and this may take some time. In the meantime, please proceed with your testing and development.

danergo commented 3 weeks ago

Okay, thank you!

danergo commented 2 weeks ago

Hi, @esp-zhp: did you have some time to progress?

We are running our main service with HCI logs enabled, and it's smooth. But I think this shall not be considered as a final solution.

Thank you!

esp-zhp commented 2 weeks ago

yes,I will give you lib this week

danergo commented 2 weeks ago

Dear @esp-zhp:

We are sending you a new "proof" capture: this time with:

This contain a LOT of garbage, please forgive us. Please focus on the ATT frames. You will see normal procedures of GATT writes and confirmations. Focus on 0x0092 Write Response!

You'll see: 0x0092 responses are coming until ID#634057 (last Write Response in log). Afterwards, you'll see no more Write Responses (for 0x0092).

As this time we have the debug log enabled, we can state, that simply enabling "HCI verbose log" doesn't solve this problem. (Luckily! At least, logging is not hiding this bug :) )

Please check our logs, maybe you'll see something interesting which can help you with this topic.

esp-zhp commented 2 weeks ago

Here’s the English translation and refinement:


I analyzed the HCI log file “minicom_newproof.cap” that you provided. Using internal tools, I converted the HCI log into a BTsnoop file format, which is compatible with software like Wireshark and "Wireless Protocol Suite."

At timestamp 654,633, there is an ATT_WRITE_REQ with handle 147:

ATT_WRITE_REQ  147  Not Mapped  0(C)  Unknown [3]  14  00:00:00.000063  2024/11/14 11:29:20.135370  147

I noticed that the controller did not subsequently receive any HCI data related to Attribute Handle: 146. In other words, the controller did not receive the write request (handle 0x92) issued by the Bluz host. This behavior appears abnormal. I don’t fully understand this behavior. Why didn’t the host continue to send the write request (handle 0x92)? Or did it send the request, but the controller simply didn’t receive it? image

ps:HCI log from ESP32(decoded) I will send you by email to protect your privacy.

esp-zhp commented 2 weeks ago

From the HCI log on your host side, it appears that Bluz sent a write request (handle 0x92), but the ESP32 controller either did not send the packet or failed to send it promptly. This seems to be an issue on the ESP32 controller's side.

However, from the ESP32 controller's perspective, it did not receive the write request (handle 0x92).

I’m unsure how the HCI log on Bluz was captured or whether it is entirely accurate. If this issue occurs, could restarting Bluz while keeping the ESP32 controller running resolve the problem?

danergo commented 2 weeks ago

@esp-zhp:

I noticed that the controller did not subsequently receive any HCI data related to Attribute Handle: 146. In other words, the controller did not receive the write request (handle 0x92) issued by the Bluz host. This behavior appears abnormal. I don’t fully understand this behavior. Why didn’t the host continue to send the write request (handle 0x92)? Or did it send the request, but the controller simply didn’t receive it?

That's exactly the root cause of the problem! Please note: all ATT 146 messages are long (longer than 63 bytes).

I have tested this many times: after ESP (or BlueZ) gets into this problematic stage, any write request (ATT handle doesn't matter) with data length above 63 won't send out (anything below or equal 63 will be sent out perfectly)!

Let's try to identify the wrong part here: 1.) BlueZ doesn't forwards the 146: can be, but highly unlikely, as btmon (part of BlueZ) still logs the long gatt writes even during the problematic stage. 2.) ESP doesn't receive the 146: can be, as there is no 146 in the decoded sniff.

For the 100% proof, let me wait for a couple of days to let the problematic stage arrive, and then I'll hook up the oscillator to the ESP's RX pin, to make you a final proof. :)

Thank you!

danergo commented 1 week ago

Hi, all (@esp-zhp).

We have concluded a final test with an O-scope tapped onto the TX pin of RPi (RX pin of ESP32).

Our assumption has been 100% validated now: RPi sends the long gatt write, but ESP32 doesn't acknowledge it.

Please see the shared log: it's a small one!

Facts:

  1. First we send a few short gatt writes (0x00001111): works OK (Write Request + Write Response)
  2. Secondly, we send a few medium length gatt writes (0x0000111222333444555): works OK (Write Request + Write Response)
  3. Lastly, we send long gatt writes (0x0000000011111111...eeeeeeeeffffffff): FAIL at ESP side!: it is not acknowledging the write.

For the last step (3), we have recorded the screen of the scope: this is the proof, that ESP's RX pin is receiving this long write (takes almost 700us), but I assume you will not see this in the debug log from ESP (shared with you too):

image

This is proved now an ESP-side bug, and it is present in the closed-source library, so we can't do anything else on our side to fix it.

esp-zhp commented 1 week ago

@danergo Thank you for your feedback.

  1. You've demonstrated with an oscilloscope that the host is sending data to the ESP32 controller, which is great. However, there's still a question: the oscilloscope lacks the functionality of a logic analyzer. This means we cannot be 100% certain that the transmitted data is correct unless we use a logic analyzer to parse the UART data from the ESP32's RX pin.
    That said, we currently have strong reason to believe that the ESP32 UART is not correctly receiving the data. While this cannot be confirmed with absolute certainty at the moment, it is highly probable that the ESP32 is failing to receive the data properly.

  2. Additionally, I reanalyzed the minicom_newproof.cap file you provided, which shows all the data received by the ESP32 controller.
    I found an anomaly (which is crucial):
    Normally, ACL data begins with 0x02, as per the specification. However, during the issue, the data received by the ESP32 was missing the first 5 bytes. Within the received data, "12 92 00" indicates a write request to handle 0x92, which is the data we need!

image image

  1. Continuing with debugging:
    (Assuming the host's transmitted data is 100% correct)
    The reason the write request wasn't sent out is that the UART failed to receive the data correctly and dropped part of it. Here are some potential points to investigate:

    3.1 Has the ESP32 entered sleep mode?
    By default, it doesn't unless you've modified the sdkconfig. If you have, could you share your sdkconfig file with us?

    3.2 Is UART flow control enabled on the ESP32?
    Please confirm whether CTS (GPIO 23) and RTS (GPIO 19) are being used. If not, UART issues may arise.

    3.3 Add some UART debugging:
    When the issue occurs (i.e., when the write request fails to send), print the UART registers. Here's a sample code snippet:

    #include "soc/uart_reg.h"
    #include "driver/uart.h"
    uint32_t int_raw_status = REG_READ(UART_INT_RAW_REG(CONFIG_BTDM_CTRL_HCI_UART_NO));
  2. A potential solution:
    The ESP32 controller supports both UART and VHCI methods.
    (Top) → Component config → Bluetooth → Controller Options → HCI mode

    Currently, the esp-idf/examples/bluetooth/hci/controller_hci_uart_esp32 example uses UART directly instead of VHCI, and it doesn't leverage the IDF's UART driver.
    For other BLE and BT examples, we typically use VHCI directly.
    Here are the relevant functions:

    /* VHCI */
    extern bool API_vhci_host_check_send_available(void);
    extern void API_vhci_host_send_packet(uint8_t *data, uint16_t len);
    extern int API_vhci_host_register_callback(const vhci_host_callback_t *callback);

    Currently, the issue lies in the ESP32 controller not receiving data (by uart) from the host correctly.
    Therefore, you can transmit HCI data via the VHCI interface instead of using the controller's UART.
    Specifically:

    • Implement a UART program using the IDF driver to receive data, then send it to the controller via the VHCI interface.
    • Similarly, data from the controller can also be returned through VHCI.

    VHCI has already undergone extensive testing, so I believe it can address your issue.

    This task can be undertaken by either you or us, though it will take some time.
    Additionally, I noticed that someone has already started working on this, but I haven’t tested their implementation yet.
    Here is the code link:
    [ESP-Hosted Network Adapter Bluetooth Slave](https://github.com/espressif/esp-hosted/blob/master/esp_hosted_ng/esp/esp_driver/network_adapter/main/slave_bt.c)

danergo commented 1 week ago

Thank you for extensive description.

I am more than sure that my long write is correct although I have shared a scope screenshot and not a signal analyser (measured shorter gatt writes, and scope showed shorter data flow).

However, we do have signal analyser as well, will provide you that evidence too.

Will also share our sdkconfig, sleep mode (as far as I remember) is enabled.

Will attach it here.

Thank you!

esp-zhp commented 1 week ago

Great! Make sure to print the relevant registers when an issue occurs. Additionally:

Disable sleep mode. Enable UART flow control.

esp-zhp commented 1 week ago

how do you think A potential solution?

danergo commented 1 week ago

Uart flow control is enabled.

Anyway, for VHCI we have just a slight knowledge. How shall we attach to the BlueZ? With btattach? Will vhci present a device in /dev?

Thank you. Will get back to you with details later today.

danergo commented 1 week ago

OMG, my sdkconfig is huge, compared to example's. Please find it attached here.

Many sleep configs are enabled (which might have been modified by us). Will do now 2 things:

1.) Signal-trace the RX pin for the HCI data 2.) Recompile the controller's code with default sdkconfig (or a slightly modified, without sleep).

We have high hopes about this sleep configuration that it can solve this problem :)

Thank you!

sdkconfig.zip

danergo commented 1 week ago

Hi, @esp-zhp:

I have some news!

We hooked up our signal tracer onto the RX pin, and see what happened. We measured two cases:

1.) gatt write request with data of 0000000011111111:

Correct

This is great, works OK.

2.) gatt write request with data of 0000000011111111...ffffffff:

Incorrect

As you can see, RPi is sending only 64bytes of data!

By this time, we thought this must be on RPi (BlueZ or kernel) side.

But we went a little further and reverted back our sdkconfig to the defaults (no sleep enabled now). So, we also had to reflash the ESP, and of course we had to restart BlueZ too. Signal tracer was still hooked onto the RX pin, and by some random idea we did another measurement (now gatt long write is working, as usual after reboot).

But then we have some amazing result:

image

As you can see, there are two "batches"! First batch is exactly 64bytes long. Delay between two batches is approximately 100us.

Then we pushed even further with a very long gatt write:

image

No need to say anything. Our SPI to UART component have a 64byte FIFO buffer.

Is is possible that ESP is holding down (or up) the CTS (or RTS?) pin, in order to prevent further data transmission because it must process the current batch?

In full UART communication, which pin is used (driven) by ESP in order to signal the RPi to prevent data transmission? RTS or CTS?

Thank you very much!

esp-zhp commented 1 week ago

Based on your feedback, the current issue seems to be that the Raspberry Pi is not transmitting data correctly, rather than a problem with the ESP32 receiving end. Is that correct?


When hardware and drivers support RTS and CTS, their behavior is automatically managed, and user intervention is typically not required.

If you wish to manually intervene, it should be feasible. Feel free to test it.
However, may I ask why you want to take this approach?


RTS and CTS Collaboration

RTS and CTS work together to achieve hardware flow control:

  1. When Device A is ready to send data, it pulls its RTS pin low to notify Device B.
  2. Device B checks if it can receive data. If it can, it pulls its CTS pin low to notify Device A.
  3. Once Device A detects a valid CTS signal, it starts transmitting data.
  4. If Device B finds it cannot continue receiving data during transmission (e.g., buffer full), it pulls CTS high to signal Device A to stop transmitting.

From the logs:

I (2168) CONTROLLER_UART_HCI: HCI UART1 Pin select: TX 5, RX 18, CTS 23, RTS 19 Baudrate: 921600

Pin CTS 23 is used to signal the Raspberry Pi to prevent data transmission.

danergo commented 1 week ago

Thank you.

Based on your feedback, the current issue seems to be that the Raspberry Pi is not transmitting data correctly, rather than a problem with the ESP32 receiving end. Is that correct?

Yes, it seems being on our side, so we wish to say a huge sorry and thank for your kind help through all of this.

We don't want to control the CTS/RTS behavior. We just wanted to check weather it's ESP not accepting more, or RPi can't send more.

uart_set_pin(CONFIG_BT_HCI_UART_NO, 5, 18, 19, 23);

This case we need to trace PIN 23 on ESP to see if it's need to stop us sending, right?

One final test we must do to confirm this is really on the RPi side: once it fails again, we need to signal trace the CTS too: it still can be that in failed case ESP is not pulling the CTS line low fast enough for RPi to continue sending and it halts the transmission (although it's fairly unlikely - BUT! In case we don't reset ESP after failed case, it's not getting back to normal, so still pretty mysterious).

Thank you!