HCI: after MTU change, gatt client's answer is not pushed to HCI (IDFGH-13790)

danergo commented 2 months ago

Answers checklist.

[X] I have read the documentation ESP-IDF Programming Guide and the issue is not addressed there.
[X] I have updated my IDF branch (master or release) to the latest version and checked that the issue is present there.
[X] I have searched the issue tracker for a similar issue and not found a similar issue.

IDF version.

Latest

Espressif SoC revision.

NodeMCU-ESP-32S

Operating System used.

Linux

How did you build your project?

Command line with idf.py

If you are using Windows, please specify command line type.

None

Development Kit.

NodeMCU-ESP-32S

Power Supply used.

External 5V

What is the expected behavior?

Stable operation

What is the actual behavior?

A manual power cycle is needed every 12 hours.

hciconfig hci0 reset is timing out. btattach also times out.

The watchdog is enabled but it does not trigger a reset. Coredump is enabled but no coredump is written. Verbose logging is also enabled but only a few log items are shown.

Steps to reproduce.

BLE advertisement scanning with hardware filtering (based on device MAC and advertisement data) for at least 8 devices.

Every 5 minutes, try connecting to a standard (non-BLE) device that is out of range, so the connection always fails.

Occasionally connect to a BLE device that is in range, so the connection should succeed.

Roughly every 12 hours we have to reset the ESP manually. Otherwise hci0 eventually goes down.

Before hci0 goes down, we can still connect to a BLE device, but we can't receive longer data from it.

(The BLE device asks for an MTU increase and we accept it, but then we can't receive data. This happens ONLY after 10-12 hours of constantly stressing the ESP with the advertisement scanning and the 5-minute connection attempts to the inactive device described above.)

I guess some buffer is overflowing, but I couldn't enable any practical logging in menuconfig.

What do you suggest?

Debug Logs.

No response

More Information.

No response

danergo commented 1 month ago

Anyone please?! This is unbelievable. HCI times out in every single 10-12 hours and NO log is dumped/printed on debug serial.

I'm now on latest latest latest idf, with latest latest lib.

danergo commented 1 month ago

If I disable LE advertisement filtering, the ESP32 can't survive for more than 3 hours. It needs to be RESET every 3 hours.

Please someone say something! NO coredump, NO debug log, NOTHING on the serial console. Only a physical HW RESET helps, and only for 3 hours.

danergo commented 1 month ago

I got some updates - although you don't care - shame on you.

After constant stress load: LE scanning and occasionally gatt connection to a device.

Gatt device wants to increase MTU from 23 to 247 in every connection. We accept this.

But after a couple of hours, this is simply NOT working anymore: gatt device not answering.

Now a big update! After the stress test fails, I can still connect manually to this GATT device (via gatttool)! gatttool refuses the MTU request, so the MTU stays at 23, and the GATT device answers in this case.

I believe the GATT device would answer anyway (no matter whether we accept or refuse its MTU change request), but the ESP is NOT putting the answer onto HCI anymore until it gets a reset.

Does this make sense to you?

So:

Phase 1 (test): GATT client changes the MTU, we accept, GATT client replies and we receive it.
. . . after 10 hrs . . .
Phase 2 (test): GATT client changes the MTU, we accept, GATT client MIGHT answer but we NEVER receive it via HCI.
Phase 2 (workaround): don't accept the MTU request, and the GATT client's reply arrives on HCI.

Is it some buffer handling problem in esp32 lib?

Please folks.

danergo commented 1 month ago

Note: the GATT client's reply is not received in either of these cases:

When we accept client's mtu
When we initiate mtu change from server (with gatttool mtu 247).

Only in one single case will gatt client's reply reach us: when we don't accept the new mtu at all (AND we don't request MTU increase either)

danergo commented 1 month ago

Does anyone know somebody who can provide me with the source of libbtdm_app.a, or will I have to reverse engineer it? Maybe @BetterJincheng?

danergo commented 1 month ago

Ping

danergo commented 1 month ago

pong

esp-zhp commented 1 month ago

@danergo Hi, I sincerely apologize for the delay in getting back to you. The recent National Day holiday in China meant that I was out of the office, and I was unable to respond to messages during that time. I will check your issue, please give me some time...

danergo commented 1 month ago

@esp-zhp: sure. Thank you!

I have an addition: when the ESP starts this behavior (i.e. doesn't forward packets after MTU negotiation has been done), a 100% fix is to restart the ESP with its reset button (/en switch). However (update!), sometimes it's enough to do:

systemctl stop bluetooth.service
systemctl stop my-btattach.service   # this runs btattach to provide the hci0 interface
systemctl start bluetooth.service
systemctl start my-btattach.service

This (I believe) sends an HCI reset command, which seems to solve the issue for another 10-12 hours.
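For reference, an HCI Reset on the UART (H4) transport is a four-byte packet. The sketch below builds it. The function names `hci_make_opcode` and `build_h4_hci_reset` are hypothetical, and whether BlueZ really issues this command when the services are restarted is the commenter's assumption, not something confirmed in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define H4_PACKET_TYPE_COMMAND      0x01u   /* UART (H4) indicator for HCI commands */
#define HCI_OGF_CONTROLLER_BASEBAND 0x03u   /* opcode group of HCI_Reset */
#define HCI_OCF_RESET               0x0003u /* opcode command field of HCI_Reset */

/* Pack an opcode group field (6 bits) and an opcode command field (10 bits). */
uint16_t hci_make_opcode(uint8_t ogf, uint16_t ocf)
{
    return (uint16_t)(((uint16_t)ogf << 10) | (ocf & 0x03FFu));
}

/* Hypothetical helper: writes the H4-framed HCI_Reset command into buf.
 * Returns the number of bytes written (always 4). */
size_t build_h4_hci_reset(uint8_t *buf, size_t buf_len)
{
    assert(buf != NULL && buf_len >= 4);
    uint16_t opcode = hci_make_opcode(HCI_OGF_CONTROLLER_BASEBAND, HCI_OCF_RESET);
    buf[0] = H4_PACKET_TYPE_COMMAND;
    buf[1] = (uint8_t)(opcode & 0xFFu); /* opcode is little-endian on the wire */
    buf[2] = (uint8_t)(opcode >> 8);
    buf[3] = 0x00;                      /* parameter total length: no parameters */
    return 4;
}
```

The resulting bytes are 01 03 0C 00, which is what a btmon trace should show as the Reset command if the host sends one.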

esp-zhp commented 1 month ago

Currently, there isn't enough information for me to pinpoint the issue. I have a few suspicions: the ESP32 only supports BLE 4.2 HCI commands, but BlueZ might send newer HCI commands, causing a timeout. You can refer to https://github.com/espressif/esp-idf/issues/12650 for more information. Additionally, if the 'HCI reset command' doesn't resolve the issue, more debug information will be needed, such as all HCI command and event data. If the 'HCI reset command' can resolve your problem, that would be ideal.

danergo commented 1 month ago

HCI reset command can resolve my issue.

Other than that, I have thousands (if not millions) of lines of "btmon" logs.

BlueZ doesn't send any extra command (at least nothing extra is visible in the btmon logs).

In case hci reset fixes the mtu problem, how would it be ideal?

Thank you!

danergo commented 1 month ago

I have some news: while I was away from this device, ESP started behaving wrong again (same MTU problem).

But after constant thorough connection requests (1130 connection retrials for more than 3 hours!!!), it 'fixed' itself: now MTU negotiation doesn't ruin the incoming indication packets, but I believe after 10-12hours it will break again.

esp-zhp commented 1 month ago

If resetting the HCI resolves the MTU issue, I believe it's acceptable. In fact, MTU doesn't have an HCI command; it's only transmitted via ACL.

danergo commented 1 month ago

Sorry, I don't think it's acceptable:

MTU 23->247, every party accepts it.
Subscribe to indications -> we got acknowledge
GATT write -> we got confirmation
GATT read -> we can read
Indications below 24 -> we receive it
Indications above 23 -> we don't receive it -> this can be due to: A.) ESP fails to handle, or B.) Device moved out from range during this exact moment

Now, this device doesn't move a single millimeter; it stays in one place (as does the ESP).

My point here is that, from my application, I can't judge whether a device timeout is a real timeout or ESP misbehavior. Because of this, I can't accept that an HCI reset is a fair solution, sorry.

Anyway, in dmesg, I see these errors a lot:

[54034.156769] Bluetooth: hci0: Opcode 0x200d failed: -110
[54034.156891] Bluetooth: hci0: request failed to create LE connection: err -110

This is a timeout: the ESP doesn't answer my requests. It happens when the ESP reaches the problematic phase, i.e. when it accepts the higher MTU but won't forward longer messages to HCI.
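To read such dmesg lines, it helps to split the opcode into its group and command fields. The sketch below does that for 0x200d, which decodes to OGF 0x08 (LE Controller commands) and OCF 0x000D (LE Create Connection). The value 110 is ETIMEDOUT on Linux. The helper names are hypothetical and only illustrate the bit layout.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Linux errno value that dmesg prints as "-110" (ETIMEDOUT on Linux). */
#define LINUX_ETIMEDOUT 110

/* Upper 6 bits of an HCI opcode: the opcode group field. */
uint8_t hci_opcode_ogf(uint16_t opcode)
{
    return (uint8_t)(opcode >> 10);
}

/* Lower 10 bits of an HCI opcode: the opcode command field. */
uint16_t hci_opcode_ocf(uint16_t opcode)
{
    return (uint16_t)(opcode & 0x03FFu);
}

/* Hypothetical helper: formats a one-line description of a failed opcode.
 * Returns the snprintf result (length of the formatted text). */
int describe_hci_failure(uint16_t opcode, int kernel_err, char *out, size_t out_len)
{
    assert(out != NULL && out_len > 0);
    return snprintf(out, out_len, "OGF 0x%02X OCF 0x%04X %s",
                    (unsigned)hci_opcode_ogf(opcode),
                    (unsigned)hci_opcode_ocf(opcode),
                    kernel_err == -LINUX_ETIMEDOUT ? "timed out" : "failed");
}
```

Calling `describe_hci_failure(0x200d, -110, buf, sizeof buf)` therefore describes an LE Create Connection command for which the host saw no completion in time.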

esp-zhp commented 1 month ago

According to the log:

[54034.156769] Bluetooth: hci0: Opcode 0x200d failed: -110
[54034.156891] Bluetooth: hci0: request failed to create LE connection: err -110

The log you provided is too limited, so I couldn’t extract much useful information. However, I’d like to remind you that once the 'LE Create Connection' command is issued, the controller keeps attempting to establish the connection until it succeeds, as this action has a high priority. If the connection fails, you will need to send the 'LE Create Connection Cancel' command to stop the connection attempts.

PS: BLUETOOTH CORE SPECIFICATION Version 5.4 | Vol 4, Part E, page 2366

esp-zhp commented 1 month ago

Could you capture the packets to verify if the peer device is indeed sending Indications above 23? If the peer device doesn't send the Indications, then the ESP device wouldn't be able to receive them either.

If the issue still persists, I need more information to further diagnose the issue. Could you provide the complete HCI data?

danergo commented 1 month ago

Dear @esp-zhp:

I would happily provide any logs if they can help you with the diagnosis.

"LE Create connection" always succeeds. The MTU exchange request (from 23 to 247) also always succeeds. Shorter characteristic writes and their confirmations always succeed. Longer characteristic writes and their confirmations also always succeed. Shorter indications are also always received. Indications longer than 23 are received for 10-12 hours, then they are not forwarded to HCI anymore.

I doubt the client has any problems, for two solid reasons:

After ESP restart, everything starts working again (Longer-than-23 indications are arriving), without restarting the client.
When ESP is not forwarding the Longer indications, I tried with another bluetooth controller: it can perfectly receive the Longer indications from the client. Client hasn't changed in any way.

HCI data: I have a lot, all created by btmon -w. Is that okay for you? Can I share it privately?

esp-zhp commented 1 month ago

One more thing needs to be confirmed to narrow down the issue. Do you think your problem is related to classic Bluetooth? If you are not using classic Bluetooth, does the issue still persist? I'm responsible for BLE and don't have much knowledge about classic Bluetooth. If you believe the issue is related to classic Bluetooth, I can ask my colleagues who handle classic Bluetooth to assist you.

esp-zhp commented 1 month ago

Could you please send me all the HCI logs from the ESP32? It would be best to use GitHub so that other colleagues can also view them. If it's not convenient for you to share publicly on GitHub, you can also send them to my email (zhanghaipeng@espressif.com).

Also, do you have any packet capture devices on your side? It would be even better if you could capture the packets to confirm whether the ESP32 has sent an indication (ATT_HANDLE_VALUE_IND) when the MTU is above 23. PS: ATT_HANDLE_VALUE_IND:

danergo commented 1 month ago

@esp-zhp:

Sorry for the confusion! Let me clear things up!

I'm using ESP for Dual-Mode Bluetooth Controller (controller_hci_uart example from this repo).

The issue is related purely to BLE.

My RPi is connected via UART to this ESP. ESP is responsible for providing Bluetooth to this RPi (it has no onboard Bt).

RPi is attaching the ESP with btattach, therefore RPi sees hci0. BlueZ uses this hci0 interface to provide Bluetooth functionality to my app on RPi.

My app is constantly monitoring LE advertisements from a bunch of devices (about 10). No filtering is enabled. Occasionally (4-5 times per 10 hrs) my app connects to a remote client with GATT characteristics, notifications and indications.

This occasional "LE Create Connection" always succeeds, but the client sends an "MTU Exchange Request" after the connection is established (23->247). My app always accepts this new MTU and responds as intended. Then my app subscribes to indications by writing specific data to a GATT handle. This write is confirmed by the controller, and my app gets a reply from the client. Then my app writes long data to the client with a GATT char-write-req, and this also always gets acknowledged from the ESP side over HCI. In response to this large write, the client sends a long indication (longer than 23). This long indication arrives at my application for about 10-12 hours; after that, a reset of the ESP is needed (sometimes an HCI reset is enough, but not always: it usually times out in this phase).

If we don't reset, longer indications no longer arrive. All other details were provided earlier: if my app doesn't accept the new MTU after LE Create Connection, even the longer indication arrives (but in multiple parts: 23-23-23-10 lengths).
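As background for the "multiple parts" behavior: a Handle Value Indication carries a 1-byte opcode and a 2-byte handle, so at most ATT_MTU - 3 bytes of value fit into one PDU. The sketch below computes that limit and how many indications a sender would need if it splits a value itself. The function names are hypothetical, and the fragment lengths quoted in this thread depend on the peer's own framing, which is not documented here, so the sketch does not try to reproduce them.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ATT_DEFAULT_MTU    23u /* minimum (and default) ATT_MTU on LE */
#define ATT_IND_HEADER_LEN 3u  /* 1-byte opcode + 2-byte attribute handle */

/* Largest attribute value that fits in one Handle Value Indication. */
size_t att_indication_max_value_len(uint16_t att_mtu)
{
    assert(att_mtu >= ATT_DEFAULT_MTU);
    return (size_t)att_mtu - ATT_IND_HEADER_LEN;
}

/* Number of indications needed when the sender splits value_len bytes
 * into chunks that each fit the negotiated ATT_MTU. */
size_t att_indications_needed(size_t value_len, uint16_t att_mtu)
{
    size_t per_pdu = att_indication_max_value_len(att_mtu);
    return (value_len + per_pdu - 1) / per_pdu; /* ceiling division */
}
```

With the default MTU of 23 the limit is 20 value bytes per indication, and with an MTU of 247 it is 244, which is why a single long indication is only possible after the MTU exchange has been accepted.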

Does this change anything?

Thank you!

danergo commented 1 month ago

@esp-zhp: mail sent to you, would be appreciated if you could take a look at it.

esp-zhp commented 1 month ago

From the log you provided, the ESP32 is acting as a master role in the GAP layer and as a client role in the GATT layer. right？

In your dmesg log, it shows a connection failure at "Wed Oct 9 10:31:15 2024": "[Wed Oct 9 10:31:15 2024] Bluetooth: hci0: Opcode 0x200d failed: -110 [Wed Oct 9 10:31:15 2024] Bluetooth: hci0: request failed to create LE connection: err -110"

I tried to find the failed connection HCI command and event in the HCI log you provided, but I couldn’t locate the connection establishment command at "Wed Oct 9 10:31:15 2024."

There are two possibilities:

1- The timestamps in the dmesg log and HCI log do not match. Please check why they aren’t aligned.
2- The BlueZ host didn’t send the create connection HCI command. If the BlueZ host didn’t issue the create connection command, there might be a bug in BlueZ, and you should investigate it further from the BlueZ side.

Additionally, I didn’t find any connection failure information in the HCI log.

The last connection attempt was at 09:07:25, and it was successful. After that, the HCI log only shows continuous scanning without any further connection attempts. ps： 09:07:25 create connect

scan

esp-zhp commented 1 month ago

First, please check why the BlueZ host didn’t issue the create connect HCI command.

Do you have the ESP32 terminal logs on your side? The ESP32 terminal logs might contain some useful information.

Here’s an example of the ESP32 terminal log(I want the full log):

danergo commented 1 month ago

Hi, @esp-zhp:

Do you have the ESP32 terminal logs on your side? The ESP32 terminal logs might contain some useful information.

Yes, I do have esp terminal logs yes, will attach here.

First, please check why the BlueZ host didn’t issue the create connect HCI command.

It is issuing the "LE Create Connection", but "btmon" recorded the timestamps in UTC, while "dmesg" output has 2 hours later timings.

From the log you provided, the ESP32 is acting as a master role in the GAP layer and as a client role in the GATT layer. right？

Yes, I believe this is the correct terminology.

The timestamps in the dmesg log and HCI log do not match.

That's correct, there is a 2hour difference:

I tried to find the failed connection HCI command and event in the HCI log you provided, but I couldn’t locate the connection establishment command at "Wed Oct 9 10:31:15 2024."

You will find it at 08:31:15 in the HCI log (packet no: 37761: LE Create Connection).
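The two-hour difference can be removed mechanically before comparing the two logs. The sketch below shifts a time of day by a fixed offset and wraps around midnight. The function names are hypothetical, and the offset of -2 hours is taken from this comment rather than derived from the logs.

```c
#include <assert.h>

#define SECONDS_PER_DAY 86400L

/* Convert a wall-clock time to seconds since midnight. */
long hms_to_seconds(int hour, int minute, int second)
{
    assert(hour >= 0 && hour < 24);
    assert(minute >= 0 && minute < 60);
    assert(second >= 0 && second < 60);
    return hour * 3600L + minute * 60L + second;
}

/* Shift a time of day by offset_seconds, wrapping around midnight. */
long shift_time_of_day(long seconds_of_day, long offset_seconds)
{
    long shifted = (seconds_of_day + offset_seconds) % SECONDS_PER_DAY;
    return shifted < 0 ? shifted + SECONDS_PER_DAY : shifted;
}
```

Shifting the dmesg time 10:31:15 by -2 hours gives 08:31:15, the time at which the LE Create Connection appears in the btmon log.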

Additionally, I didn’t find any connection failure information in the HCI log. The last connection attempt was at 09:07:25, and it was successful. After that, the HCI log only shows continuous scanning without any further connection attempts.

Please check the problematic parts: from 15261 - 47498 (07:03:07 - 09:07:25 in HCI log): constant, thorough connection attempts. LE Create Connection succeeds! GATT writes confirmed! Short GATT Indications are received! But not a single sign of the long indication.

You can see a working example starting in 14584 (06:59:55,57) please pay attention to long indication for this connection in 14611 (06:59:55,69): 86 bytes long, response for our "0x0092" GATT write request in 14608 (06:59:55,64).

Connection to this client is always done by:

1.) We send LE Create Connection
2.) We Accept (by default)/Decline the Client's new MTU of 247
3.) We send GATT Write Request (Handle: 0x0093) -> Write Response received from Client (short length)
4.) We send GATT Write Request (Handle: 0x0092) -> Write Response received from Client (short length)
5.) Client sends GATT Handle Value Indication (Handle: 0x0092), long length -> We send Handle Value Confirmation
6.) Disconnect

Now, with correct sequences at the beginning of the HCI logs, you can investigate this behavior, then you will see the problematic parts from 15261 - 47498 (07:03:07 - 09:07:25 in HCI log): step5 is missing. There are more than 2 hours, and more than 30000 HCI Packets trying to connect to this Client. During this period, any other device can perfectly connect to the same Client (therefore Client is behaving correctly). Also, during this period, in case I deny the MTU change, ESP will forward the indication in step5, with multiple indications (23-23-23-17).

Thank you very much, I appreciate your time spent on this.

esp.log

esp-zhp commented 1 month ago

1- Create connection

In the HCI log at line 37761, I noticed that the connection was not successfully established. I also observed that the BlueZ host canceled the connection after 4 seconds.

I believe it’s normal for a connection to occasionally fail within 4 seconds, as there could be interference over the air.

I reviewed the subsequent logs and saw that the connection was successfully established later. Therefore, I didn’t find any issues.

2- "GATT writes confirmed! Short GATT Indications are received! But not a single sign of the long indication."

At line 14593, everything works well, but at line 15312, there is no indication.

I don’t believe this is an issue with the ESP32. The ESP32 controller does not differentiate between the types and contents of ATT packets; all packets are reported to the host for processing. Since other ATT packets are still being reported to the host normally, I conclude that the ESP32 controller did not receive the indication. I think the peer device didn’t send the indication at all, so the ESP32 controller couldn’t receive it. BlueZ wouldn't receive the indication either.

line 14593（work well）

line 15312（no indicate）

You should check why the peer device didn’t send the indication. You can debug this based on the following points:

1- Why does the peer device consistently send MTU requests?
2- Are the contents of the write operations we perform on the peer device all correct?

I suspect this is an issue with the application layer logic, not a bug in the Bluetooth protocol stack. I noticed that the write content sent by the ESP32 is not entirely consistent. For example:

esp-zhp commented 1 month ago

Please conduct further investigation to check why the peer device did not send the indication. So far, I have not found any bugs in the ESP32.

danergo commented 1 month ago

In the HCI log at line 37761, I noticed that the connection was not successfully established. I also observed that the BlueZ host canceled the connection after 4 seconds.

It's my app: I have a 4 second timeout, so in case there is no answer from the client within 4 seconds, my app requests the cancellation in order to perform other tasks.

I reviewed the subsequent logs and saw that the connection was successfully established later. Therefore, I didn’t find any issues.

Yes, sometimes Create Connection doesn't get reply, so my app cancels it. This is not a problem at all, and this is not the issue here.

I think the peer device didn’t send the indication at all, so the ESP32 controller couldn’t receive it. BlueZ wouldn't receive the indication either.

This is not correct: I have 3 other Bluetooth controllers (phones, notebooks) which can ALL connect to this client, i.e. the client sends the notification correctly to all other devices. I assume it's not making an exception for the ESP, why would it? :)

1-Why does the peer device consistently send MTU requests?

It is its protocol, I can't control it, it is a product available in the market. But! In case my app declines the MTU change, ESP forwards the indication in multiple parts. Please note, for the first 10-12 hours, ESP also forwards the long indication (with the higher MTU accepted beforehand).

2-Are the contents of the write operations we perform on the peer device all correct?

Yes, totally correct (but it's a difficult, custom designed protocol by them, I can't change it).

I believe some buffer in bt_lib overruns, and then it simply drops the long indications without any single notice. (But I don't have access to that library's source to verify).

This is a rather disturbing issue: for 10-12 hours the ESP works as intended (so it can negotiate with this client correctly), but then we have 2 choices:

1.) Reset the ESP (disconnect power, then reconnect)
2.) Wait for 3 hours for it to get back to normal

This sounds to me like a counter issue: after 10-12 hours it fills up, and then after 3 hours it gets cleared. Again, without a single sign in the debug log or the HCI log.

danergo commented 1 month ago

So again: peer device I'm 100% sure it sends the notification. It does for every single other Bt device (at least 3).

danergo commented 1 month ago

Oh, and one more fact that strengthens my assumption:

If I reset the ESP it will immediately receive the long indication when I try connecting.

esp-zhp commented 1 month ago

So again: peer device I'm 100% sure it sends the notification. It does for every single other Bt device (at least 3).

do you have packet capture device？

danergo commented 1 month ago

I have an unused extra esp32 around, yes.

(How) can I make a packet capture device with it?

esp-zhp commented 1 month ago

No, you need a packet sniffing device, and I would like to analyze the over-the-air packets. However, the ESP32 is unable to capture packets transmitted by other devices. Only over-the-air packets can confirm whether the peer device has sent an indication. This is very important; I need to obtain an accurate result.

danergo commented 1 month ago

Can you suggest me some devices which are capable of doing this please?

esp-zhp commented 1 month ago

1- https://fte.com/
2- https://www.nordicsemi.com/Products/Development-tools/nRF-Sniffer-for-Bluetooth-LE

https://item.taobao.com/item.htm?abbucket=14&id=718103919140&ns=1&priceTId=213b807717285643582898863e165e&skuId=5083524045752&spm=a21n57.1.item.2.34b3523cFBMa1h&utparam=%7B%22aplus_abtest%22%3A%22cd95b5a4f0ac2a3a6ae15fb52ef61743%22%7D&xxc=taobaoSearch

danergo commented 1 month ago

Will need to check. Is it possible that an older Android phone can do this?

danergo commented 1 month ago

@esp-zhp:

Please confirm: we have a TI CC2652 Launchpad. Along with TI's packet sniffer: https://www.ti.com/tool/PACKET-SNIFFER

Will you accept this as a proof if we set this up for sniffing packets?

danergo commented 1 month ago

I found a better one: https://github.com/nccgroup/Sniffle.

Will do the sniffing soon, and provide you the capture file as proof.

esp-zhp commented 1 month ago

I look forward to your feedback, as it will be more beneficial for my further analysis.

danergo commented 1 month ago

Thank you.

The above firmware can perfectly capture all air packets. I'm now running the same test with sniffing on a different computer, while also recording the ESP's HCI log. Once the ESP gets stuck in the problematic phase again, I'll share these captures with you.

Once the ESP fails to receive the long indication again, I have high hopes that this sniff will demonstrate the following: ESP-HCI log: LE Create Connection, Char-Write-Reqs and Confirmations are all correct, but the long indication is missing. Sniffer log: will show that the long indication was actually sent out by the client (but is missing from the ESP's HCI).

Will you need anything else to be sniffed?

esp-zhp commented 1 month ago

Great. I want the ESP-HCI log, the ESP terminal log, and the capture of all air packets.

danergo commented 1 month ago

Hi, @esp-zhp:

Great news (not so great): the ESP got stuck again, but this time we have logs from the RPi that the ESP is attached to (btmon -w), and also a generic radio log from a different computer.

Shared to your email directly:

minicom_4zhp: minicom log from ESP's debug port (0 bytes, nothing has been communicated)
btmon_4zhp: btmon log from RPi, which ESP is attached to
sniff_4zhp: generic bluetooth log with TI's launchpad

Let's detail these!

btmon_4zhp

Packet#105 (2024-10-10 15:23:33,434487): LE Create connection of a successful communication sequence to CLIENT
Packet#131 (2024-10-10 15:23:34,590472): Long Indication from CLIENT
Packet#5219 (2024-10-10 15:56:15,873769): LE Create connection of a successful communication sequence to CLIENT
Packet#5245 (2024-10-10 15:56:16,045126): Long Indication from CLIENT
There are many successful sequences until 2024-10-11 13:37:26
Packet#202383 (2024-10-11 13:37:34,093880): LE Create connection of an unsuccessful communication to CLIENT
Packet#202407 (2024-10-11 13:37:34,943420): We send Write Request with handle 0x0092
Packet#202409 (2024-10-11 13:37:39,956998): Disconnect (by my app, since 5 seconds passed without reception of the indication)
Restarted my application on the RPi: 2024-10-11 13:39:18 (Packet#202689), but this did not solve the issue:
Packet#202773 (2024-10-11 13:39:25,413762): LE Create connection of an unsuccessful communication to CLIENT
Packet#202798 (2024-10-11 13:39:30,804982): Disconnect (by my app, since 5 seconds passed without reception of the indication)
Made an HCI reset on the RPi: 2024-10-11 13:40:14 (Packet#202895), which solved the issue:
Packet#203172 (2024-10-11 13:40:33,724676): LE Create connection of a successful communication sequence to CLIENT
Packet#203196 (2024-10-11 13:40:34,592374): We send Write Request with handle 0x0092
Packet#203199 (2024-10-11 13:40:34,633022): Long Indication from CLIENT

Now, let's move onto the generic bluetooth sniff, which will give us (I hope) some clue, on what's going on. I'm matching the sniff log to the btmon details above:

sniff_4zhp

The sniff was started a little later (by a few minutes), so it doesn't contain the raw data for Packet#105
|BTMON#5219| Packet#2437 (2024-10-10 15:56:15,899178): CONNECT_IND
|BTMON#5245| Packet#2462 (2024-10-10 15:56:16,049369): Indication
|BTMON#202383| Packet#106494 (2024-10-11 13:37:34,044808): CONNECT_IND
|BTMON#202407| Packet#107145 (2024-10-11 13:37:39,110339): Write Request: HUGE PROBLEM! The ESP does NOT seem to send out the Write Request when it is asked to. I believe this Write Request is only sent out because my app is terminating the connection.
|BTMON#202409| Packet#107147 (2024-10-11 13:37:39,111302): LL_TERMINATE_IND
Same thing happens at the next one, after I restarted my app
After HCI reset, everything went back to normal:
|BTMON#203172| Packet#110923 (2024-10-11 13:40:33,672257): CONNECT_IND
|BTMON#203196| Packet#110943 (2024-10-11 13:40:33,747061): Sent Write Request
|BTMON#203199| Packet#110949 (2024-10-11 13:40:33,776733): Long Indication from CLIENT

So now, for me it seems, that after MTU change, GATT Write Request is NOT being sent out from ESP, therefore CLIENT won't answer in time (because it was not asked to do anything).

After HCI reset, "GATT Write Request" is sent out immediately. This seems a really valid bug to me.
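The delay can be quantified from the timestamps listed above, even though btmon and the sniffer ran on different computers with different clocks. The sketch below estimates the clock offset from one healthy event pair and then computes how much later than that baseline a packet reached the air. The function names are hypothetical, and the approach assumes that the two clocks did not drift noticeably within the few minutes between the events and that the healthy pair includes only normal host-to-air latency.

```c
#include <assert.h>
#include <stdint.h>

/* Microseconds since midnight for a timestamp such as 13:37:34,943420. */
int64_t tod_us(int hour, int minute, int second, int32_t micros)
{
    assert(micros >= 0 && micros < 1000000);
    return ((int64_t)hour * 3600 + (int64_t)minute * 60 + second) * 1000000 + micros;
}

/* Offset of the sniffer clock relative to the btmon clock, estimated from
 * one event pair that is believed to be healthy. */
int64_t estimate_clock_offset_us(int64_t btmon_us, int64_t sniffer_us)
{
    return sniffer_us - btmon_us;
}

/* Extra time a packet needed to reach the air, compared with the healthy
 * baseline pair that was used to estimate the clock offset. */
int64_t extra_air_delay_us(int64_t btmon_us, int64_t sniffer_us, int64_t clock_offset_us)
{
    return (sniffer_us - btmon_us) - clock_offset_us;
}
```

Using the healthy Write Request after the HCI reset (BTMON#203196 at 13:40:34,592374 and Packet#110943 at 13:40:33,747061), the offset is -845313 microseconds. For the failing Write Request (BTMON#202407 at 13:37:34,943420 and Packet#107145 at 13:37:39,110339), the extra delay is 5012232 microseconds, about 5.01 s, which lines up with the app's 5-second timeout and the disconnect in BTMON#202409.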

esp-zhp commented 1 month ago

The situation has become even more confusing. Initially, my goal was to analyze the packet capture file to determine why the indicate wasn't received. There are two possible reasons for this:

The peer device did not send the indicate so the ESP32 couldn't receive it.
- In this case, the issue lies with the peer device.
The peer device sent the indicate, but we didn't receive it.
- In this case, the issue lies with the ESP32.

However, the log you've just provided introduces a new issue: "GATT Write Request is NOT being sent out from ESP."

danergo commented 1 month ago

@esp-zhp: what do you suggest as next step? Can you check the gatt write mechanism in the lib?

As the situation is actually strange, especially if we consider all the earlier assumptions:

For the first 10-12hours, all GATT Write Requests will be sent out by ESP
After 10-12 hours, the ESP will not send out any GATT Write Request longer than 23
If we do either an HCI Reset or an ESP power cycle, the ESP will again send out the longer GATT Write Requests
If we do neither of the above (step 3) but wait for 3-4 hours, the ESP "fixes" itself and starts sending out longer GATT Write Requests again (for 10-12 hours).

This seems a bug or memory leak in the library.

esp-zhp commented 1 month ago

@danergo If it's a memory leak issue, I can provide you with a library to help identify it. Are you still using this version? I can generate a library for you.

In my library, I log all the information about memory that has been allocated but not released. Additionally, you can also retrieve the internal heap size of the controller.

danergo commented 1 month ago

Hi @esp-zhp:

Yes, I'm using this version.

esp-zhp commented 1 month ago

In fact, I don't really believe that a memory leak has occurred. I can provide you with the internal API, and you can use this API to check the remaining memory inside the controller. (The controller library you're currently using already includes this API, so there's no need for me to provide an additional library. If a memory leak does indeed occur, I will provide you with a new library to identify the specific line of code where the memory leak happened.) You can add this code in app_main:

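    /* Internal controller API, declared by hand because the thread gives no header for it. */
    extern uint16_t ke_get_heap_free_size(void);

    while (1)
    {
        /* Log the controller's free internal heap every 10 s; a leak would show as a steady decline. */
        uint16_t free_size = ke_get_heap_free_size();
        ESP_LOGW(tag, "free size %u bytes", (unsigned)free_size);
        vTaskDelay(10000 / portTICK_PERIOD_MS);
    }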

danergo commented 1 month ago

@esp-zhp: okay, will try this out.

Now, I got some new intuition: until now, we had the following two services running:

bluetooth.service
hciuart.service

hciuart.service is created by us, does nothing special, only calls "btattach" to get a hci interface.

However, bluetooth.service is providing bluetooth on top of kernel's bluetooth modules, while our application also uses low-level controlling of bluetooth (bind, socket, setsockopt, etc). So, in theory, bluetoothd and our app can disturb each other.

Now we have disabled bluetooth.service, and started a new test yesterday. Let's see how far it will go, and once ESP fails again, we will upload this new firmware with your logging option added.

Just wanted you to know that we need a couple of hours (at least) to move on.

Thank you very much for your efforts!

danergo commented 1 month ago

Okay, ESP32 stopped sending gatt writes again, so we compiled your code, and uploaded, and now restarted the test. Let's see how far can we go.

Free heap size is currently 27136 (our app is already running for 5mins).

danergo commented 1 month ago

Hi @esp-zhp:

ESP failed again, and we had to restart it. No memory leak found, i.e. the heap size almost always stayed at 27136 (sometimes a few bytes less, sometimes a few bytes more, but no serious change happened). Not even when we tried to send a long write.

Now we really don't have any clue about what's happening. Can you make a lib which prints all incoming HCI packets to the debug port? I.e. when the ESP receives an HCI GATT Write Request, we would like to see it in the debug log. Also, it would be great to see when an incoming HCI packet has been processed. E.g.:

W (17:12:30,563565) BTLIB: got HCI: ATT Write Request (Handle: 0x0092)
W (17:12:30,663565) BTLIB: HCI processed: ATT Write Request (Handle: 0x0092); result=success

Mostly we are just listening to LE advertisements, so this log won't be too verbose, but it would help a lot in diagnosing why the ESP is not handling our long write requests correctly after a couple of hours.

Can we print the current time (instead of elapsed time) in the debug port output? ESP_LOGW prints elapsed time.
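A minimal sketch of the requested timestamp format, assuming the firmware has a valid wall clock to read. The function name `format_time_of_day` is hypothetical. On the ESP32 the input would have to come from a clock that was set beforehand, and a controller-only firmware has no obvious time source, so the host would probably need to supply it. ESP-IDF's log configuration also includes a timestamp-source choice (RTOS tick vs. system time); check menuconfig for the exact option in your version.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Format a UTC epoch timestamp as HH:MM:SS,uuuuuu (time of day only).
 * Returns the snprintf result (15 when the output fits). */
int format_time_of_day(int64_t epoch_seconds, int32_t micros, char *out, size_t out_len)
{
    assert(out != NULL && out_len > 0);
    assert(epoch_seconds >= 0 && micros >= 0 && micros < 1000000);
    int64_t seconds_of_day = epoch_seconds % 86400;
    return snprintf(out, out_len, "%02d:%02d:%02d,%06d",
                    (int)(seconds_of_day / 3600),
                    (int)((seconds_of_day % 3600) / 60),
                    (int)(seconds_of_day % 60),
                    (int)micros);
}
```

The resulting string could then be passed as an ordinary `%s` argument to a log call, so that each line carries a time that can be matched against btmon directly.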

Thanks!

esp-zhp commented 1 month ago

Based on the provided HCI log: There were a total of 37 Sent Write Requests to Handle: 0x0092, but only 25 Received Write Responses. Therefore, 37 - 25 = 12 Write Responses were not received.

From the sniffer log analysis: There were 23 Sent Write Requests to Handle: 0x0092, and 21 Received Write Responses for Handle: 0x0092. This suggests that the peer device may have failed to respond to 23-21 = 2 Write Requests, or it did respond but the sniffer device missed capturing those packets.

Considering that packet loss can occur with your sniffer device, it seems more likely that the ESP32 did not send the Sent Write Requests to Handle: 0x0092. However, we cannot completely rule out an issue with the peer device either.

For now, let's assume that the ESP32 did not send the Sent Write Requests to Handle: 0x0092 and proceed with debugging based on this assumption.

ps：

espressif / esp-idf