espressif / esp-idf

Espressif IoT Development Framework. Official development framework for Espressif SoCs.

HCI: after MTU change, gatt client's answer is not pushed to HCI (IDFGH-13790) #14648

Closed. danergo closed this issue 1 day ago.

danergo commented 2 months ago

Answers checklist.

IDF version.

Latest

Espressif SoC revision.

NodeMCU-ESP-32S

Operating System used.

Linux

How did you build your project?

Command line with idf.py

If you are using Windows, please specify command line type.

None

Development Kit.

NodeMCU-ESP-32S

Power Supply used.

External 5V

What is the expected behavior?

Stable operation

What is the actual behavior?

A manual power cycle is needed every 12 hours.

hciconfig hci0 reset is timing out. btattach also times out.

The watchdog is enabled but does not trigger a reset. Coredump is enabled but no coredump is written. Verbose logging is also enabled, but only a few log items are shown.

Steps to reproduce.

BLE advertisement scanning with hardware filtering (based on device MAC and advertisement data) for at least 8 devices.

Every 5 minutes, try connecting to a classic (non-BLE) device which is out of range, so the connection always fails.

Occasionally connect to a BLE device which is in range, so the connection should succeed.

Roughly every 12 hours we have to manually reset the ESP; otherwise hci0 eventually goes down.

Before hci0 goes down, we can still connect to a BLE device, but we can't receive longer data from it.

(The BLE device asks us for an MTU increase and we accept it, but then we can't receive data. This happens ONLY after 10-12 hours of constantly stressing the ESP with the advertisement scanning and the 5-minute connection attempts to the unreachable device described above.)

I guess some buffer is overfilling, but I couldn't enable any practical logging in menuconfig.
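For reference, a rough host-side sketch of the connect loop described above (an illustration only, not our exact tooling: it assumes a BlueZ host with bluetoothctl and hciconfig on PATH, the MAC addresses are hypothetical placeholders, and the continuous filtered advertisement scanning runs separately):

```python
#!/usr/bin/env python3
"""Rough reproduction sketch. Assumptions: BlueZ host, bluetoothctl and
hciconfig available, advertisement scanning with hardware filtering runs
separately. The MAC addresses are hypothetical placeholders."""
import subprocess
import time

OUT_OF_RANGE_CLASSIC = "AA:BB:CC:DD:EE:FF"  # classic device left out of range
IN_RANGE_BLE = "11:22:33:44:55:66"          # BLE device that is in range

def run(cmd, timeout=30):
    """Run a command, tolerating failures and timeouts, and return its output."""
    try:
        return subprocess.run(cmd, capture_output=True, text=True,
                              timeout=timeout).stdout
    except subprocess.TimeoutExpired:
        return "<timeout>"

def hci_up():
    """Crude liveness probe: 'UP RUNNING' appears in hciconfig while hci0 is alive."""
    return "UP RUNNING" in run(["hciconfig", "hci0"])

while True:
    # Every 5 minutes: attempt to connect to the out-of-range classic device;
    # this is expected to fail every time.
    run(["bluetoothctl", "connect", OUT_OF_RANGE_CLASSIC])
    # Occasionally connect to the reachable BLE device and disconnect again.
    run(["bluetoothctl", "connect", IN_RANGE_BLE])
    run(["bluetoothctl", "disconnect", IN_RANGE_BLE])
    if not hci_up():
        print("hci0 is no longer UP RUNNING")
        break
    time.sleep(300)
```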

What do you suggest?

Debug Logs.

No response

More Information.

No response

esp-zhp commented 1 week ago

Since you plan to continue testing, you can track the following signals:

By monitoring these signals, I believe we can ultimately determine whether it's the ESP32 not accepting more data, or the Raspberry Pi being unable to send more.
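One illustrative way to capture that kind of signal over time is a small poller on the host that logs whether the hci0 counters keep moving. This is only a sketch and assumes a BlueZ host with hciconfig installed; substitute whichever counters you actually track:

```python
#!/usr/bin/env python3
"""Sketch: periodically sample hci0 counters so a stall can be correlated
with the HCI trace. Assumes a BlueZ host with hciconfig installed; the
60-second interval is arbitrary."""
import subprocess
import time

def sample():
    """Return (is_up, RX/TX counter lines) from `hciconfig hci0`."""
    out = subprocess.run(["hciconfig", "hci0"],
                         capture_output=True, text=True).stdout
    lines = [l.strip() for l in out.splitlines()
             if l.strip().startswith(("RX bytes", "TX bytes"))]
    return "UP RUNNING" in out, lines

prev = sample()
while True:
    time.sleep(60)
    cur = sample()
    up, lines = cur
    print(time.strftime("%H:%M:%S"), "up" if up else "DOWN", *lines)
    if cur == prev:
        # No counter movement for a whole interval while traffic should be
        # flowing suggests one side has stopped sending or accepting data.
        print("  -> counters did not move; possible HCI stall")
    prev = cur
```

If the counters stop moving while btmon still shows the host queuing data, that would point at the controller not accepting more; if the host itself stops sending, the limit is on the Raspberry Pi side.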

Looking forward to your further response.

danergo commented 1 week ago

Thank you, I really appreciate this.

The test is now running and we are waiting for the next hang. After it happens again, we will analyze these signals and report back here for sure.

Best regards until then

danergo commented 6 days ago

Hi!

We got a recommendation to update our kernel (from 6.1 to 6.12). Our SPI-to-UART silicon might have had a silicon bug, which is published as an erratum. Kernel developers worked on it, and it seems to be fixed in the latest kernel.

The bug caused FIFO corruption due to incorrect interrupt timing, and therefore lost data.

We are still not 100% sure, but the fact is that it has now been running for 18 hours without a single lost byte. Since there were cases where it lasted longer than that before failing, we will wait a couple more days; if there is still no lost byte or missed packet, we should consider this a "wontfix", because it was not caused by the ESP in any manner. In that case I wish to apologize again for taking your precious time.

Thank you!

esp-zhp commented 5 days ago

@danergo No problem at all. Thanks for your update. If there are any further developments, please let me know.

danergo commented 2 days ago

@esp-zhp:

We forwarded you our latest log. With the new kernel, communication between the host and the ESP is super smooth, and it has been running for more than 4 days without a single lost byte.

There were two "glitches" though, on which you might be able to give us some input:

In the HCI log (btmon_v612.pcap), we see "Hardware Error" events from the controller at IDs 423920 and 443764.

Can you check the minicom log (the debug log from the ESP, also shared) with regard to these two events? Do you see any unusual communication from our side, or any reason why the ESP is reporting this Hardware Error?

Please note that the logs are quite large, as we tried to stress the system as much as we could, disabling LE advertisement filtering completely.
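For anyone following along, a minimal sketch of how such Hardware Error events can be located in a plain btsnoop HCI capture (an illustration only: it assumes H4 framing as written by, e.g., hcidump -w; btmon's own monitor format needs different parsing, so record numbers here will not necessarily match the IDs above):

```python
#!/usr/bin/env python3
"""Sketch: scan a btsnoop HCI trace for Hardware Error events (event code
0x10). Assumes a standard btsnoop file with H4 (UART) framing; btmon's
native monitor format is different and would need other parsing."""
import struct
import sys

HARDWARE_ERROR_EVT = 0x10  # HCI Hardware Error event code

def scan(path):
    with open(path, "rb") as f:
        header = f.read(16)  # 8-byte magic, 4-byte version, 4-byte datalink
        if header[:8] != b"btsnoop\x00":
            raise ValueError("not a btsnoop capture")
        record_id = 0
        while True:
            rec_hdr = f.read(24)
            if len(rec_hdr) < 24:
                break
            _orig_len, incl_len, flags, _drops, ts = struct.unpack(
                ">IIIIq", rec_hdr)
            data = f.read(incl_len)
            record_id += 1
            # H4 framing: byte 0 is the packet indicator, 0x04 = HCI event.
            # Flags bit 0: 0 = host->controller, 1 = controller->host.
            if len(data) >= 3 and data[0] == 0x04 and data[1] == HARDWARE_ERROR_EVT:
                hw_code = data[3] if len(data) > 3 else None
                print(f"record {record_id}: Hardware Error, code={hw_code}, "
                      f"flags=0x{flags:x}, timestamp={ts}")

if __name__ == "__main__":
    scan(sys.argv[1])
```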

Thank you!

esp-zhp commented 1 day ago

Thank you for the detailed information. The original issue related to the Linux kernel appears resolved, so that can be considered closed for now.

Regarding the new issue, I will examine the provided btmon_v612.pcap and minicom logs for the two identified "Hardware Error" events at IDs 423920 and 443764. I'll check for any anomalies or potential reasons for these errors in the ESP debug log.

Since this is a new issue, it would be better to track it separately. Please create a new GitHub issue specifically for this problem. This will help us streamline the analysis and resolution process.

I'll provide feedback here as soon as I've reviewed the logs!

danergo commented 1 day ago

The original issue related to the Linux kernel appears resolved, so that can be considered closed for now.

Exactly, thank you! It has now been running for over a week (8 days) without a single byte of lost data, so we are pretty confident the original issue was caused by a kernel bug (which was related to a silicon bug).

So we are thankful to you for providing so much information on this topic.

We have now created a new issue for the hardware error: #14964.

Let's continue there, and finally close this one :)

Thank you.