Closed danergo closed 1 day ago
Since you plan to continue testing, you can track the following signals:
By monitoring these signals, I believe we can ultimately determine whether it's the ESP32 not accepting more data, or the Raspberry Pi being unable to send more.
Looking forward to your further response.
Thank you, I really appreciate this.
The test is now running, and we are waiting for it to get stuck again. Once that happens, we will analyze these signals and report back here.
Best regards until then
Hi!
We were advised to update our kernel (from 6.1 to 6.12). Our SPI2UART chip may have a silicon bug that is published as an erratum. Kernel developers worked on it, and it seems to be fixed in the latest kernel.
The bug was a FIFO corrupted by wrong interrupt timings, which caused data loss.
We are still not 100% sure, but the fact is that it has now been running for 18 hours without a single lost byte. Since there were cases where it ran longer than that before failing, we will wait a couple more days; if there are still no lost bytes or missed packets, we will consider this "wontfix", because it was not caused by the ESP in any way. In that case I would like to apologize again for taking up your precious time.
Thank you!
@danergo No problem at all. Thanks for your update. If there are any further updates, please let me know.
@esp-zhp:
We have forwarded you our latest log. With the new kernel, communication between the host and the ESP is super smooth; it has been running for more than 4 days without a single lost byte.
There were 2 "glitches", though, on which you might be able to give us some input:
In the HCI log (btmon_v612.pcap), we see "Hardware Error"s from the controller at IDs:
Can you check the minicom log (the debug log from the ESP, also shared) with regard to these two events? Do you see any unusual communication from our side, or any reason why the ESP is reporting this Hardware Error?
Please note that the logs are quite large, as we tried to stress the system as much as we could by disabling LE ad filtering completely.
Thank you!
Thank you for the detailed information. The original issue related to the Linux kernel appears resolved, so that can be considered closed for now.
Regarding the new issue, I will examine the provided btmon_v612.pcap and minicom logs for the two identified "Hardware Error" events at IDs 423920 and 443764. I'll check for any anomalies or potential reasons for these errors in the ESP debug log.
Since this is a new issue, it would be better to track it separately. Please create a new GitHub issue specifically for this problem. This will help us streamline the analysis and resolution process.
I'll provide feedback here as soon as I've reviewed the logs!
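For anyone who wants to locate such events themselves, here is a minimal Python sketch of how a Hardware Error event can be recognized in HCI traffic. It assumes the raw H4 (UART) byte stream has already been extracted from the capture; the container format of btmon_v612.pcap is not handled, the function names are hypothetical, and the offsets it reports are byte offsets in that raw stream, not the IDs quoted above. The packet-type indicators and the event code 0x10 come from the Bluetooth Core Specification.

```python
import struct
from typing import Iterator, List, Tuple

# H4 (UART transport) packet-type indicators.
H4_COMMAND, H4_ACL, H4_SCO, H4_EVENT = 0x01, 0x02, 0x03, 0x04
# Header size in bytes that follows the indicator byte for each packet type.
HEADER_LEN = {H4_COMMAND: 3, H4_ACL: 4, H4_SCO: 3, H4_EVENT: 2}
HCI_EVT_HARDWARE_ERROR = 0x10  # HCI_Hardware_Error event code


def iter_h4_packets(stream: bytes) -> Iterator[Tuple[int, int, bytes]]:
    """Yield (offset, packet_type, header_plus_payload) for each H4 packet."""
    pos = 0
    while pos < len(stream):
        ptype = stream[pos]
        hdr_len = HEADER_LEN.get(ptype)
        if hdr_len is None:
            raise ValueError(f"unknown H4 packet type 0x{ptype:02x} at offset {pos}")
        header = stream[pos + 1:pos + 1 + hdr_len]
        if len(header) < hdr_len:
            return  # truncated header at the end of the capture
        if ptype == H4_ACL:
            # ACL: 2-byte handle/flags, then a 2-byte little-endian length.
            payload_len = struct.unpack("<H", header[2:4])[0]
        else:
            # Command, SCO and event headers end with a 1-byte length.
            payload_len = header[-1]
        end = pos + 1 + hdr_len + payload_len
        if end > len(stream):
            return  # truncated payload at the end of the capture
        yield pos, ptype, stream[pos + 1:end]
        pos = end


def find_hardware_errors(stream: bytes) -> List[Tuple[int, int]]:
    """Return (offset, hardware_code) for every Hardware Error event found."""
    hits = []
    for offset, ptype, body in iter_h4_packets(stream):
        # Event body layout: event code, parameter length, parameters.
        if ptype == H4_EVENT and len(body) >= 3 and body[0] == HCI_EVT_HARDWARE_ERROR:
            hits.append((offset, body[2]))
    return hits
```

The one-byte hardware code returned for each hit is vendor-defined, so it still has to be matched against the controller's debug log to learn the cause.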
"The original issue related to the Linux kernel appears resolved, so that can be considered closed for now."
Exactly, thank you! It has now been running for over a week (8 days) without a single byte of lost data, so we are pretty confident the original issue was caused by a kernel bug (which was related to a silicon bug).
We are thankful to you for providing so much information on this topic.
We have now created a new issue for the hardware error: #14964.
Let's continue there, and finally close this one :)
Thank you.
Answers checklist.
IDF version.
Latest
Espressif SoC revision.
NodeMCU-ESP-32S
Operating System used.
Linux
How did you build your project?
Command line with idf.py
If you are using Windows, please specify command line type.
None
Development Kit.
NodeMCU-ESP-32S
Power Supply used.
External 5V
What is the expected behavior?
Stable operation
What is the actual behavior?
A manual power cycle is needed every 12 hours.
hciconfig hci0 reset times out. btattach also times out.
The watchdog is enabled but does not trigger a reset. Coredump is enabled but no coredump is written. Verbose logging is also enabled, but only a few log items are shown.
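Until the root cause is found, the timeout described above can at least be detected automatically on the host. The following is a minimal Python sketch, assuming `hciconfig` is on the PATH and the adapter is `hci0`; the helper names and the 10-second default are hypothetical, not something from this report.

```python
import subprocess
from typing import Sequence


def run_with_timeout(cmd: Sequence[str], timeout_s: float) -> str:
    """Run cmd and classify the outcome as 'ok', 'failed', 'timeout' or 'missing'."""
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,  # the child is killed if it runs longer than this
        )
    except subprocess.TimeoutExpired:
        return "timeout"
    except FileNotFoundError:
        return "missing"  # the executable is not installed
    return "ok" if result.returncode == 0 else "failed"


def controller_responds(dev: str = "hci0", timeout_s: float = 10.0) -> bool:
    """Return True only if `hciconfig <dev> reset` completes successfully in time."""
    return run_with_timeout(["hciconfig", dev, "reset"], timeout_s) == "ok"
```

A supervisor script could call `controller_responds()` periodically and record the time of the first failure, which gives a precise timestamp to search for in the ESP debug log.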
Steps to reproduce.
BLE ad scanning with hardware filtering (based on device MAC and ad) for at least 8 devices.
Every 5 minutes, try connecting to a standard (non-BLE) device that is out of range, so the connection always has to fail.
Occasionally connect to a BLE device that is in range, so the connection should succeed.
Roughly every 12 hours we have to reset the ESP manually. Otherwise hci0 will eventually go down.
Before hci0 goes down, we can still connect to a BLE device, but we cannot receive longer data from it.
(The BLE device requests an MTU increase and we accept it, but then we cannot receive data. This happens ONLY after 10-12 hours of constantly stressing the ESP with the advertisement scanning above and the connection attempts to the out-of-range device every 5 minutes.)
I guess some buffer is overflowing, but I couldn't enable any practical logging in menuconfig.
What do you suggest?
Debug Logs.
No response
More Information.
No response