Closed Ing-Dom closed 1 year ago
Is it possible that you get another Pico loaded with the Picoprobe (= CMSIS-DAP) firmware and connect it to the first board, then attach to it so that you can see where it crashes? (per docs)
Another useful thing would be to activate SPI logging to see at which point it crashes, or whether it crashes consistently at one point. In the Arduino IDE menu there should be Debug Level -> SPI, in PlatformIO it is
build_flags =
    -DDEBUG_RP2040_PORT=Serial
    -DDEBUG_RP2040_SPI
I have a Segger J-Link and Ozone, but I'm not so familiar with them yet. I'll give it a try.
Sadly there are reported problems with exactly J-Link and the newer GDB version we're using for now :( (downgrading core + package works, but that won't have the w5500 lwip, so it's pointless).
In any case, I do have a W5500 board and the two required Picos, so it should be no problem for me to reproduce it.
After how many loop iterations does it crash usually?
Uh, really? Too bad.. Where can I read more about that J-Link / GDB issue?
It depends: sometimes just 100, but sometimes 10,000.. Shorter delays seem to make it more likely.
If you are interested in a custom RP2040 + W5500 Board I made for OpenKNX I can provide you one, have some prototypes left. - just let me know. https://ibb.co/SxST79w
I have reproduced the hang with my Picoprobe setup. With debug optimizations, it hangs after 96 iterations. It is stuck in the SPI read/write blocking code.
So rx_remaining stays 1 in every iteration, but spi_is_readable() returns false all the time.
I'm not yet sure why the W5500 module doesn't seem to answer anymore in this specific configuration.. Or maybe it does answer but the Pico doesn't notice. I will need to hook up my logic analyzer to the SPI lines.
Rudimentary testing revealed an interesting dependency on the SPI frequency: The higher that is, the more iterations it can survive. The minimum SPI frequency for the W5500 chip is 0, so any frequency should work..
8 MHz: fails after 5000
4 MHz: fails after 700 [default]
400 kHz: fails after 270
40 kHz: fails after 60
4 kHz: fails after 4
0.4 kHz: hangs in init
Time to get a look at these SPI signals.
The SPI signals made no sense to me, as I saw activity on the SPI bus that reads the Socket0 RX data, when the setup() code should be in a delay(). This activity occurred every 20 milliseconds.
And then it hit me. The current implementation polls the W5500 chip in a separate FreeRTOS task, concurrent to the code running in setup() and loop(). And the SPI library has no mutexes to protect itself against race conditions. I think this is a plain and simple race condition on accessing the SPI object concurrently to LWIP, i.e. the LWIP task doing its every-20ms poll vs. "user code" doing an eth.isLinked() call, which also directly accesses the SPI object.
Small correction: This is using the "pico_async_context" library to schedule the work, which also works without FreeRTOS. The work is being done in a "low priority IRQ handler", so really all the SPI polling code happens inside an IRQ, which is even trickier to mutex correctly when it interrupts the setup() / loop() code at random times.
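To make the hang concrete, here is a small host-side simulation (plain C++, no Pico required) of one plausible way such an interruption could leave the blocking loop stuck with rx_remaining == 1. Everything in it is hypothetical: FakeSpi, transfer_blocking and nested_write_only are illustration names, not code from the core or the SDK, and it assumes that the interrupting transaction discards whatever is sitting in the RX FIFO. A spin limit replaces the infinite loop so the sketch terminates.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <functional>

// Hypothetical stand-in for the SPI peripheral: every byte written
// clocks exactly one byte into the RX FIFO.
struct FakeSpi {
    std::deque<uint8_t> rx_fifo;
    void write(uint8_t b) { rx_fifo.push_back(b); }
    bool is_readable() const { return !rx_fifo.empty(); }
    uint8_t read() {
        assert(!rx_fifo.empty());
        uint8_t b = rx_fifo.front();
        rx_fifo.pop_front();
        return b;
    }
};

struct Result {
    bool completed;       // false if the loop gave up after max_spins
    size_t rx_remaining;  // bytes the loop was still waiting for
};

// Same shape as a blocking write/read loop: keep going until every byte
// that was sent has also been read back. `maybe_irq` is invoked between
// the write step and the read step, where an interrupt could fire.
Result transfer_blocking(FakeSpi& spi, size_t len,
                         const std::function<void(FakeSpi&)>& maybe_irq,
                         long max_spins = 100000) {
    size_t tx_remaining = len, rx_remaining = len;
    for (long spins = 0; rx_remaining || tx_remaining; ++spins) {
        if (spins >= max_spins) return {false, rx_remaining};
        if (tx_remaining) { spi.write(0x00); --tx_remaining; }
        maybe_irq(spi);
        if (rx_remaining && spi.is_readable()) { spi.read(); --rx_remaining; }
    }
    return {true, 0};
}

// Assumed behaviour of the interrupting code: a write-only transaction
// that drains the RX FIFO, including the byte the outer loop still needs.
void nested_write_only(FakeSpi& spi) {
    spi.write(0xAA);
    spi.write(0xBB);
    while (spi.is_readable()) spi.read();
}
```

Calling transfer_blocking(spi, 4, ...) with a hook that does nothing completes normally. With a hook that calls nested_write_only() once, right after the last byte was written, the loop never sees its final byte and gives up with rx_remaining == 1, which is the state observed in the debugger.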
I confirmed that an extremely crude code hack fixes this problem. Simply creating a volatile boolean variable (for a single-core environment) that is true while the W5500 driver does a transaction, and having the polling task / IRQ check that flag and just not do any work if it's already set, lets me run the previously hanging code indefinitely.
diff --git a/libraries/lwIP_w5500/src/utility/w5500.h b/libraries/lwIP_w5500/src/utility/w5500.h
index 396e8af..90b5237 100644
--- a/libraries/lwIP_w5500/src/utility/w5500.h
+++ b/libraries/lwIP_w5500/src/utility/w5500.h
@@ -38,6 +38,8 @@
#include <stdint.h>
#include <Arduino.h>
#include <SPI.h>
+extern void lock_spi();
+extern void unlock_spi();
class Wiznet5500 {
public:
@@ -156,6 +158,7 @@ private:
or register any functions, null function is called.
*/
inline void wizchip_cs_select() {
+ lock_spi();
digitalWrite(_cs, LOW);
}
@@ -165,6 +168,7 @@ private:
or register any functions, null function is called.
*/
inline void wizchip_cs_deselect() {
+ unlock_spi();
digitalWrite(_cs, HIGH);
}
diff --git a/libraries/lwIP_Ethernet/src/LwipEthernet.cpp b/libraries/lwIP_Ethernet/src/LwipEthernet.cpp
index 9a1087c..e6a7ac6 100644
--- a/libraries/lwIP_Ethernet/src/LwipEthernet.cpp
+++ b/libraries/lwIP_Ethernet/src/LwipEthernet.cpp
@@ -129,8 +129,20 @@ static async_context_t *lwip_ethernet_init_default_async_context(void) {
return NULL;
}
+static volatile bool spi_is_locked = false;
+void lock_spi() {
+ spi_is_locked = true;
+}
+void unlock_spi() {
+ spi_is_locked = false;
+}
+
// This will only be called under the protection of the async context mutex, so no re-entrancy checks needed
static void ethernet_timeout_reached(__unused async_context_t *context, __unused async_at_time_worker_t *worker) {
+ if(spi_is_locked) {
+ // are we interrupting running SPI transactions? Bye.
+ return; // don't do any work
+ }
assert(worker == &ethernet_timeout_worker);
for (auto handlePacket : _handlePacketList) {
handlePacket.second();
But this really needs a cleaner fix with concurrent / IRQ-interrupted SPI accesses.
Good Morning Max. thorough analysis, and well documented also. chapeau! ;)
"so really all the SPI polling code happens inside an IRQ, which is even trickier to mutex correctly when it interrupts the setup() / loop() code at random times."
That design decision does not sound ideal to me. Is it really necessary to handle this in the interrupt? I know that this solution also works when the user writes blocking code (which a lot of inexperienced users in the Arduino world do), but it adds some extra complexity.
I also experienced random "blocks" lasting up to 100ms since using the W5500. Sometimes my measurements indicated that nearly empty functions needed that time. That interrupt would perfectly explain this..
I'll do some measurements in the handlePackets routine..
Thanks, @maxgerhardt. That's exactly what I thought when I saw the issue. I'll look into it when I get home this coming weekend.
FWIW, we don't use FreeRTOS here. It's an async context, same as is used by the RPI team for the PicoW WiFi chip. Basically I just need to grab the context's mutex (already done elsewhere) before using SPI. If the periodic interrupt comes in during this time, it won't be able to get the mutex and will just reschedule and return immediately.
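A rough host-side sketch of that "try the lock, otherwise skip this tick" idea, in plain C++ so it can be run anywhere. All names (BusLock, BusGuard, Poller, is_linked) are made up for illustration, and std::atomic_flag merely stands in for the async context's mutex; this is not the actual fix in the core. The hook passed to is_linked() simulates the periodic interrupt firing in the middle of a user-side SPI transaction.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical try-lock guarding the shared SPI bus.
class BusLock {
public:
    // Returns true if the caller now owns the bus, false if it was busy.
    bool try_acquire() { return !flag_.test_and_set(std::memory_order_acquire); }
    void release() { flag_.clear(std::memory_order_release); }
private:
    std::atomic_flag flag_ = ATOMIC_FLAG_INIT;
};

// RAII holder for user code: the bus stays locked for the whole transaction.
// On a single core the periodic handler runs to completion, so user code
// never finds the lock held by it.
class BusGuard {
public:
    explicit BusGuard(BusLock& lock) : lock_(lock) {
        bool ok = lock_.try_acquire();
        assert(ok && "bus unexpectedly busy");
        (void)ok;
    }
    ~BusGuard() { lock_.release(); }
    BusGuard(const BusGuard&) = delete;
    BusGuard& operator=(const BusGuard&) = delete;
private:
    BusLock& lock_;
};

// Stand-in for the periodic worker that polls the W5500.
struct Poller {
    BusLock& lock;
    int polls_done = 0;
    int polls_deferred = 0;

    void on_timer() {
        if (!lock.try_acquire()) {  // someone is mid-transaction:
            ++polls_deferred;       // skip this tick, try again on the next one
            return;
        }
        ++polls_done;               // the real handler would talk to the chip here
        lock.release();
    }
};

// Stand-in for a user-side call such as isLinked(); `during_transaction`
// simulates the timer interrupt arriving while the bus is in use.
template <typename Hook>
bool is_linked(BusLock& lock, Hook during_transaction) {
    BusGuard guard(lock);
    during_transaction();
    return true;  // placeholder link state
}
```

With this shape, a tick that arrives during is_linked() only bumps polls_deferred and returns, and the next tick after the guard is released polls normally.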
"I also experienced random "blocks" lasting up to 100ms since using the W5500. Sometimes my measurements indicated that nearly empty functions needed that time. That interrupt would perfectly explain this.."
Just for completeness' sake: Make sure you don't use LWIP_W5500 "just because". After all, this will replace the hardware-accelerated TCP/IP sockets of the W5500 with slower software computations in the LWIP stack (that now has to compute the Ethernet, IP and TCP headers) and put the W5500 in "raw MAC" mode, plus you get the polling interrupt every 20ms on the same core as user code. The standard Ethernet library has more predictable timing (it only does stuff when you call Ethernet.poll() or write to sockets) and better performance (due to using the HW-accelerated TCP/IP capabilities). The only benefit I would see in using LWIP is when you have 2 interfaces, e.g. WiFi and LAN, or 2x LAN, and want to somehow route between those interfaces, or write unified code that doesn't care where the connection comes from, because LWIP handles it uniformly. And even in those situations, it could probably be handled with a bit more code to listen on both interfaces.
my thoughts about a solution:
what about checking link status periodically in the worker? isLinked would then only return the value retrieved by the worker. also, the lwip up and down functions to handle e.g. dhcp could be also called based on that info.
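A minimal sketch of how that could look, as host-side C++ rather than actual core code. LinkMonitor and its members are hypothetical names; read_phy_link stands in for whatever SPI register read the driver does today. Only poll() touches the bus, and isLinked() just returns the cached value, so user code never starts an SPI transaction of its own.

```cpp
#include <atomic>
#include <cassert>
#include <functional>
#include <utility>

class LinkMonitor {
public:
    explicit LinkMonitor(std::function<bool()> read_phy_link)
        : read_phy_link_(std::move(read_phy_link)) {
        assert(read_phy_link_ && "a link reader is required");
    }

    // Called only from the periodic worker, i.e. the one place that
    // already owns the SPI bus.
    void poll() {
        bool now = read_phy_link_();
        bool before = linked_.exchange(now);
        if (now && !before) ++ups_;    // here one could call e.g. lwIP's netif_set_link_up()
        if (!now && before) ++downs_;  // ... and netif_set_link_down()
    }

    // Safe from user code at any time: no bus access, just the cached state.
    bool isLinked() const { return linked_.load(); }

    int ups() const { return ups_; }
    int downs() const { return downs_; }

private:
    std::function<bool()> read_phy_link_;
    std::atomic<bool> linked_{false};
    int ups_ = 0;
    int downs_ = 0;
};
```

The trade-off is that isLinked() can be stale by up to one polling period (20ms with the current interval), which should not matter for link detection.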
@maxgerhardt thanks for that info. My reason for using lwIP is that it is well integrated in this core. I used Ethernet_Generic before, whose preprocessor-based configuration is a total mess...
What does Ethernet_Generic provide that arduino-libraries/Ethernet does not provide? Only switching out the used SPI object?
That's a good question, and thanks for asking it. To be honest, I googled "w5500 rp2040 library" and it popped up.. and looked suitable. I did not even try the stock Ethernet lib, because of that SPI problem. I think IGMP was missing, too.
You may indeed need the SPI retargeting patch from https://github.com/arduino-libraries/Ethernet/pull/134. Not sure about IGMP though (Or ICMP / Ping?)
no, IGMP. KNX IP protocol is based on multicast. And without IGMP you get in trouble with the newer smart switches doing IGMP snooping.
Another benefit of using lwIP is that it is also used for PiPicoW. I also want to use WiFi instead of wired ethernet for some devices..
Repeated calls of isLinked() crash my program (no serial output anymore, no LED flashing...).
It also occurs when I increase the delay between the isLinked() calls to 2000ms.
Reproduction steps: Use this simple sketch. Link is not necessary.