georgerobotics / cyw43-driver

Other
79 stars 42 forks source link

[CYW43] Bus error condition detected #33

Open tjko opened 1 year ago

tjko commented 1 year ago

When trying to do some (debug) logging over syslog I ran into issues CYW43 SoC (or driver?) apparently crashing. Symptoms "[CYW43] Bus error condition detected" error messages flooding console output, and radio stopping working (cannot ping the device anymore and cannot toggle LED on Pico W anymore either).

Problem appeared after sending moderate amount Syslog UDP packets (less than 10 packets/second). (If slowing down rate of UDP packets being sent issue seems to go away...)

Errors observed on Pico W console output:

...
[CYW43] Bus error condition detected 0xffff
[CYW43] Bus error condition detected 0xffff
[CYW43] Bus error condition detected 0xffff
[CYW43] Bus error condition detected 0xf0
[CYW43] Bus error condition detected 0xffff
[CYW43] Bus error condition detected 0xe0ff
[CYW43] Bus error condition detected 0xff
[CYW43] Bus error condition detected 0xffff
[CYW43] Bus error condition detected 0xffff
[CYW43] Bus error condition detected 0xffff
[CYW43] Bus error condition detected 0xe0
...

More details here: https://forums.raspberrypi.com/viewtopic.php?t=340605

Issue seems reproducible with FanPico firmware on a standalone Pico W.

To reproduce checkout the "cyw43_buserror" branch: https://github.com/tjko/fanpico/tree/cyw43_buserror

$ git clone https://github.com/tjko/fanpico.git
$ cd fanpico
$ git submodule update --init
$ git checkout cyw43_buserror
$ mkdir build
$ cd build
$ cmake -DPICO_BOARD=pico_w -DFANPICO_BOARD=0804D ..
$ make

After loading fanpico.uf2 (or fanpico.elf) to Pico W, connect to console (USB or TTL Serial). Firmware uses SCPI like command set, so check that unit responds by issuing SCPI command:

*IDN?

Then configure WiFi and syslog server (no need to configure syslog server if DHCP provides it) and save to flash:

SYS:ECHO 1
SYS:WIFI:SSID myssid
SYS:WIFI:PASSWD mypassword
SYS:WIFI:SYSLOG 192.168.1.100
SYS:LOG 7
SYS:SYSLOG 7
CONF:SAVE
*RST

Unit should now reboot and connect to WiFi network. For this to work DHCP would need to provide NTP server so that Pico can get time (for syslog messages) and there be a syslog server (or at least some device that doesn't generate any ICMP responses on received UDP packets on syslog port...)

I've observed CYW43 driver to start producing Bus errors anywhere from 100-600 seconds...

Attached is sample console log captured (where 'crash' seems to occur around 184 second mark).

picow-cyw43-crash.log

tjko commented 1 year ago

If DHCP server does not provide NTP server information... it should work to specify NTP server manually

SYS:WIFI:NTP 192.168.1.1
peterharperuk commented 1 year ago

I don't much fancy setting up a "syslog server" to reproduce this. I'd hope that it would be reproducible with a simpler udp test? One thing you might want to consider is switch from using polling to "threadsafe background". This might be as simple as changing the initialisation and removing the poll. Work will be done in an interrupt as and when needed. It might be the polling that's causing the issue. Just an idea for something for you to try.

tjko commented 1 year ago

Thanks, I'll test if using the 'threadsafe background' mode makes difference.

This doesn't require syslog server, just some destination (sinkhole) that will 'accept' the UDP packets without generating ICMP (error) responses... If destination firewall 'drops' packets, that should do it... (since if target 'rejects' packets that should generate ICMP errors back to sender and presumably LwIP stack should detect the error and trigger some kind of error condition...)

tjko commented 1 year ago

Switching to 'threadsafe background' mode changes behavior, get steady stream of following errors from the driver:

[CYW43] got unexpected packet -9
[CYW43] got unexpected packet -9
[CYW43] got unexpected packet -9
[CYW43] got unexpected packet -9
[CYW43] got unexpected packet -9
[CYW43] got unexpected packet -9
[CYW43] got unexpected packet -9

Then eventually pico 'hangs' (serial console becomes unresponsive).

This was with following changes (I changed delay between UDP packets from 100ms to 10ms to get hang occur faster):

diff --git a/CMakeLists.txt b/CMakeLists.txt
index 7a0faab..d7e0226 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -107,7 +107,7 @@ if (PICO_CYW43_SUPPORTED) # set by PICO_BOARD=pico_w
     src/httpd.c
     )
   target_link_libraries(fanpico PRIVATE
-    pico_cyw43_arch_lwip_poll
+    pico_cyw43_arch_lwip_threadsafe_background
     pico_lwip_sntp
     pico_lwip_http
     )
diff --git a/src/network.c b/src/network.c
index 3cb7bd1..a929dc1 100644
--- a/src/network.c
+++ b/src/network.c
@@ -196,10 +196,10 @@ void wifi_poll()
                return;

-       cyw43_arch_poll();
+       /* cyw43_arch_poll(); */

        if (network_initialized) {
-               if (time_passed(&test_t, 100)) {
+               if (time_passed(&test_t, 10)) {
                        syslog_msg(LOG_INFO, "test message: %llu", get_absolute_time());
                }
        }

Could it be that this ends up in same 'bus error' condition but with the interrupt code result is hang (?)

peterharperuk commented 1 year ago

Try wrapping your call to syslog_msg in CYW43_THREAD_ENTER / CYW43_THREAD_EXIT, but yes, it might be that the chip is still unhappy and has stopped telling us there's work to do, although I'd still expect to see the bus error.

Later: Actually I see you calling cyw43_arch_lwip_begin/end - so you shouldn't need to call CYW43_THREAD_ENTER / CYW43_THREAD_EXIT.

peterharperuk commented 1 year ago

I suggest you get your code working in debug, add -DCMAKE_BUILD_TYPE=Debug to the cmake command line. I've had a go at fixing the debug compile issues but may have made a mistake.

peterharperuk commented 1 year ago

The lockup seems to be in printf (bad idea for us to print from an interrupt?), can you try commenting out the line, CYW43_WARN("got unexpected packet %d\n", ret); It looks bad that we seem to be getting junk data, but lets ignore that for now. With this change your code seems to run ok.

tjko commented 1 year ago

Yes, commenting out the warning seems to have fixed the hangs I was seeing. Thank you very much!

I have test running now over 3000 seconds, while usually it would hang within 300 seconds or so, I'll leave it running overnight, if it makes past 30000 seconds that should be more than enough proof that its fixed at least from my point of view...

I had gone down the rabbit hole thinking it was something I did on the second core, since disabling second core seemed to also 'fix' the hangs. This would seem to explain why I was seeing the hang happening on 'random' places in my code...

Change I made was:

diff --git a/src/cyw43_ll.c b/src/cyw43_ll.c
index d46dd68..03581fe 100644
--- a/src/cyw43_ll.c
+++ b/src/cyw43_ll.c
@@ -1211,7 +1211,7 @@ void cyw43_ll_process_packets(cyw43_ll_t *self_in) {
         } else if (ret == DATA_HEADER) {
             cyw43_cb_process_ethernet(self->cb_data, len >> 31, len & 0x7fffffff, buf);
         } else {
-            CYW43_WARN("got unexpected packet %d\n", ret);
+           // CYW43_WARN("got unexpected packet %d\n", ret);
         }
     }
 }
peterharperuk commented 1 year ago

ok, so there are two issues. We probably shouldn't have any printf in an interrupt and when we send lots of data we seem to get lots of junk data. I might have to raise the last issue with cyw.

peterharperuk commented 1 year ago

Here's the lockup...

Thread 1 received signal SIGINT, Interrupt. mutex_enter_blocking (mtx=mtx@entry=0x20001a20 ) at /home/peterh/source/pico/pico-sdk/src/common/pico_sync/mutex.c:38 38 uint32_t save = spin_lock_blocking(mtx->core.spin_lock);

(gdb) bt

0 mutex_enter_blocking (mtx=mtx@entry=0x20001a20 )

at /home/peterh/source/pico/pico-sdk/src/common/pico_sync/mutex.c:38

1 0x1000b148 in stdout_serialize_begin () at /home/peterh/source/pico/pico-sdk/src/rp2_common/pico_stdio/stdio.c:52

2 0x1000b2d8 in __wrap_vprintf (format=0x10037228 "[CYW43] got unexpected packet %d\n", va=...)

at /home/peterh/source/pico/pico-sdk/src/rp2_common/pico_stdio/stdio.c:236

3 0x1000b320 in __wrap_printf (format=0x10037228 "[CYW43] got unexpected packet %d\n")

at /home/peterh/source/pico/pico-sdk/src/rp2_common/pico_stdio/stdio.c:265

4 0x10018410 in cyw43_ll_process_packets (self_in=self_in@entry=0x20004c78 )

at /home/peterh/source/pico/pico-sdk/lib/cyw43-driver/src/cyw43_ll.c:1160

5 0x10019706 in cyw43_poll_func () at /home/peterh/source/pico/pico-sdk/lib/cyw43-driver/src/cyw43_ctrl.c:236

6 0x10016280 in low_priority_irq_handler ()

at /home/peterh/source/pico/pico-sdk/src/rp2_common/pico_cyw43_arch/cyw43_arch_threadsafe_background.c:143

7

8 0x1000b4c8 in stdio_usb_in_chars (buf=0x20041b64 "\021", length=1)

at /home/peterh/source/pico/pico-sdk/lib/tinyusb/src/class/cdc/cdc_device.h:171

9 0x1000b0be in stdio_get_until (buf=buf@entry=0x20041b64 "\021", len=len@entry=1, until=...)

at /home/peterh/source/pico/pico-sdk/src/rp2_common/pico_stdio/stdio.c:132

10 0x1000b35a in getchar_timeout_us (timeout_us=timeout_us@entry=0)

at /home/peterh/source/pico/pico-sdk/src/common/pico_time/include/pico/time.h:102

11 0x10000810 in main () at /home/peterh/source/pico/fanpico/src/fanpico.c:326

(gdb) f 1

1 0x1000b148 in stdout_serialize_begin () at /home/peterh/source/pico/pico-sdk/src/rp2_common/pico_stdio/stdio.c:52

52 mutex_enter_blocking(&print_mutex);

(gdb) info local caller = 0 '\000' owner = 1

(gdb) info threads Id Target Id Frame

(gdb) t 2 [Switching to thread 2 (Thread 2)]

0 mutex_enter_blocking (mtx=mtx@entry=0x20003288 )

at /home/peterh/source/pico/pico-sdk/src/common/pico_sync/mutex.c:38

38 uint32_t save = spin_lock_blocking(mtx->core.spin_lock);

(gdb) bt

0 mutex_enter_blocking (mtx=mtx@entry=0x20003288 )

at /home/peterh/source/pico/pico-sdk/src/common/pico_sync/mutex.c:38

1 0x1000b556 in stdio_usb_out_chars (buf=0x20040df8 "[ 106.745013] core1: tick\n", length=27)

at /home/peterh/source/pico/pico-sdk/src/rp2_common/pico_stdio_usb/stdio_usb.c:61

2 0x1000b01a in stdio_out_chars_crlf (driver=driver@entry=0x20000f20 ,

s=s@entry=0x20040df8 "[   106.745013] core1: tick\n", len=28)
at /home/peterh/source/pico/pico-sdk/src/rp2_common/pico_stdio/stdio.c:84

3 0x1000b05a in stdio_stack_buffer_flush (buffer=buffer@entry=0x20040df4)

at /home/peterh/source/pico/pico-sdk/src/rp2_common/pico_stdio/stdio.c:221

4 0x1000b2f8 in __wrap_vprintf (format=, va=...)

at /home/peterh/source/pico/pico-sdk/src/rp2_common/pico_stdio/stdio.c:246

5 0x1000b320 in __wrap_printf (format=0x1003039c "[%6llu.%06llu] %s\n")

at /home/peterh/source/pico/pico-sdk/src/rp2_common/pico_stdio/stdio.c:265

6 0x10005b9a in log_msg (priority=priority@entry=7, format=0x1002e284 "core1: tick")

at /home/peterh/source/pico/fanpico/src/util.c:90

7 0x10000380 in core1_main () at /home/peterh/source/pico/fanpico/src/fanpico.c:195

8 0x1000e1f8 in core1_wrapper (entry=0x1000035d , stack_base=)

at /home/peterh/source/pico/pico-sdk/src/rp2_common/pico_multicore/multicore.c:89

9 0x0000017e in ?? ()

(gdb) f 1

1 0x1000b556 in stdio_usb_out_chars (buf=0x20040df8 "[ 106.745013] core1: tick\n", length=27)

at /home/peterh/source/pico/pico-sdk/src/rp2_common/pico_stdio_usb/stdio_usb.c:61

61 mutex_enter_blocking(&stdio_usb_mutex);

(gdb) info local last_avail_time = 0 owner = 0

So core 0 has taken stdio_usb_mutex when a wifi interrupt occurred on core 0 where it tries to print something and waits for print_mutex, which has been taken by core 1, which is then waiting for stdio_usb_mutex which is taken by core 0. Deadlock.

tjko commented 1 year ago

Thanks. I really need to get my debug environment working.

I had given up since had constant USB (bus) lockups, but that likely was due to the RP2040-E5 errata issue. But I finally have some Picos with RP2040-B2 so need to try again...