Open tjko opened 1 year ago
If DHCP server does not provide NTP server information... it should work to specify NTP server manually
SYS:WIFI:NTP 192.168.1.1
I don't much fancy setting up a "syslog server" to reproduce this. I'd hope that it would be reproducible with a simpler udp test? One thing you might want to consider is switch from using polling to "threadsafe background". This might be as simple as changing the initialisation and removing the poll. Work will be done in an interrupt as and when needed. It might be the polling that's causing the issue. Just an idea for something for you to try.
Thanks, I'll test if using the 'threadsafe background' mode makes difference.
This doesn't require syslog server, just some destination (sinkhole) that will 'accept' the UDP packets without generating ICMP (error) responses... If destination firewall 'drops' packets, that should do it... (since if target 'rejects' packets that should generate ICMP errors back to sender and presumably LwIP stack should detect the error and trigger some kind of error condition...)
Switching to 'threadsafe background' mode changes behavior, get steady stream of following errors from the driver:
[CYW43] got unexpected packet -9
[CYW43] got unexpected packet -9
[CYW43] got unexpected packet -9
[CYW43] got unexpected packet -9
[CYW43] got unexpected packet -9
[CYW43] got unexpected packet -9
[CYW43] got unexpected packet -9
Then eventually pico 'hangs' (serial console becomes unresponsive).
This was with following changes (I changed delay between UDP packets from 100ms to 10ms to get hang occur faster):
diff --git a/CMakeLists.txt b/CMakeLists.txt
index 7a0faab..d7e0226 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -107,7 +107,7 @@ if (PICO_CYW43_SUPPORTED) # set by PICO_BOARD=pico_w
src/httpd.c
)
target_link_libraries(fanpico PRIVATE
- pico_cyw43_arch_lwip_poll
+ pico_cyw43_arch_lwip_threadsafe_background
pico_lwip_sntp
pico_lwip_http
)
diff --git a/src/network.c b/src/network.c
index 3cb7bd1..a929dc1 100644
--- a/src/network.c
+++ b/src/network.c
@@ -196,10 +196,10 @@ void wifi_poll()
return;
- cyw43_arch_poll();
+ /* cyw43_arch_poll(); */
if (network_initialized) {
- if (time_passed(&test_t, 100)) {
+ if (time_passed(&test_t, 10)) {
syslog_msg(LOG_INFO, "test message: %llu", get_absolute_time());
}
}
Could it be that this ends up in same 'bus error' condition but with the interrupt code result is hang (?)
Try wrapping your call to syslog_msg in CYW43_THREAD_ENTER / CYW43_THREAD_EXIT, but yes, it might be that the chip is still unhappy and has stopped telling us there's work to do, although I'd still expect to see the bus error.
Later: Actually I see you calling cyw43_arch_lwip_begin/end - so you shouldn't need to call CYW43_THREAD_ENTER / CYW43_THREAD_EXIT.
I suggest you get your code working in debug, add -DCMAKE_BUILD_TYPE=Debug to the cmake command line. I've had a go at fixing the debug compile issues but may have made a mistake.
The lockup seems to be in printf (bad idea for us to print from an interrupt?), can you try commenting out the line, CYW43_WARN("got unexpected packet %d\n", ret); It looks bad that we seem to be getting junk data, but lets ignore that for now. With this change your code seems to run ok.
Yes, commenting out the warning seems to have fixed the hangs I was seeing. Thank you very much!
I have test running now over 3000 seconds, while usually it would hang within 300 seconds or so, I'll leave it running overnight, if it makes past 30000 seconds that should be more than enough proof that its fixed at least from my point of view...
I had gone down the rabbit hole thinking it was something I did on the second core, since disabling second core seemed to also 'fix' the hangs. This would seem to explain why I was seeing the hang happening on 'random' places in my code...
Change I made was:
diff --git a/src/cyw43_ll.c b/src/cyw43_ll.c
index d46dd68..03581fe 100644
--- a/src/cyw43_ll.c
+++ b/src/cyw43_ll.c
@@ -1211,7 +1211,7 @@ void cyw43_ll_process_packets(cyw43_ll_t *self_in) {
} else if (ret == DATA_HEADER) {
cyw43_cb_process_ethernet(self->cb_data, len >> 31, len & 0x7fffffff, buf);
} else {
- CYW43_WARN("got unexpected packet %d\n", ret);
+ // CYW43_WARN("got unexpected packet %d\n", ret);
}
}
}
ok, so there are two issues. We probably shouldn't have any printf in an interrupt and when we send lots of data we seem to get lots of junk data. I might have to raise the last issue with cyw.
Here's the lockup...
Thread 1 received signal SIGINT, Interrupt.
mutex_enter_blocking (mtx=mtx@entry=0x20001a20
(gdb) bt
at /home/peterh/source/pico/pico-sdk/src/common/pico_sync/mutex.c:38
at /home/peterh/source/pico/pico-sdk/src/rp2_common/pico_stdio/stdio.c:236
at /home/peterh/source/pico/pico-sdk/src/rp2_common/pico_stdio/stdio.c:265
at /home/peterh/source/pico/pico-sdk/lib/cyw43-driver/src/cyw43_ll.c:1160
at /home/peterh/source/pico/pico-sdk/src/rp2_common/pico_cyw43_arch/cyw43_arch_threadsafe_background.c:143
at /home/peterh/source/pico/pico-sdk/lib/tinyusb/src/class/cdc/cdc_device.h:171
at /home/peterh/source/pico/pico-sdk/src/rp2_common/pico_stdio/stdio.c:132
at /home/peterh/source/pico/pico-sdk/src/common/pico_time/include/pico/time.h:102
(gdb) f 1
52 mutex_enter_blocking(&print_mutex);
(gdb) info local caller = 0 '\000' owner = 1
(gdb) info threads Id Target Id Frame
(gdb) t 2 [Switching to thread 2 (Thread 2)]
at /home/peterh/source/pico/pico-sdk/src/common/pico_sync/mutex.c:38
38 uint32_t save = spin_lock_blocking(mtx->core.spin_lock);
(gdb) bt
at /home/peterh/source/pico/pico-sdk/src/common/pico_sync/mutex.c:38
at /home/peterh/source/pico/pico-sdk/src/rp2_common/pico_stdio_usb/stdio_usb.c:61
s=s@entry=0x20040df8 "[ 106.745013] core1: tick\n", len=28)
at /home/peterh/source/pico/pico-sdk/src/rp2_common/pico_stdio/stdio.c:84
at /home/peterh/source/pico/pico-sdk/src/rp2_common/pico_stdio/stdio.c:221
at /home/peterh/source/pico/pico-sdk/src/rp2_common/pico_stdio/stdio.c:246
at /home/peterh/source/pico/pico-sdk/src/rp2_common/pico_stdio/stdio.c:265
at /home/peterh/source/pico/fanpico/src/util.c:90
at /home/peterh/source/pico/pico-sdk/src/rp2_common/pico_multicore/multicore.c:89
(gdb) f 1
at /home/peterh/source/pico/pico-sdk/src/rp2_common/pico_stdio_usb/stdio_usb.c:61
61 mutex_enter_blocking(&stdio_usb_mutex);
(gdb) info local last_avail_time = 0 owner = 0
So core 0 has taken stdio_usb_mutex when a wifi interrupt occurred on core 0 where it tries to print something and waits for print_mutex, which has been taken by core 1, which is then waiting for stdio_usb_mutex which is taken by core 0. Deadlock.
Thanks. I really need to get my debug environment working.
I had given up since had constant USB (bus) lockups, but that likely was due to the RP2040-E5 errata issue. But I finally have some Picos with RP2040-B2 so need to try again...
When trying to do some (debug) logging over syslog I ran into issues CYW43 SoC (or driver?) apparently crashing. Symptoms "[CYW43] Bus error condition detected" error messages flooding console output, and radio stopping working (cannot ping the device anymore and cannot toggle LED on Pico W anymore either).
Problem appeared after sending moderate amount Syslog UDP packets (less than 10 packets/second). (If slowing down rate of UDP packets being sent issue seems to go away...)
Errors observed on Pico W console output:
More details here: https://forums.raspberrypi.com/viewtopic.php?t=340605
Issue seems reproducible with FanPico firmware on a standalone Pico W.
To reproduce checkout the "cyw43_buserror" branch: https://github.com/tjko/fanpico/tree/cyw43_buserror
After loading fanpico.uf2 (or fanpico.elf) to Pico W, connect to console (USB or TTL Serial). Firmware uses SCPI like command set, so check that unit responds by issuing SCPI command:
Then configure WiFi and syslog server (no need to configure syslog server if DHCP provides it) and save to flash:
Unit should now reboot and connect to WiFi network. For this to work DHCP would need to provide NTP server so that Pico can get time (for syslog messages) and there be a syslog server (or at least some device that doesn't generate any ICMP responses on received UDP packets on syslog port...)
I've observed CYW43 driver to start producing Bus errors anywhere from 100-600 seconds...
Attached is sample console log captured (where 'crash' seems to occur around 184 second mark).
picow-cyw43-crash.log