earlephilhower / arduino-pico

Raspberry Pi Pico Arduino core, for all RP2040 and RP2350 boards
GNU Lesser General Public License v2.1
2.02k stars 419 forks source link

WiFiServer and FreeRTOS - Repeatedly accepting connection eventually causes freeze #2132

Closed JeronimusII closed 5 months ago

JeronimusII commented 5 months ago

When using the WiFiServer while using FreeRTOS to repeatedly accept a new connection it eventually at a seemingly random time (usually 1-10 minutes with a new connection every 3 seconds) to hang up both cores right after a new incoming connection (even before calling WiFiServer::accept()).

By pausing using a debugger and checking the call stack I figured that it is stuck in an infinite loop by FreeRTOS due to a fatal error allocating memory in the WiFiServer::_accept() method at line 197: https://github.com/earlephilhower/arduino-pico/blob/a49bcd4a9535bc9615ce46b67e91d50d501c07d0/libraries/WiFi/src/WiFiServer.cpp#L197

Which is caused by an assertion in the xQueueSemaphoreTake() method of FreeRTOS that expects the task scheduler not to be suspended: https://github.com/FreeRTOS/FreeRTOS-Kernel/blob/8e07366994f81354a2d4556ca1da9f73dab781e6/queue.c#L1672-L1677

The task scheduler seems to be suspended because the IRQ happened in the brief period the vTaskDelay() method called by some USB loop suspends the task scheduler.

I am running the program on a Raspberry Pi Pico W. I tried it with a second board in case I might have fried something.

Here is the source of program that can reproduce the issue: (Freeze happens during the delayMicroseconds() at the end of loop())

#include "FreeRTOS.h"
#include <WiFi.h>

#define WIFI_SSID "ssid"
#define WIFI_PASS "password"

WiFiServer server(5000);

void setup() {
    Serial.begin(115200);

    WiFi.mode(WIFI_STA);

    while (!Serial) {}
    delay(500);
    Serial.println("Start");

    uint8_t status = WiFi.begin(WIFI_SSID, WIFI_PASS);
    while (status == WL_DISCONNECTED) {
        delayMicroseconds(500000);
        status = WiFi.status();
    }
    if (status == WL_CONNECTED) {
        Serial.printf(
            "Connected\n"
            "   SSID: %s\n"
            "   IP Address: %s\n"
            "   BSSID: ",
            WiFi.SSID().c_str(),
            WiFi.localIP().toString().c_str()
        );
        uint8_t bssid[6]; 
        WiFi.BSSID(bssid);
        for (uint8_t i = 0; i < 6; i++) {
            if (i > 0) {
                Serial.print(":");
            }
            Serial.printf("%02hhX", bssid[i]);
        }
        Serial.print("\n");

        server.begin();
    }
    else {
        Serial.println("Failed to connect!");
        while (true) {
            delay(1000);
        }
    }
}

void loop() {
    WiFiClient client = server.accept();
    if (client) {
        Serial.println("Client Connection");
        client.stop();
    }
    delayMicroseconds(500000);
}

And here is the full call stack:

sleep_until@0x1000d02e (\home\earle\Arduino\hardware\pico\rp2040\pico-sdk\src\common\pico_time\time.c:398)
sleep_us@0x1000d08c (\home\earle\Arduino\hardware\pico\rp2040\pico-sdk\src\common\pico_time\time.c:414)
sleep_ms@0x1000d0a6 (\home\earle\Arduino\hardware\pico\rp2040\pico-sdk\src\common\pico_time\time.c:430)
rtosFatalError@0x10003630 (c:\Users\Jeronimus\AppData\Local\Arduino15\packages\rp2040\hardware\rp2040\3.7.2\libraries\FreeRTOS\src\variantHooks.cpp:339)
xQueueSemaphoreTake@0x100041ca (c:\users\jeronimus\appdata\local\arduino15\packages\rp2040\hardware\rp2040\3.7.2\libraries\freertos\lib\freertos-kernel\queue.c:1486)
xQueueTakeMutexRecursive@0x10004312 (c:\users\jeronimus\appdata\local\arduino15\packages\rp2040\hardware\rp2040\3.7.2\libraries\freertos\lib\freertos-kernel\queue.c:695)
__freertos_recursive_mutex_take@0x10003466 (c:\Users\Jeronimus\AppData\Local\Arduino15\packages\rp2040\hardware\rp2040\3.7.2\libraries\FreeRTOS\src\variantHooks.cpp:80)
__retarget_lock_acquire_recursive@0x1000aabe (c:\Users\Jeronimus\AppData\Local\Arduino15\packages\rp2040\hardware\rp2040\3.7.2\cores\rp2040\lock.cpp:150)
__malloc_lock@0x1001e148 (\workdir\repo\newlib\newlib\libc\stdlib\mlock.c:44)
_malloc_r@0x1001dd80 (\workdir\repo\newlib\newlib\libc\stdlib\mallocr.c:2362)
malloc@0x1001dcfe (\workdir\repo\newlib\newlib\libc\stdlib\malloc.c:164)
__wrap_malloc@0x1000b87a (c:\Users\Jeronimus\AppData\Local\Arduino15\packages\rp2040\hardware\rp2040\3.7.2\cores\rp2040\malloc-lock.cpp:30)
operator new@0x1000eb96 (\home\earle\Arduino\hardware\pico\rp2040\pico-sdk\src\rp2_common\pico_standard_link\new_delete.cpp:15)
WiFiServer::_accept@0x1000789c (c:\Users\Jeronimus\AppData\Local\Arduino15\packages\rp2040\hardware\rp2040\3.7.2\libraries\WiFi\src\WiFiServer.cpp:197)
WiFiServer::_s_accept@0x100078d6 (c:\Users\Jeronimus\AppData\Local\Arduino15\packages\rp2040\hardware\rp2040\3.7.2\libraries\WiFi\src\WiFiServer.cpp:218)
tcp_process@0x10015f1a (\home\earle\Arduino\hardware\pico\rp2040\pico-sdk\lib\lwip\src\core\tcp_in.c:956)
tcp_input@0x10015f1a (\home\earle\Arduino\hardware\pico\rp2040\pico-sdk\lib\lwip\src\core\tcp_in.c:438)
ip4_input@0x1001a360 (\home\earle\Arduino\hardware\pico\rp2040\pico-sdk\lib\lwip\src\core\ipv4\ip4.c:743)
ethernet_input@0x1001a7d2 (\home\earle\Arduino\hardware\pico\rp2040\pico-sdk\lib\lwip\src\netif\ethernet.c:186)
__wrap_cyw43_cb_process_ethernet@0x10023d22 (c:\Users\Jeronimus\AppData\Local\Arduino15\packages\rp2040\hardware\rp2040\3.7.2\cores\rp2040\cyw43_wrappers.cpp:56)
cyw43_ll_process_packets@0x1001b26c (\home\earle\Arduino\hardware\pico\rp2040\pico-sdk\lib\cyw43-driver\src\cyw43_ll.c:1169)
cyw43_poll_func@0x10011370 (\home\earle\Arduino\hardware\pico\rp2040\pico-sdk\lib\cyw43-driver\src\cyw43_ctrl.c:238)
cyw43_poll_func@0x10011370 (\home\earle\Arduino\hardware\pico\rp2040\pico-sdk\lib\cyw43-driver\src\cyw43_ctrl.c:219)
cyw43_do_poll@0x100119ca (\home\earle\Arduino\hardware\pico\rp2040\pico-sdk\src\rp2_common\pico_cyw43_driver\cyw43_driver.c:90)
async_context_base_execute_once@0x10012312 (\home\earle\Arduino\hardware\pico\rp2040\pico-sdk\src\rp2_common\pico_async_context\async_context_base.c:101)
process_under_lock@0x10011ee4 (\home\earle\Arduino\hardware\pico\rp2040\pico-sdk\src\rp2_common\pico_async_context\async_context_threadsafe_background.c:251)
low_priority_irq_handler@0x10012046 (\home\earle\Arduino\hardware\pico\rp2040\pico-sdk\src\rp2_common\pico_async_context\async_context_threadsafe_background.c:293)
<signal handler called>@0xfffffffd (Unknown Source:0)
vTaskSuspendAll@0x10004d34 (c:\users\jeronimus\appdata\local\arduino15\packages\rp2040\hardware\rp2040\3.7.2\libraries\freertos\lib\freertos-kernel\tasks.c:2938)
vTaskDelay@0x10005842 (c:\users\jeronimus\appdata\local\arduino15\packages\rp2040\hardware\rp2040\3.7.2\libraries\freertos\lib\freertos-kernel\tasks.c:1933)
__usb@0x10003710 (c:\Users\Jeronimus\AppData\Local\Arduino15\packages\rp2040\hardware\rp2040\3.7.2\libraries\FreeRTOS\src\variantHooks.cpp:480)
vPortStartFirstTask@0x10003778 (c:\users\jeronimus\appdata\local\arduino15\packages\rp2040\hardware\rp2040\3.7.2\libraries\freertos\lib\freertos-kernel\portable\thirdparty\gcc\rp2040\port.c:208)
earlephilhower commented 5 months ago

Thanks for the MCVE and the crash log.

This looks like a general issue with the task manager when tasks are suspended to do FreeRTOS housekeeping work (i.e. in this case to do a delay and put the task onto the queue for wake up at the right time). Basically, during that housekeeping with task switching disabled core0 got a WiFi packet interrupt which lead to a malloc which needs to grab the malloc lock and causes the crash.

In tasks.c I think I need to disable IRQs during those times when taskSuspendAll is in force.

Unfortunately, I don't seem to be able to get the MCVE to crash. I've had it in the background for 10 mins so far, with a new connection coming in every 1 second (while true; do telnet pico.io 5000; sleep 1; done).

So I am not sure I would be able to test any changes. But I should have a PR in a few minutes if you would be able to try it out., @JeronimusII , and report back.

earlephilhower commented 5 months ago

After 20 mins I did get a hang using master and while true; do telnet 192.168.1.200 5000; sleep 3; done. Using the PR #2155 I've run for 40 mins without any issue, so I think we're good. I'll let it run a while longer and then do the merge.

earlephilhower commented 5 months ago

It's run for >2hrs now with the change so I've done a merge and this will be in the next release.

JeronimusII commented 5 months ago

After 5 hours its still running fine so the issue seems to be fixed on my side too. @earlephilhower Thanks for the help!