arduino-libraries / Ethernet

Ethernet Library for Arduino
http://arduino.cc/
252 stars 255 forks source link

W5100 Erratum 1 patch #166

Open CONTROLLINO-Support opened 2 years ago

CONTROLLINO-Support commented 2 years ago

Hello,

there is know W5100 erratum which causes that socket.cpp -> EthernetClass::socketSendUDP() never returns. It happens during asynchronous bi-directional UDP communication when one packet is being received and another wants to be transmitted at the same time.

W5100 Errata

We have prepared a patch that implements additional timeout control above this loop. It allows the main loop to react and re-initialize the Ethernet.

bool EthernetClass::socketSendUDP(uint8_t s)
{
    SPI.beginTransaction(SPI_ETHERNET_SETTINGS);
    W5100.execCmdSn(s, Sock_SEND);

    // bug fix WizNet 5100 Erratum 1
    unsigned long timer = millis();
    unsigned long timeout = 3000;

    /* +2008.01 bj */
    while ( (W5100.readSnIR(s) & SnIR::SEND_OK) != SnIR::SEND_OK ) {
        if (W5100.readSnIR(s) & SnIR::TIMEOUT) {
            /* +2008.01 [bj]: clear interrupt */
            W5100.writeSnIR(s, (SnIR::SEND_OK|SnIR::TIMEOUT));
            SPI.endTransaction();
            Serial.println("sendUDP timeout\n");
            return false;
        }

        // bug fix WizNet 5100 Erratum 1
        if ((millis() - timer) >= timeout) {
            SPI.endTransaction();
            Serial.println("Erratum 1 WizNet bug");
            return false;
        }

        SPI.endTransaction();
        yield();
        SPI.beginTransaction(SPI_ETHERNET_SETTINGS);
    }

    /* +2008.01 bj */
    W5100.writeSnIR(s, SnIR::SEND_OK);
    SPI.endTransaction();

    //Serial.println("ok\n");
    /* Sent ok */
    return true;
}

Would it be possible to add some kind of a patch to your library, please?

Thanks, Lukas

vale1410 commented 2 years ago

For time critical applications the timeout of 3000ms can be too long. It would be important to understand under which circumstances this value can be reduced.

Our experiments show that also 150ms is fine. However, we only have a very simple network topology with a one-to-one ethernet connection and the data packages are very small.

grefab commented 2 years ago

We tried different timeouts and approaches and were able to reduce network failure rate a bit, but we never achieved a stable UDP communication over multiple days due to the problem mentioned above. In the end we realized that our erratum mitigation code became more complex than the switch from UDP to TCP would be, so we'll go with that now. We're aware that hardware defects are hard to mitigate and even harder to solve. Looks like we are the only ones facing this issue because the reaction on this ticket is sparse.

tl;dr: Our consequence is that we do not rely on UDP on W3150A+/W5100 hardware anymore. If this issue gets fixed somehow, we are happy to revisit it.

PaulStoffregen commented 2 years ago

Is there any sort of test case to reliably reproduce this bug?

grefab commented 2 years ago

When switching to TCP we kept the asynchronous protocol we designed for UDP and are still experiencing problems. The protocol roughly works like this:

Now we experience similar problems we had when using UDP:

We verified this on Windows and Linux machines.

Is it possible that the erratum not only holds for UDP, but for asynchronous communication in general? I.e. there's a rare but possible problem when sending and receiving happens at the same time? The erratum text mentions IP-raw mode, not sure if that has impact on TCP as well. If so, does this have consequences for the patch discussed in this issue?

We'll switch to a request-based communication protocol and test again. Then there's nothing happening simultaneously and we can evaluate if this improves communication stability.

CONTROLLINO-Support commented 2 years ago

Is there any sort of test case to reliably reproduce this bug?

Hi Paul,

Our test case is following:

When the Python is sending as fast as possible the test is frozen after several seconds.

When there is added 10 ms delay between the packets it seems to be stable.

Source codes are attached.

Best regards, Lukas ArduinoUDPErrorTest.zip PythonUDPTest.zip

grefab commented 2 years ago

We'll switch to a request-based communication protocol and test again. Then there's nothing happening simultaneously and we can evaluate if this improves communication stability.

We did that and tests are running on multiple machines for 48h now. We did not have any issues since the socket is used only in one direction at a time.

stephanbrunner commented 1 year ago

This commit https://github.com/tryone144/Ethernet/commit/176da0f21716fa8b92f0e4bdb96240105cfee515 solved the issue for me.

Testsetup: 8 Terminals requesting data from W5100 with full speed (maybe 10-100 requests/s each)

master branch: crashes after around 1 hour (or in our installations a couple of times per year)

with these changes: running for more than 12 hours now with no issues