arduino-libraries / Ethernet

Ethernet Library for Arduino
http://arduino.cc/
254 stars 257 forks source link

Eternal loop in EthernetClass::socketSendUDP #78

Open fredilarsen opened 5 years ago

fredilarsen commented 5 years ago

Description

I have a protocol that uses UDP broadcast to let devices communicate. This works fine except for that from time to time a random device will lock up and stop responding permanently. This may take days or multiple weeks to happen but is nevertheless catastrophic.

Having searched for the cause of this by printouts and patience I found that it is always stuck in an eternal loop in socket.cpp, in the EthernetClass::socketSendUDP function. It is waiting for SEND_OK, and will also exit if it gets TIMEOUT, but in my case it gets a RECEIVE.

This sounds similar to the TCP issue where the W5x00 receives while trying to send: https://github.com/arduino-libraries/Ethernet/issues/17

Reproducing

I tried to speed up the process of triggering a freeze, and the attached program UDPCrash_nolibmod.ino does it within 1-30 minutes. Connect two Arduino Unos with a W5x00 based Ethernet shield each with USB to the PC, then start up two instances of Arduino IDE connected to one Arduino each, both with the serial monitor open at 115200bps. Flash the sketch to both Arduinos, modifying the MAC address for one of them before flashing.

Every 10 seconds each of the two Arduinos write a line telling how many packets it sends and receives. After some minutes one of the devices stop printing and the other will print 0 packets received from then on.

The program is trying to send a packet every millisecond, not a realistic scenario. But this is only to quickly catch the error that will occur after a much longer time in normal use. In my case with data exchange between multiple devices every 10 seconds.

Software environment

Arduino IDE 1.8.9 running on Windows 10

Hardware environment

Arduino Uno with the good version of the W5100 on-top netork shield, or Uno or Nano with the red small W5100 based network card or the small blue W5500 based network card.

Code to trigger the failure:

UDPCrash_nolibmod.ino

/*
 Based on Ethernet example UDPSendReceiveString.

 This sketch should be run in two or more instances connected to the same LAN using a 
 W5x00 based Ethernet shield each. 
 After some minutes / hundreds of thousands of packets, it ususally triggers a failure of
 the W5x00 and the traffic halts permananently until both the W5x00 and Arduino is reset
 simultanously.
 */

#include <Ethernet.h>
#include <EthernetUdp.h>

// Enter a MAC address and IP address for your controller below.
byte mac[] = { 0xDE, 0xAD, 0xBE, 0xEF, 0xFE, 0xE0 };

unsigned int localPort = 7000;      // local port to listen + send on

// A buffer for receiving data
char packetBuffer[UDP_TX_PACKET_MAX_SIZE];  // buffer to hold incoming packet,

// An EthernetUDP instance to let us send and receive packets over UDP
EthernetUDP Udp;

uint8_t _broadcast[4] = { 0xFF, 0xFF, 0xFF, 0xFF };

void setup() {
  Ethernet.begin(mac);

  // Open serial communications and wait for port to open:
  Serial.begin(115200);
  while (!Serial) ; // wait for serial port to connect. Needed for native USB port only
  Serial.println("Starting UDP crash test program.");

  // Check for Ethernet hardware present
  if (Ethernet.hardwareStatus() == EthernetNoHardware) {
    Serial.println("Ethernet shield was not found.  Sorry, can't run without hardware. :(");
    while (true) delay(1); // do nothing, no point running without Ethernet hardware
  }
  if (Ethernet.linkStatus() == LinkOFF) {
    Serial.println("Ethernet cable is not connected.");
  }
  Udp.begin(localPort);
}

uint32_t cntOut = 0, cntIn = 0, lastPrint = 0, lastSend = 0, failOut = 0, totalOut = 0, totalIn = 0;

void loop() {
  // Broadcast packet 
  if ((uint32_t)(millis()-lastSend) > 0) {
    lastSend = millis();
    Udp.beginPacket(_broadcast, localPort);
    const char *s = "Testing to send a relatively long string to see if it succeeds, or if it absolutely fails.";
    Udp.write(s, strlen(s));
    if (Udp.endPacket()) {
      cntOut++;
      totalOut++;
    } else failOut++;
  }

  // if there's data available, read a packet
  int packetSize = Udp.parsePacket();
  if (packetSize) {
    cntIn++;
    totalIn++;

    // read the packet into packetBufffer
    Udp.read(packetBuffer, UDP_TX_PACKET_MAX_SIZE);  
  }

  if ((uint32_t)(millis()-lastPrint) > 10000) {
    lastPrint = millis();
    Serial.print("TotOut "); Serial.print(totalOut); 
    Serial.print(", TotIn "); Serial.print(totalIn); 
    Serial.print(", NowOut "); Serial.print(cntOut); 
    Serial.print(", NowIn "); Serial.print(cntIn);
    Serial.print(", Failed "); Serial.println(failOut); 
    cntOut = cntIn = 0;
  }
}
fredilarsen commented 5 years ago

I added detection of the eternal loop in socket.cpp by using a loop counter, and let it return just like it does when getting a TIMEOUT. This works, and it survives the event.

BUT when this has happended a few times (I think about 4 times on the W5100), the Arduino continues running but no traffic goes out of or into the Ethernet card. So it seems some resource (a socket?) is not freed within the W5100.

This fix was not sufficient. I tried to clear SEND_OK just like when returning after a TIMEOUT, with no change.

The modified EthernetClass::socketSendUDP shown below, located at the end of socket.cpp, fixes the eternal loop / freeze but not the underlying problem in the W5x00. I did add the two global variables to be able to let other parts of the code observe the problem, without me changing the API (the functions in the library).

bool EthernetClass::socketSendUDP(uint8_t s)
{
    udp_send_error = 0;
    SPI.beginTransaction(SPI_ETHERNET_SETTINGS);
    W5100.execCmdSn(s, Sock_SEND);

    /* +2018.11 fil: introduced loop counter */
    uint32_t count = 0;
    uint8_t status = 0;
    while ( (W5100.readSnIR(s) & SnIR::SEND_OK) != SnIR::SEND_OK ) {
        count++;
        status = W5100.readSnIR(s);
        if (status & SnIR::TIMEOUT) {
            /* +2008.01 [bj]: clear interrupt */
            W5100.writeSnIR(s, (SnIR::SEND_OK|SnIR::TIMEOUT));
            SPI.endTransaction();
            return false;
        }
        /* +2018.11 fil */
        /* Caught a known bug with the W5x00. While waiting for SEND_OK or TIMEOUT,
           a RECV was returned because something arrived while trying to send,
           and SEND_OK or TIMEOUT is never received and the loop would never exit. */
        if (count > 100000) {
            socket_exit_count++;
            udp_send_error = 1;
            Serial.print("GOT STATUS "); Serial.println(status);            
            W5100.writeSnIR(s, SnIR::SEND_OK);
            SPI.endTransaction();
            return false;
        }

        SPI.endTransaction();
        yield();
        SPI.beginTransaction(SPI_ETHERNET_SETTINGS);
    }

    /* +2008.01 bj */
    W5100.writeSnIR(s, SnIR::SEND_OK);
    SPI.endTransaction();

    //Serial.printf("sendUDP ok\n");
    /* Sent ok */
    return true;
}
fredilarsen commented 5 years ago

I added some code to EthernetUDP::endPacket in EthernetUdp.cpp to catch the first instance of the problem to try to correct it. I tried to close the socket and then call begin(_port) to continue, without effect. Added a reset function to W5100.h/cpp that calls softReset and then forces a new init, but then Ethernet settings were forgotten and the sketch had to do a Ethernet.begin and Udp.begin again, and also be aware of the problem.

I ended up with this change in EthernetUdp.cpp, letting it catch the error, close the socket, reset the W5100 without forgetting settings, then start with the same port again:

// Variable defined and set in socket.cpp
extern uint8_t udp_send_error;

int EthernetUDP::endPacket()
{
    bool result = Ethernet.socketSendUDP(sockindex);
    if (!result && udp_send_error) {
        /* +2018.11 fil */
        // Caught a known bug with the W5x00. While waiting for SEND_OK or TIMEOUT,
        // a RECV was returned because something arrived while trying to send,
        // and SEND_OK or TIMEOUT is never received.
        // After this happens a very few times, the W5x00 is left inoperable.
        // Try to compensate when this happens by closing, resetting and calling 
        // begin again with the same port.
        stop();
        W5100.reset();
        begin(_port);
    }
    return result;
}

This function I did add to W5100 to force a reset but keep the essential settings:

uint8_t W5100Class::reset(void) {
    // Get the current configuration
    uint8_t gw[4], sn[4], mac[8], ip[4];
    W5100.getGatewayIp(gw);
    W5100.getSubnetMask(sn);
    W5100.getMACAddress(mac);
    W5100.getIPAddress(ip);

    // Clear init flag and do a reset and init
    initialized = false;
    softReset();
    bool initResponse = init();

    // Put back the configuration
    W5100.setGatewayIp(gw);
    W5100.setSubnetMask(sn);
    W5100.setMACAddress(mac);
    W5100.setIPAddress(ip);
}
fredilarsen commented 5 years ago

I can confirm that all the changes above together detects and repairs the issue. The test program now runs fine, showing a growing number of failures without other effects than a temporary dip in the transfer speed when it happens. No changes in user sketches are needed to fix the problem.

I could send a pull request for this, but I am not sure if you like the approach of adding the two global variables to avoid having to change parameters to/from existing functions (EthernetClass::socketSendUDP would have to have a second parameter udp_send_error it could modify, or it could be added as a member of the Ethernet class).

Please advise. Should I send a pull request as-is (like above) or do you prefer them as members of the EthernetClass or as function parameters?

If you know of some way to get the W5x00 out of the error state, other than the full reset+init, it would be great. I did not find any, but my experience with the W5x00 is not that deep.

gioblu commented 5 years ago

Ciao @fredilarsen I confirm I was able to replicate this issue. I see this as a potential attack vector specially if available in the public domain, anyone with bad intentions can forcefully disconnect / render useless the wiznet chip until manual or software reset using just flooding and some patience until it freezes (in few minutes), probably it is better to fix this as soon as possible.

fredilarsen commented 5 years ago

The above example to reproduce this is using UDP broadcast. I can add that this happens also when not using broadcast but sending directly to specific IPs.

fredilarsen commented 5 years ago

My automation system with the fix has been running for 72 days continuously now, with an average packet count of about 300 packets/minute. This would lock up within hours to a few days previously.

The fix can be found here: https://github.com/fredilarsen/Ethernet

gert3d commented 5 years ago

I am using a W5500 lite on a Arduino Due to retrieve the NTP Unix time. If the module is not attached, the function Udp.endPacket() runs forever. I hoped your fix would also have solved this, but it does not. Does this make sense ? thanks

fredilarsen commented 5 years ago

I am using a W5500 lite on a Arduino Due to retrieve the NTP Unix time. If the module is not attached, the function Udp.endPacket() runs forever. I hoped your fix would also have solved this, but it does not. Does this make sense ? thanks

It may be locked somewhere else if the card is not attached, I have not investigated this. Is this a problem that could occur outside of test setups (why would the network card be detached)?

The method I used for locating the other eternal loop was to add Serial.print+flush statements to the library code to trace how far it got and what return values were received.

bleckers commented 4 years ago

I can confirm that this happens in UDP and TCP land and the changes by @fredilarsen (as appropriate in the TCP side too) fix this. I was having frequent lockups which would trigger a watchdog and this solved it.

gioblu commented 4 years ago

Sorry if I in part repeat what I have already stated above, I can also confirm the occurrence of this when using UDP, not sure about TCP. I can confirm that the impressive work of @fredilarsen fixes the issue (I have tested only UDP).

This is quite critical, for any practical use case of the library and the shield, considering it can crash arbitrarily after some time. This issue is open since november 2018, I hope someone will fix it and I say so because anyone aware of this can use it as an attack vector and remotely crash a target device with just bare flooding.

bleckers commented 4 years ago

It might be worthwhile to put this over at https://github.com/PaulStoffregen/Ethernet . @PaulStoffregen mentioned he no longer has write access to this repo and to flag issues on his "Fork" (considering this repo is originally his Fork).

There's a few other places that get stuck (like flush() in EthernetClient) if there is network, but no connection to the server.

This fixes flush:

void EthernetClient::flush()
{
    unsigned long start = millis();
    while (sockindex < MAX_SOCK_NUM) {
        if (millis() - start > _timeout) 
        {
            stop();
            return;
        }
        uint8_t stat = Ethernet.socketStatus(sockindex);
        if (stat != SnSR::ESTABLISHED && stat != SnSR::CLOSE_WAIT) return;
        if (Ethernet.socketSendAvailable(sockindex) >= W5100.SSIZE) return;
    }
}

I have PRed this fix into Paul's repo.

I can also confirm this does get into an infinite loop too:

int EthernetUDP::parsePacket()
{
    // discard any remaining bytes in the last packet
    while (_remaining) {
        // could this fail (loop endlessly) if _remaining > 0 and recv in read fails?
        // should only occur if recv fails after telling us the data is there, lets
        // hope the w5100 always behaves :)
        read((uint8_t *)NULL, _remaining);
    }

Dealing with this is going to be application specific however, because what do you return? Make changes as appropriate.

fredilarsen commented 4 years ago

@bleckers Thanks for the information that Paul does not have write access to this repo any more.

I created a new ticket in the new repo, pointing back here for details: https://github.com/PaulStoffregen/Ethernet/issues/46