arduino / ArduinoCore-mbed

347 stars 202 forks source link

[BUG] DHCP Ethernet crashes after repeated switching of network interface [OPTA/Portenta H7] #891

Closed Channel59 closed 4 months ago

Channel59 commented 5 months ago

I experience some issues with the network interface of the OPTA (probably also the portenta H7 suffers from this problem).

I am not sure if only the OPTA is affected, and whether or not this issue is OPTA related, arduinoCore-mbed related or really mbed-os related, or a combination of those.

Setup:

PC with multiple eth interface, Opta directly connected to eth0. The PC has a static IP and runs a DHCP server. ArduinoCore-mbed version: 4.0.10

Issue:

If from the PC the interface eth0 is brought up and down multiple times, the OPTA sees this and the LAN light turn off when the interface is down and turn back on when the interface is up. If the OPTA is configured to use a static IP this gives no issues. However, when DHCP is enabled, the OPTA seems to be unable to reconnect to the network. It cannot be pinged either. Also, the network stack of the OPTA does not seem to notice this, because the state will actually become NSAPI_STATUS_GLOBAL_UP, but it does not respond to pings.

The same behavior can also be reproduced by repeatedly pulling the inserting the ethernet cable in the RJ-45 port.

With a static IP this seems not to be an issue.

Recovering from this state

It seems that the only way to recover from this state is by resetting the OPTA, which is undesirable. It seems that removing and re-inserting the ethernet cable can also recover the OPTA from this state.

Code on the OPTA:

This issue happened when using Arduino_ConnectionHandler, just the PortentaEthernet class from the arduinocore, and it also happen when I just use the mbed EthernetInterface.

The simplest way to reproduce the problem is with this code:

#include <Arduino.h>
#include "EthernetInterface.h"

auto eth = EthernetInterface::get_default_instance();

void status_callback(nsapi_event_t status, intptr_t param) ;

void setup() {
  eth->set_blocking(true);
  eth->attach(&status_callback);
  eth->set_dhcp(true);
  eth->connect();
  eth->set_blocking(false);  
}

void status_callback(nsapi_event_t status, intptr_t param) {
    Serial.println("Connection status changed!");
    switch (param) {
        case NSAPI_STATUS_LOCAL_UP:
            Serial.println("Local IP address set!");
            break;
        case NSAPI_STATUS_GLOBAL_UP:
            Serial.println("Global IP address set!");
            break;
        case NSAPI_STATUS_DISCONNECTED:
            Serial.println("No connection to network!");
            break;
        case NSAPI_STATUS_CONNECTING:
            Serial.println("Connecting to network!");
            break;
        default:
            Serial.println("Not supported");
            break;
    }
}

The script to break the opta:

---break-opta.sh---
#!/bin/bash
for j in {1..5}
do
    ifconfig eth0 down
    sleep 2
    ifconfig eth0 up
    sleep 2
done
if ping -c 1 192.168.0.100
then
    echo "."
else
    echo "X"
    exit 1
fi

Outputs:

First we ping the opta to check if it has gotten a DHCP lease:

$ ping 192.168.0.100
PING 192.168.0.100 (192.168.0.100) 56(84) bytes of data.
64 bytes from 192.168.0.100: icmp_seq=1 ttl=255 time=341 ms
64 bytes from 192.168.0.100: icmp_seq=2 ttl=255 time=0.195 ms
64 bytes from 192.168.0.100: icmp_seq=3 ttl=255 time=0.172 ms
^C
--- 192.168.0.100 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2019ms
rtt min/avg/max/mdev = 0.172/113.680/340.674/160.508 ms

This seems normal. Then we run the break-opta.sh script.

$ sudo ./break-opta-dhcp.sh 
PING 192.168.0.100 (192.168.0.100) 56(84) bytes of data.
From 192.168.0.1 icmp_seq=1 Destination Host Unreachable

--- 192.168.0.100 ping statistics ---
1 packets transmitted, 0 received, +1 errors, 100% packet loss, time 0ms
X

The opta cannot be reached anymore. If we look at the serial output from the opta, it seems that everything is fine:

Connection status changed!
Connecting to network!
Connection status changed!
Global IP address set!
Connection status changed!
Connecting to network!
Connection status changed!
Global IP address set!
Connection status changed!
Connecting to network!
Connection status changed!
Global IP address set!
Connection status changed!
Connecting to network!
Connection status changed!
Global IP address set!
Connection status changed!
Connecting to network!
Connection status changed!
Global IP address set!

A reset of the opta seems to be the only way to recover from this state.

Attempts to recover without resetting

Several things I have tried (behind a button push in the loop() function):

eth->set_blocking(true);
eth->disconnect();
eth->set_dhcp(false);
eth->set_dhcp(true);
eth->disconnect();

Furthermore, I tried to build the project with the lwip headers included, by adding these values to variants/OPTA/includes.txt:

-iwithprefixbefore/mbed/connectivity/lwipstack
-iwithprefixbefore/mbed/connectivity/lwipstack/include
-iwithprefixbefore/mbed/connectivity/lwipstack/include/lwipstack
-iwithprefixbefore/mbed/connectivity/lwipstack/lwip
-iwithprefixbefore/mbed/connectivity/lwipstack/lwip-sys
-iwithprefixbefore/mbed/connectivity/lwipstack/lwip-sys/arch
-iwithprefixbefore/mbed/connectivity/lwipstack/lwip/src
-iwithprefixbefore/mbed/connectivity/lwipstack/lwip/src/include
-iwithprefixbefore/mbed/connectivity/lwipstack/lwip/src/include/compat
-iwithprefixbefore/mbed/connectivity/lwipstack/lwip/src/include/compat/posix
-iwithprefixbefore/mbed/connectivity/lwipstack/lwip/src/include/compat/posix/arpa
-iwithprefixbefore/mbed/connectivity/lwipstack/lwip/src/include/compat/posix/net
-iwithprefixbefore/mbed/connectivity/lwipstack/lwip/src/include/compat/posix/sys
-iwithprefixbefore/mbed/connectivity/lwipstack/lwip/src/include/lwip
-iwithprefixbefore/mbed/connectivity/lwipstack/lwip/src/include/lwip/priv
-iwithprefixbefore/mbed/connectivity/lwipstack/lwip/src/include/lwip/prot
-iwithprefixbefore/mbed/connectivity/lwipstack/lwip/src/include/netif

This allowed me to access the functions, calling them after the button push. None of these seemed to be able to recover the network connection.

struct netif *find_netif() {
    struct netif *netif = netif_list;
    while (netif) {
        if (netif->flags & NETIF_FLAG_ETHERNET) {
            return netif;
        }
        netif = netif->next;
    }
    return NULL;
}
struct netif *netif = find_netif();

dhcp_network_changed(netif);
sys_check_timeouts();
dhcp_renew(netif);

None of these things seem to be able to restore the network status.

Does anyone have an idea as to how to fix this?

maidnl commented 4 months ago

@Channel59 thank you for submitting this problem. The patch should solve it. I have just a couple of remarks about the way you replicate the problem:

  1. your DHCP server must be configured in a way that the OPTA gets always the same ip address (so that at a certain MAC address is always associated the same ip address). Otherwise OPTA could be assigned a different ip address and you so fail the ping because you are using the wrong IP address (the old one).
  2. DHCP assign address could take a bit so it is maybe useful to wait just a little after the interface goes up, before to ping again the device Thanks!
Channel59 commented 4 months ago

Dear @maidnl,

Thank you very much for taking the time to investigate this problem. I agree with your remarks regarding the replication of the problem.

I have tested your patch and it seems like the issue is partly resolved. After applying this patch on ArduinoCore-mbed 4.0.10 and the mbed-os-arduino 4.0.10, it seems that the network interface of the opta does not get stuck after the first iteration of the shell script to test the Opta.

However, the opta can still become unresponsive quite easily, using e.g. this code:

#include <Arduino.h>
#include "EthernetInterface.h"
#include "mbed.h"
auto net = EthernetInterface::get_default_instance();

void setup(){
    net->set_blocking(true);
    net->set_dhcp(true);
    net->connect();
}

void loop() {
    rtos::ThisThread::sleep_for(1000);
}

and by simply using this commands (or unplugging the ethernet cable for 3 seconds and re-plugging it):

sudo ifconfig eth0 down && sleep 3 && sudo ifconfig eth0 up

This is already enough to break it, however re-running this will also restore the connecition

With static IP there is no issue it seems.

or with this more elaborate script, including your wait time:

#!/bin/bash

opta=192.168.0.100
eth=eth0

for i in {1..2000}
do
        for j in {1..5}
        do
                t_down=$((500 + $RANDOM % 5000))
                t_up=$((1000 + $RANDOM % 2000))
                echo -ne "_"
                ifconfig $eth down
                ./usleep $(($t_down * 1000))
                echo -ne "\b-"
                ifconfig $eth up
                ./usleep $(($t_up * 1000))
                echo -ne "\b."
        done

        if (( $t_up<3000 )); then
                t_wait=$((3000 - t_up))
                ./usleep $(($t_wait * 1000))
        fi
    sleep 2
        if ping -c 1 $opta &> /dev/null
        then
                echo -n "|"
        else
                echo "X"
                exit 1
        fi
done

(Use e.g. this for usleep: https://github.com/pklaus/usleep-binary)

In this case, another ethernet interface down/up from the host seems to resolve the problem and the opta starts sending DHCP requests again and obtains an ip.

It seems there is still something stange going on with the DHCP.

As a side-note, that may or may not be related, I have done some more testing of the Opta in the context of ethernet disconnect/reconnects, and have found an issue that is seems to be related UDP multicast (see:https://github.com/arduino/ArduinoCore-mbed/issues/904). I assume the DHCP uses somewhere under the hood also some UDP multicast socket, and thus these issues might be related / the same.

Channel59 commented 4 months ago

I have done a little bit of testing and it seems that this issue and also https://github.com/arduino/ArduinoCore-mbed/issues/904 are resolved with https://github.com/arduino/ArduinoCore-mbed/commit/f97304bffff1852de6ebf10ccbb941f431b70b16.

maidnl commented 4 months ago

@Channel59 Thank you for test it, we hoped that this would fix this problem.

This issue is then closes thanks to commit 97304bffff1852de6ebf10ccbb941f431b70b16