OE4T / meta-tegra

BSP layer for NVIDIA Jetson platforms, based on L4T
MIT License

Jetson TX2 Warrior IPv4 networking Issues #146

Closed dwalkes closed 5 years ago

dwalkes commented 5 years ago

Hi Everyone, I’m experiencing an issue with IPv4 networking on my meta-tegra build for Jetson TX2, warrior branch and I was wondering if anyone can provide any suggestions about the best troubleshooting steps.

I’ve found that when I build a combined mender image using this project and then boot up, I can access IPv4 network addresses for somewhere between 1 and 10 minutes, then I lose all IPv4 connectivity. Any attempt to access the network returns “Network is unreachable” or a similar error message. Here’s the script I’ve been using to reproduce it:

#!/bin/bash
# Ping ADDRESS once per second and stop at the first failed ping.
CONTINUE=1
STARTTIME=$(date)
#ADDRESS="fe80:0:0:0:250:f1ff:fe80:0"
ADDRESS="192.168.1.1"
echo "ping test started at $STARTTIME pinging address ${ADDRESS}"
while [ "$CONTINUE" -eq 1 ]; do
        # -c 1: send a single ping; -W 1: wait at most one second for the reply
        if ! ping -c 1 -W 1 "${ADDRESS}" > /dev/null; then
                CONTINUE=0
                FAILTIME=$(date)
                echo "Ping test Failed at $FAILTIME"
        else
                sleep 1
        fi
done

This script fails after running for somewhere between 1 and 15 minutes. After the failure, the only reliable way to get an IPv4 connection again is to reboot. In some cases dropping and re-upping the link brings it back, and in some cases ethtool manipulation (I’ve used ethtool -s eth0 speed 100 duplex full autoneg off) brings it back, but neither works every time.
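
To compare runs (for example with and without a candidate workaround), it can help to record how long connectivity lasted rather than only the failure time. The following is a minimal sketch and not part of the original script: run_until_fail is a hypothetical helper name, and the ping invocation in the usage comment simply reuses the one from the script above.

```shell
#!/bin/sh
# Sketch: run a probe command once per interval until it fails, then print
# the number of seconds that elapsed before the first failure.
# run_until_fail is a hypothetical helper, not something from meta-tegra.
run_until_fail() (
    interval="$1"; shift            # seconds to sleep between probes
    start=$(date +%s)
    while "$@" > /dev/null 2>&1; do # remaining arguments are the probe command
        sleep "$interval"
    done
    echo $(( $(date +%s) - start ))
)

# Usage on the target (address taken from the script above):
#   lasted=$(run_until_fail 1 ping -c 1 -W 1 192.168.1.1)
#   echo "IPv4 reachable for ${lasted}s"
```

Because the probe command is passed in as arguments, the same loop can be pointed at an IPv6 address or a different tool without editing the function.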

I don’t see any helpful or obviously suspicious messages in journalctl or /var/log/messages but I did see this message in a few instances:

[  191.898027] A link change request failed with some changes committed already. Interface eth0 may have been left with an inconsistent configuration, please check.

I’ve reproduced this behavior with the mender layer removed, using just meta-tegra + poky, although in the default configuration I need to add this content to /etc/systemd/network/eth.network to get an IPv4 address:

[Match]
Name=eth*

[Network]
DHCP=v4

[DHCPv4]
UseHostname=false

And then restart networking with systemctl restart systemd-networkd.service

Unless I’m confusing myself (which is entirely possible), I think this suggests there’s something in the NVIDIA L4T root filesystem which is required for IPv4 networking and which isn’t in my core-image-base rootfs. I’m planning to compare the systemd configuration between core-image-base and the L4T root filesystem in more detail, but I’m wondering if anyone else has noticed this yet or has suggestions about what I should try.

quaresmajose commented 5 years ago

I have seen problems with LLDP packets on the TX2 with 28.3.

The problem is present when I use systemd-networkd. I can’t reproduce it with the NVIDIA SDK, because the SDK uses NetworkManager instead of systemd-networkd.

The problem happens when I receive an LLDP packet. The packet does not show up in tcpdump, and after it is received the network driver stops generating interrupts. You can check for this as follows.

Check whether you can receive any LLDP packets:

tcpdump -ni eth0 -e ether proto 0x88cc
[  244.935449] Unsupported IOCTL call
[  244.954384] Unsupported IOCTL call
[  244.966121] device eth0 entered promiscuous mode

If you do receive LLDP packets, you don't have this problem.

At the same time, you can check the interrupts of the network driver:

egrep ether_qos /proc/interrupts

When the Tegra receives an LLDP packet, the interrupts stop and it doesn't receive anything further.

The network can be reset with: ip link set eth0 down; ip link set eth0 up
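
The interrupt check described above can be scripted so that two samples are compared automatically. This is only a sketch under stated assumptions: irq_total and irq_stalled are hypothetical helper names, the ether_qos match string comes from this thread, and the parsing assumes the usual /proc/interrupts layout of an IRQ label followed by one counter per CPU.

```shell
#!/bin/sh
# Sum the per-CPU counters of every /proc/interrupts line containing a name.
# Reads the table from stdin so it can also be run on a saved copy.
irq_total() {
    awk -v pat="$1" '
        index($0, pat) {
            # Field 1 is the IRQ label; counters follow until the first
            # non-numeric field (the interrupt controller name).
            for (i = 2; i <= NF && $i ~ /^[0-9]+$/; i++) total += $i
        }
        END { print total + 0 }'
}

# Succeeds if the summed counters did not change over the sampling window.
# $1: name to match, $2: window in seconds (defaults to 5).
irq_stalled() {
    before=$(irq_total "$1" < /proc/interrupts)
    sleep "${2:-5}"
    after=$(irq_total "$1" < /proc/interrupts)
    [ "$after" -eq "$before" ]
}

# Usage on the target, ideally while a ping is running so traffic is expected:
#   irq_stalled ether_qos 5 && echo "no new ether_qos interrupts in 5s"
```

An idle link can also show unchanged counters, so a stall is only meaningful while something is actively trying to use the interface.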

I think systemd-networkd uses something that the ether_qos network driver does not support. To work around that, I disable LLDP in systemd-networkd via /etc/systemd/network/eth.network:

[Network]

# fix LLDP on TX2
LLDP=0

dwalkes commented 5 years ago

Thanks for the suggestion @tzopik. When I try

 tcpdump -ni eth0 -e ether proto 0x88cc

I get

[  197.254281] device eth0 entered promiscuous mode
[  197.259029] audit: type=1700 audit(1563200454.016:4): dev=eth0 prom=256 old_prom=0 auid=4294967295 uid=0 gid=0 ses=4294967295
[  197.270552] audit: type=1300 audit(1563200454.016:4): arch=c00000b7 syscall=208 success=yes exit=0 a0=3 a1=107 a2=1 a3=7fc84ec540 items=0 ppid=4289 pid=5050 auid=4294967295 uid=0 gid=0 eu)
[  197.297531] audit: type=1327 audit(1563200454.016:4): proctitle=74637064756D70002D6E690065746830002D650065746865720070726F746F00307838386363

After the event occurs the interrupts keep advancing on common_irq

root@jetson-tx2:~# egrep ether_qos /proc/interrupts
 41:        130          0          0          0     GICv2 226 Level     ether_qos.common_irq
 43:        469          0          0          0     GICv2 222 Level     2490000.ether_qos.rx0
 44:        223          0          0          0     GICv2 218 Level     2490000.ether_qos.tx0
root@jetson-tx2:~# egrep ether_qos /proc/interrupts
 41:        130          0          0          0     GICv2 226 Level     ether_qos.common_irq
 43:        469          0          0          0     GICv2 222 Level     2490000.ether_qos.rx0
 44:        223          0          0          0     GICv2 218 Level     2490000.ether_qos.tx0

So it looks like this particular issue might be different from the one on 28.3, but possibly related.

xkentr commented 5 years ago

@tzopik, @dwalkes I tried LLDP=0 on meta-tegra/warrior and this seems to resolve this issue for me. Ethernet connection has been stable for 10 minutes+, and it recovers properly from link up/down. Previously, I could not get Ethernet to work at all when the device was connected to MikroTik or Ubiquiti routers.

Contents of the /etc/systemd/network/eth.network file I tested with:

[Match]
Name=eth*

[Network]
DHCP=v4
LLDP=0

[DHCPv4]
UseHostname=false

Thanks for the hint.
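
For anyone wanting to apply the same workaround from a script, here is a minimal sketch. write_eth_network is a hypothetical helper name. The file contents mirror the configuration posted in this thread, with the [Match] header taken from the first post, and the target path in the usage comment is the one mentioned above.

```shell
#!/bin/sh
# Write a systemd-networkd configuration with LLDP reception disabled.
# $1: destination file, e.g. /etc/systemd/network/eth.network on the target.
write_eth_network() {
    cat > "$1" <<'EOF'
[Match]
Name=eth*

[Network]
DHCP=v4
LLDP=0

[DHCPv4]
UseHostname=false
EOF
}

# Usage on the target (requires root):
#   write_eth_network /etc/systemd/network/eth.network
#   systemctl restart systemd-networkd.service
```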

madisongh commented 5 years ago

I haven't had a problem with the Ethernet interface coming up on warrior, but I was using ifupdown and the networking initscript instead of systemd-networkd. I did have to modify how ifupdown was invoking udhcpc to deal with the delay caused by spanning tree on my Ubiquiti switch.

I've just switched over to systemd-networkd, and am still not seeing an issue with warrior - Ethernet was up over half an hour. I'm running both IPv4 and IPv6 (just link-local) on my local network. Are you all running any IPv6?

dwalkes commented 5 years ago

> I tried LLDP=0 on meta-tegra/warrior and this seems to resolve this issue for me.

Agreed, I see the same behavior. With LLDP=0, my ping test has run successfully over IPv4 for more than an hour. My previous record was just over 10 minutes.

> Are you all running any IPv6?

Only for test purposes related to this issue.

I've pushed a workaround to my meta-mender-community tegra layer for now, since mender is where the systemd-networkd dependency is coming from.

madisongh commented 5 years ago

I haven't been able to track down the actual root cause yet, but for now I've added a replacement for the standard systemd wired network config file that OE-Core provides which adds the LLDP=no setting.

Disabling IPv6 is another approach that seems to work, if you really want LLDP turned on.

compenguy commented 5 years ago

The commit to the warrior branch doesn't do anything because the upstream systemd recipe in the openembedded-core warrior branch doesn't reference wired.network.

If I add this to the recipe

SRC_URI += "file://wired.network"
FILES_${PN} += "${systemd_unitdir}/network/80-wired.network"

do_install_append () {
    install -D -m0644 ${WORKDIR}/wired.network ${D}${systemd_unitdir}/network/80-wired.network
}

Then it correctly includes the wired.network file in the rootfs, and fixes the network issue for me as well (with LLDP enabled, the ethernet link goes down in <30 seconds after it comes up for me).
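
Since the first attempt at the fix silently left the file out of the image, it may be worth checking the generated rootfs directly. The sketch below is an illustration only: rootfs_disables_lldp is a hypothetical helper name, the rootfs path in the usage comment is a placeholder, and /lib/systemd/network is assumed to be where ${systemd_unitdir}/network ends up in the image (adjust if your distro configuration differs). It accepts the boolean spellings 0, no, false and off.

```shell
#!/bin/sh
# Print the first .network file under a rootfs that disables LLDP and
# return success; return failure if no such file is found.
# $1: path to an unpacked root filesystem.
rootfs_disables_lldp() {
    for f in "$1"/etc/systemd/network/*.network \
             "$1"/lib/systemd/network/*.network; do
        [ -f "$f" ] || continue   # skip unmatched globs
        if grep -Eiq '^[[:space:]]*LLDP[[:space:]]*=[[:space:]]*(0|no|false|off)[[:space:]]*$' "$f"; then
            echo "$f"
            return 0
        fi
    done
    return 1
}

# Usage on the build host (the path is a placeholder for your image's rootfs):
#   rootfs_disables_lldp /path/to/unpacked/rootfs || echo "LLDP workaround missing"
```

The check only confirms that some file carries the setting. It does not verify that systemd-networkd actually selects that file for eth0.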

madisongh commented 5 years ago

Thanks @compenguy for catching that. Will post a fixup shortly.

madisongh commented 5 years ago

This should be working now in warrior and master with the workarounds applied. Feel free to reopen if the issue reappears.

maxlapshin commented 4 years ago

Thank you a lot for this comment.

I've spent several days trying to catch the issue and couldn't even guess that there can be such a problem.