freifunk-gluon / gluon

a modular framework for creating OpenWrt-based firmwares for wireless mesh nodes
https://gluon.readthedocs.io

network interface ends up in internal loopback mode #1553

Closed kpanic23 closed 4 years ago

kpanic23 commented 5 years ago

In rare cases, a node's network interface seems to end up in internal loopback mode, internally bridging RX and TX. So far I've noticed it twice, in both cases on a WR841N. Rebooting the router doesn't help at all.

dmesg is full of "br-client: received packet on eth0 with own address as source address" messages, with the system load going through the roof. In the current case, nothing is connected to the router's LAN ports. Setting eth0 down via "ip link set eth0 down" instantly fixes the problem.
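To check other nodes for the same symptom, one could count these bridge warnings in the kernel log. This is only a sketch: the function name count_own_addr_msgs is made up for illustration, and it matches the message text quoted above, which may differ slightly between kernel versions.

```shell
#!/bin/sh
# Count bridge "own address" warnings for one interface.
# Reads kernel log text from stdin and prints the number of matching lines.
# count_own_addr_msgs is a hypothetical helper, not part of Gluon or OpenWrt.
count_own_addr_msgs() {
    iface="$1"
    # -F: match the text literally, -c: print only the number of matching lines.
    # grep exits non-zero when nothing matches, so "|| true" keeps the
    # function usable under "set -e"; the printed count is still "0".
    grep -cF -- "received packet on ${iface} with own address as source address" || true
}
```

On a node this could be run as "dmesg | count_own_addr_msgs eth0". A count that keeps growing while nothing is plugged into the port would match the behaviour described here.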

It is possible that more routers I don't have access to have the same problem, resulting in high load.

In a previous case, the WAN port was affected. The node had Mesh-on-LAN and Mesh-on-WAN active, with the only cable plugged into one of the LAN ports. So disabling Mesh-on-WAN "fixed" the issue.

EDIT: To make things a little clearer: The current case affects a router which has no neighbor, neither wired nor wireless, and doesn't even have any clients currently connected.

neocturne commented 5 years ago

Also, it would be interesting to find out which component causes the issue. There are three different loopback modes built into the QCA953x: MAC, switch and PHY. The register definitions can be found in the datasheets at https://github.com/Deoptim/atheros .

kpanic23 commented 5 years ago
  • Which revisions of WR841N did you experience the issue on?

The first time it was a V11, this time it's a V9

  • Were you able to check if a power cycle fixes the issue?

The node has had a load of >5 for the last couple of months. It had been restarted a couple of times, but the load always spiked up again after a couple of minutes.

  • Did the devices report a carrier on the affected ports? (I assume it does, as otherwise the kernel should not attempt to send out packets)

I don't know, but I could check. I just need to know how :)

  • Neither the MAC nor the PHY loopback modes simulate a link, and I did not find a feature to force a link in the datasheet, so in my experiments packets were only looped when there was a cable connected to the ports in question.

Then maybe it's just a hardware fault?

neocturne commented 5 years ago
  • Which revisions of WR841N did you experience the issue on?

The first time it was a V11, this time it's a V9

  • Were you able to check if a power cycle fixes the issue?

The node has had a load of >5 for the last couple of months. It had been restarted a couple of times, but the load always spiked up again after a couple of minutes.

  • Did the devices report a carrier on the affected ports? (I assume it does, as otherwise the kernel should not attempt to send out packets)

I don't know, but I could check. I just need to know how :)

ip link shows NO-CARRIER for interfaces that are up, but don't have a carrier.

  • Neither the MAC nor the PHY loopback modes simulate a link, and I did not find a feature to force a link in the datasheet, so in my experiments packets were only looped when there was a cable connected to the ports in question.

Then maybe it's just a hardware fault?

If even a hard reset (power cycle, not just a reboot) did not fix the issue, it is likely faulty hardware. No register states should survive a power cycle (most are even reset on a soft reset).
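The carrier check mentioned above (NO-CARRIER in the ip link output) can also be scripted. A minimal sketch, assuming the usual iproute2 output format where the interface flags appear between angle brackets; carrier_state is a hypothetical helper name.

```shell
#!/bin/sh
# Classify the carrier state from "ip link show <iface>" output on stdin.
# Prints "no-carrier", "carrier" or "unknown".
# carrier_state is a hypothetical helper, not an existing tool.
carrier_state() {
    # Extract the comma-separated flag list between "<" and ">" from the
    # first line that has one.
    flags=$(sed -n 's/^[^<]*<\([^>]*\)>.*/\1/p' | head -n 1)
    # Wrap the list in commas so every flag can be matched as ",FLAG,".
    case ",$flags," in
        *,NO-CARRIER,*) echo "no-carrier" ;;
        *,LOWER_UP,*)   echo "carrier" ;;
        *)              echo "unknown" ;;
    esac
}
```

Usage would be "ip link show eth0 | carrier_state". Reading /sys/class/net/eth0/carrier (1 or 0 while the interface is up) should give the same information without parsing.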

kpanic23 commented 5 years ago

eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel master br-client state UP qlen 1000
    link/ether 14:cc:20:71:52:60 brd ff:ff:ff:ff:ff:ff

kpanic23 commented 5 years ago

Okay, I had set the interface up again for this test. 10 minutes later the load was at 8.84. Interface down, load falls again.
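To repeat this up/down test without watching the load by hand, the 1-minute load average can be compared against a threshold. This is just a sketch: load1_above is a hypothetical helper, and the threshold of 5 in the usage line is an arbitrary value picked from the numbers mentioned in this thread.

```shell
#!/bin/sh
# Succeed (exit status 0) if the 1-minute load average exceeds a threshold.
# Reads text in /proc/loadavg format from stdin; the first field is the
# 1-minute load. load1_above is a hypothetical helper name.
load1_above() {
    # "+ 0" forces a numeric comparison; with empty input "found" stays
    # unset, so the function fails instead of reporting a high load.
    awk -v t="$1" 'NR == 1 { found = ($1 + 0 > t + 0) } END { exit !found }'
}
```

For example: "load1_above 5 < /proc/loadavg && echo high". Running it once with eth0 up and once with eth0 down would reproduce the comparison described above.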

Now I'm wondering: eth0 is wired internally to a port in the built-in switch. Shouldn't it then always have a carrier, even when no LAN port has a cable plugged in?

neocturne commented 5 years ago

On the 841, the CPU port only has link when at least one of the hardware ports has link.

What does "swconfig dev eth0 show" show?

kpanic23 commented 5 years ago

Here's the output of the V11 I had the problem on about a year ago. But there the WAN port is affected. I had worked around it by setting the link down in rc.local...

kursaal-regie.txt

kpanic23 commented 5 years ago

A-HA!

link: port:0 link:up speed:1000baseT full-duplex txflow rxflow
link: port:1 link:up speed:100baseT full-duplex auto
link: port:2 link:down
link: port:3 link:down
link: port:4 link:down
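Port status lines like these can be filtered to list only the ports that report a link. A rough sketch, assuming the input has one "link: port:N link:..." entry per line, as swconfig prints it; ports_with_link is a made-up helper name.

```shell
#!/bin/sh
# Print the number of every switch port whose status line says "link:up".
# Reads swconfig port status lines from stdin, one port per line.
# ports_with_link is a hypothetical helper, not part of swconfig.
ports_with_link() {
    # Lines with "link:down" do not match and are dropped by "sed -n".
    sed -n 's/.*port:\([0-9][0-9]*\) link:up.*/\1/p'
}
```

Usage would be "swconfig dev eth0 show | ports_with_link". For the lines above it prints 0 and 1, so a second port reports a link in addition to the one running at 1000baseT.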

neocturne commented 5 years ago

Would it be possible to install a Gluon master on that node? The swconfig output is a bit more comprehensive in the new OpenWrt version.

kpanic23 commented 5 years ago

It is already running today's master build...

feuerwehr-switch.txt

I only copied the port lines ;)

rotanid commented 5 years ago

@kpanic23 What's the status here? Was it fixed by running master, or are you missing feedback from "us"? If so, don't hesitate to ask again via comment or IRC after a while, ideally directly addressing someone who could know more about the issue.

mweinelt commented 4 years ago

A bugfix for this behaviour seems unlikely, and as the ar71xx-tiny target is deprecated, I don't expect anything significant to happen here anymore.

Closing.