We only have one node where this problem re-appears often, and no serial console is possible there either. It's a PicoStation M2 without VPN which often has a lot of clients and some mesh partners. The issue isn't related to hopglass-server, respondd or alfred; it's just that the neighbor nodes still report seeing the broken node, as layer 2 still works.
@kpanic23 Did you see this issue on nodes that have a VPN link? I did not. I suspect that this is the same as #605, or at least similar to it. My workaround script probably doesn't catch it because it assumes the connection is fine as long as a batctl ping to the gateway succeeds.
For someone capable of reproducing the issue frequently on a specific node: could you assign a static IPv4 address on br-client and check whether both IPv4 and IPv6 fail, or just one of these protocol families? A longer ping run comparing IPv4 vs. IPv6 packet loss could be interesting, too.
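Something along these lines should work; the 10.222.0.0/24 range and the placeholder link-local address are just examples, so adjust them to an unused range in your mesh:
# on the affected node: add a temporary static IPv4 address to br-client
ip addr add 10.222.0.2/24 dev br-client
# on a second node or client in the same layer-2 segment
ip addr add 10.222.0.1/24 dev br-client
# then compare the two protocol families against the affected node
ping 10.222.0.2
ping6 <node-link-local>%br-client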
Also, could affected people state their Gluon version and link their site.conf?
As I don't have easy access to a node that often has the issue, we have to count on @kpanic23. Regarding the Gluon version, they and we are both using 2016.2.4; our site.conf is here: https://github.com/tecff/site-ffa/blob/stable/site.conf (technically the issue happened with 2016.2.3 - the .4 update took place today)
@rotanid I don't think it's the same as #605, because it also happens on nodes which are solely meshing via cable. I'd rather think it might be something in batman-adv.
Our site(s) are here: https://github.com/ff3l/site-ff3l/tree/2016.2
The main problem in trying to debug this is that you really don't have the slightest idea which node will be affected next. As I have written, one of my nodes (which unfortunately is a WA860RE, so no quick opening and connecting of a serial interface) had this problem twice, and after power-cycling the last time that particular node has been running flawlessly for months. I don't have the slightest idea either how to provoke this behaviour or how to predict where it will happen next. Frankly, I'm totally clueless.
OK, I didn't know about cable-mesh nodes. We have a node where it happens often, but we also have no possibility to attach a console.
I may have had a similar issue: the node (Ubiquiti Outdoor+) was powered up, but after some time it was shown as offline (meshviewer). It has mesh-on-WAN and 11s mesh enabled. Via link-local on the WAN interface I was able to get an SSH session. It seems the bat0 interface was down; after a reboot everything was fine.
Were you able to look at logread/dmesg via the link-local SSH connection?
Maybe related to b7eeef9b04b44a70b2a953c4efe35a3fdceba2db?
root@ff3l-GU-Haltingen-Offloader:~# ping6 fe80::62e3:27ff:fee7:56ce%br-mesh_lan
PING fe80::62e3:27ff:fee7:56ce%br-mesh_lan (fe80::62e3:27ff:fee7:56ce%br-mesh_lan): 56 data bytes
^C
--- fe80::62e3:27ff:fee7:56ce%br-mesh_lan ping statistics ---
11 packets transmitted, 0 packets received, 100% packet loss
root@ff3l-GU-Haltingen-Offloader:~# batctl p 60:e3:27:e7:56:ce
PING 60:e3:27:e7:56:ce (02:1b:28:67:eb:53) 20(48) bytes of data
20 bytes from 60:e3:27:e7:56:ce icmp_seq=1 ttl=50 time=0.12 ms
20 bytes from 60:e3:27:e7:56:ce icmp_seq=2 ttl=50 time=0.12 ms
20 bytes from 60:e3:27:e7:56:ce icmp_seq=3 ttl=50 time=0.11 ms
^C--- 60:e3:27:e7:56:ce ping statistics ---
3 packets transmitted, 3 received, 0% packet loss
rtt min/avg/max/mdev = 0.110/0.115/0.119/0.004 ms
root@ff3l-GU-Haltingen-Offloader:~#
@kpanic23 Your ping6 command looks wrong: the IP address used on the lower layers (br-wan/br-mesh_lan) should look completely different from the primary MAC address, so this can't work.
Oops... I've copied the wrong lines while desperately trying to communicate on any interface. Nevertheless, you can't ping that node on any interface with any IP address. It seems to be completely deaf.
Why I'm especially interested in IPv4 vs. IPv6 comparisons: I saw that your latest site version adds the gluon-ebtables-segment-mld package, and an IPv4 vs. IPv6 comparison could indicate whether your issues are somehow related to that package. That is, if IPv4 works and IPv6 doesn't, then it is quite likely an issue regarding bridge multicast snooping. If both fail, then it has nothing to do with bridge multicast snooping.
Maybe you could just add static IPv4 addresses to some random nodes you have access to and run ping/ping6 in the background for a few days, while capturing with tcpdump 'icmp or icmp6'?
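Roughly like this, using the placeholder addresses from the sketch above (the log paths are arbitrary, and this assumes tcpdump is installed on the capturing machine):
# run both pings in the background for a few days, logging the results
ping 10.222.0.2 > /tmp/ping4.log 2>&1 &
ping6 <node-link-local>%br-client > /tmp/ping6.log 2>&1 &
# capture the ICMP/ICMPv6 traffic in parallel
tcpdump -i br-client 'icmp or icmp6' -w /tmp/icmp.pcap &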
Are all the nodes involved running Gluon >= 2016.2 (in particular the nodes you are trying to ping from/to)? Do your gateway nodes involve bridges on top of bat0 (and if so, do they have multicast_router=2 set for bat0)?
@T-X Yes, the gluon-ebtables-segment-mld package has been included since we moved to 2016.2.1. Our gateways do indeed use bridges on top of bat[0-32], and no, they have multicast_router set to 1. Might that be the problem? Shall I change that setting to 2? Or maybe build a new firmware omitting that package? Only 9 of our nodes are currently running Gluon < 2016.2; the majority uses 2016.2.4 (see statistics on https://map.ff3l.net)
Ah, okay, the missing multicast_router=2 could well be the cause, yes! Is your map server running on one of these gateways? Did you also run the ping6 tests from your gateway?
For clarification: multicast_router=2 forces all multicast traffic onto this bridge port, no matter whether a multicast router or multicast listener was detected behind it. This is needed now because with the segment-mld package we shrink the bridge's horizon concerning multicast listeners (by refusing to forward IGMP/MLD into the mesh). Similar to the unicast MAC addresses in the bridge, batman-adv now takes over this role in between (and distributes multicast listener status far more efficiently than IGMP/MLD, via the reactive batman-adv translation table).
A multicast router port means that all multicast traffic will be copied there, even if no multicast listener was detected.
EDIT: To avoid confusion: only setting multicast_router=2 on the bat0 bridge port should be needed. (hrm, even after explaining it a few times on IRC I still suck at explaining this, and I guess people will continue to stumble upon this... I'll try to add one more sentence to the package description)
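On a gateway where bat0 is a port of the client bridge, that should look roughly like this (the sysfs path only exists while bat0 is actually enslaved in a bridge, and the setting is not persistent across reboots):
# set the multicast router flag on the bat0 bridge port
echo 2 > /sys/class/net/bat0/brport/multicast_router
# verify the setting
cat /sys/class/net/bat0/brport/multicast_router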
We have a similar setup; most nodes run >= 2016.2.4. At the moment there's only one node showing the issue, but that didn't change after doing "echo 2 > /sys/class/net/bat0/brport/multicast_router".
Okay. If the IPv4 vs. IPv6 ping is too difficult for you for now, here is an easier thing we could check first: a) What does "ip neigh" say - is the IPv6 address resolved fine, or is it already stuck at address resolution? If it is stuck at address resolution, then next to a missing entry you should see ICMPv6 neighbor solicitations instead of ICMPv6 echo requests in tcpdump. If that's the case, could you check whether setting an IPv6 address <-> MAC address entry manually helps?
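From a client or gateway with a full iproute2, that check could look like the following (the node's address, its MAC and the interface name br-client are placeholders; on a gateway use whatever bridge faces the mesh):
# inspect the neighbor cache for the affected node's IPv6 address
ip -6 neigh show dev br-client
# if resolution is stuck, tcpdump should show neighbor solicitations instead of echo requests
tcpdump -i br-client icmp6
# try a static neighbor entry to bypass address resolution
ip -6 neigh replace <node-ipv6> lladdr <node-mac> dev br-client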
@T-X I have tested pinging from my client, the gateways and one of the other nodes in the local mesh. No change. Should multicast_router=2 be set on the bat interface or on the bridge containing the bat interface? Also, where should I run "ip neigh"? And where tcpdump? On Gluon there is no "ip neigh". On the gateway it dumps a list of thousands of IPv4/MAC addresses. Nothing gets stuck.
Also, I don't understand why a setting on the gateway disrupts IPv6 connectivity in the local mesh (which should even work without any connection to any gateway)
Hi,
I have a node with the very same issue, but I managed to connect to it via the link-local address of ibss0.2 (we use VLANs on top of the ibss interfaces). The node seems not to apply the IP addresses to br-client, and bat0 is not part of the bridge:
root@Hinterhof-LinkesZentrum-Neu:~# ip a show dev br-client
15: br-client: <BROADCAST,MULTICAST> mtu 1500 qdisc noop
link/ether e6:5f:e0:41:97:2a brd ff:ff:ff:ff:ff:ff
root@Hinterhof-LinkesZentrum-Neu:~# brctl show
bridge name bridge id STP enabled interfaces
br-mesh_lan 7fff.8ec7ee4ea314 no eth0
br-wan 7fff.8ec7ee4ea310 no eth1
br-client 8000.e65fe041972a no client0
root@Hinterhof-LinkesZentrum-Neu:~# batctl if
ibss0.2: active
primary0: active
br-wan: active
Since bat0 is not part of br-client, this explains why the node is not reachable with anything but batctl ping.
A reboot usually fixes this, and it also only occurs randomly after a reboot of a node. I suspect that some interface is not yet ready when it is being configured. I'm trying to find out more about the issue.
Hardware: TP-Link TL-WR841N/ND v10
Gluon version: 2016.2.4
The node doesn't use mesh vpn but has mesh on lan activated.
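For anyone who can still reach an affected node over a link-local address, a quick check and manual workaround to try might be the following (untested sketch, using the same tools as in the output above):
# check whether bat0 is currently a port of br-client
brctl show
# if it is missing, try re-adding it by hand and see whether layer 3 connectivity returns
brctl addif br-client bat0
# or let netifd rebuild the interface configuration
/etc/init.d/network restart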
Update:
From what I can see in the kernel logs, bat0 gets added to the bridge but just moments later it is removed again.
[ 0.000000] Linux version 3.18.44 (jenkins@lambdacore) (gcc version 4.8.3 (OpenWrt/Linaro GCC 4.8-2014.04 r49261) ) #11 Sat Mar 25 01:32:59 CET 2017
[ 0.000000] MyLoader: sysp=8d0734c9, boardp=6e1d1b6c, parts=d7a9fd94
[ 0.000000] bootconsole [early0] enabled
[ 0.000000] CPU0 revision is: 00019374 (MIPS 24Kc)
[ 0.000000] SoC: Qualcomm Atheros QCA9533 ver 2 rev 0
[ 0.000000] Determined physical RAM map:
[ 0.000000] memory: 02000000 @ 00000000 (usable)
[ 0.000000] Initrd not found or empty - disabling initrd
[ 0.000000] Zone ranges:
[ 0.000000] Normal [mem 0x00000000-0x01ffffff]
[ 0.000000] Movable zone start for each node
[ 0.000000] Early memory node ranges
[ 0.000000] node 0: [mem 0x00000000-0x01ffffff]
[ 0.000000] Initmem setup node 0 [mem 0x00000000-0x01ffffff]
[ 0.000000] On node 0 totalpages: 8192
[ 0.000000] free_area_init_node: node 0, pgdat 803c2a30, node_mem_map 81000000
[ 0.000000] Normal zone: 64 pages used for memmap
[ 0.000000] Normal zone: 0 pages reserved
[ 0.000000] Normal zone: 8192 pages, LIFO batch:0
[ 0.000000] Primary instruction cache 64kB, VIPT, 4-way, linesize 32 bytes.
[ 0.000000] Primary data cache 32kB, 4-way, VIPT, cache aliases, linesize 32 bytes
[ 0.000000] pcpu-alloc: s0 r0 d32768 u32768 alloc=1*32768
[ 0.000000] pcpu-alloc: [0] 0
[ 0.000000] Built 1 zonelists in Zone order, mobility grouping on. Total pages: 8128
[ 0.000000] Kernel command line: board=TL-WR841N-v9 console=ttyS0,115200 rootfstype=squashfs,jffs2 noinitrd
[ 0.000000] PID hash table entries: 128 (order: -3, 512 bytes)
[ 0.000000] Dentry cache hash table entries: 4096 (order: 2, 16384 bytes)
[ 0.000000] Inode-cache hash table entries: 2048 (order: 1, 8192 bytes)
[ 0.000000] Writing ErrCtl register=00000000
[ 0.000000] Readback ErrCtl register=00000000
[ 0.000000] Memory: 27996K/32768K available (2855K kernel code, 151K rwdata, 576K rodata, 248K init, 200K bss, 4772K reserved)
[ 0.000000] SLUB: HWalign=32, Order=0-3, MinObjects=0, CPUs=1, Nodes=1
[ 0.000000] NR_IRQS:51
[ 0.000000] Clocks: CPU:650.000MHz, DDR:392.602MHz, AHB:216.666MHz, Ref:25.000MHz
[ 0.000000] Calibrating delay loop... 432.53 BogoMIPS (lpj=2162688)
[ 0.060000] pid_max: default: 32768 minimum: 301
[ 0.060000] Mount-cache hash table entries: 1024 (order: 0, 4096 bytes)
[ 0.070000] Mountpoint-cache hash table entries: 1024 (order: 0, 4096 bytes)
[ 0.080000] NET: Registered protocol family 16
[ 0.080000] MIPS: machine is TP-LINK TL-WR841N/ND v9
[ 0.530000] Switched to clocksource MIPS
[ 0.540000] NET: Registered protocol family 2
[ 0.540000] TCP established hash table entries: 1024 (order: 0, 4096 bytes)
[ 0.540000] TCP bind hash table entries: 1024 (order: 0, 4096 bytes)
[ 0.550000] TCP: Hash tables configured (established 1024 bind 1024)
[ 0.560000] TCP: reno registered
[ 0.560000] UDP hash table entries: 256 (order: 0, 4096 bytes)
[ 0.570000] UDP-Lite hash table entries: 256 (order: 0, 4096 bytes)
[ 0.570000] NET: Registered protocol family 1
[ 0.580000] PCI: CLS 0 bytes, default 32
[ 0.590000] futex hash table entries: 256 (order: -1, 3072 bytes)
[ 0.610000] squashfs: version 4.0 (2009/01/31) Phillip Lougher
[ 0.610000] jffs2: version 2.2 (NAND) (SUMMARY) (LZMA) (RTIME) (CMODE_PRIORITY) (c) 2001-2006 Red Hat, Inc.
[ 0.620000] msgmni has been set to 54
[ 0.630000] io scheduler noop registered
[ 0.630000] io scheduler deadline registered (default)
[ 0.640000] Serial: 8250/16550 driver, 16 ports, IRQ sharing enabled
[ 0.650000] console [ttyS0] disabled
[ 0.670000] serial8250.0: ttyS0 at MMIO 0x18020000 (irq = 11, base_baud = 1562500) is a 16550A
[ 0.680000] console [ttyS0] enabled
[ 0.690000] bootconsole [early0] disabled
[ 0.700000] m25p80 spi0.0: found gd25q32, expected m25p80
[ 0.700000] m25p80 spi0.0: gd25q32 (4096 Kbytes)
[ 0.710000] 5 tp-link partitions found on MTD device spi0.0
[ 0.720000] Creating 5 MTD partitions on "spi0.0":
[ 0.720000] 0x000000000000-0x000000020000 : "u-boot"
[ 0.730000] 0x000000020000-0x00000015955c : "kernel"
[ 0.730000] 0x00000015955c-0x0000003f0000 : "rootfs"
[ 0.740000] mtd: device 2 (rootfs) set to be root filesystem
[ 0.740000] 1 squashfs-split partitions found on MTD device rootfs
[ 0.750000] 0x000000380000-0x0000003f0000 : "rootfs_data"
[ 0.760000] 0x0000003f0000-0x000000400000 : "art"
[ 0.760000] 0x000000020000-0x0000003f0000 : "firmware"
[ 0.790000] libphy: ag71xx_mdio: probed
[ 1.390000] ag71xx-mdio.1: Found an AR934X built-in switch
[ 1.430000] eth0: Atheros AG71xx at 0xba000000, irq 5, mode:GMII
[ 2.030000] ag71xx ag71xx.0: connected to PHY at ag71xx-mdio.1:04 [uid=004dd042, driver=Generic PHY]
[ 2.040000] eth1: Atheros AG71xx at 0xb9000000, irq 4, mode:MII
[ 2.050000] TCP: cubic registered
[ 2.050000] NET: Registered protocol family 10
[ 2.060000] NET: Registered protocol family 17
[ 2.060000] bridge: automatic filtering via arp/ip/ip6tables has been deprecated. Update your scripts to load br_netfilter if you need this.
[ 2.070000] Bridge firewalling registered
[ 2.080000] 8021q: 802.1Q VLAN Support v1.8
[ 2.090000] VFS: Mounted root (squashfs filesystem) readonly on device 31:2.
[ 2.100000] Freeing unused kernel memory: 248K (803e2000 - 80420000)
[ 3.220000] init: Console is alive
[ 3.220000] init: - watchdog -
[ 5.270000] init: - preinit -
[ 5.890000] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
[ 5.920000] random: procd urandom read with 9 bits of entropy available
[ 9.230000] jffs2: notice: (367) jffs2_build_xattr_subsystem: complete building xattr subsystem, 0 of xdatum (0 unchecked, 0 orphan) and 0 of xref (0 dead, 0 orphan) found.
[ 9.250000] mount_root: switching to jffs2 overlay
[ 9.450000] procd: - early -
[ 9.450000] procd: - watchdog -
[ 10.140000] procd: - ubus -
[ 11.160000] procd: - init -
[ 12.180000] l2tp_core: L2TP core driver, V2.0
[ 12.180000] l2tp_netlink: L2TP netlink interface
[ 12.190000] l2tp_eth: L2TP ethernet pseudowire support (L2TPv3)
[ 12.200000] l2tp_ip: L2TP IP encapsulation support (L2TPv3)
[ 12.210000] l2tp_ip6: L2TP IP encapsulation support for IPv6 (L2TPv3)
[ 12.230000] ip6_tables: (C) 2000-2006 Netfilter Core Team
[ 12.260000] Loading modules backported from Linux version wt-2016-06-20-0-gbc17424
[ 12.270000] Backport generated by backports.git backports-20160216-7-g5735958
[ 12.360000] batman_adv: B.A.T.M.A.N. advanced 2016.2 (compatibility version 15) loaded
[ 12.380000] u32 classifier
[ 12.390000] input device check on
[ 12.390000] Actions configured
[ 12.400000] Mirror/redirect action on
[ 12.520000] Ebtables v2.0 registered
[ 12.530000] ip_tables: (C) 2000-2006 Netfilter Core Team
[ 12.630000] nf_conntrack version 0.5.0 (441 buckets, 1764 max)
[ 12.710000] xt_time: kernel timezone is -0000
[ 12.770000] ath: EEPROM regdomain: 0x0
[ 12.770000] ath: EEPROM indicates default country code should be used
[ 12.770000] ath: doing EEPROM country->regdmn map search
[ 12.770000] ath: country maps to regdmn code: 0x3a
[ 12.770000] ath: Country alpha2 being used: US
[ 12.770000] ath: Regpair used: 0x3a
[ 12.780000] ieee80211 phy0: Selected rate control algorithm 'minstrel_ht'
[ 12.790000] ieee80211 phy0: Atheros AR9531 Rev:2 mem=0xb8100000, irq=47
[ 23.580000] device eth0 entered promiscuous mode
[ 23.580000] IPv6: ADDRCONF(NETDEV_UP): br-mesh_lan: link is not ready
[ 23.600000] device eth1 entered promiscuous mode
[ 23.620000] br-wan: port 1(eth1) entered forwarding state
[ 23.620000] br-wan: port 1(eth1) entered forwarding state
[ 23.730000] IPv6: ADDRCONF(NETDEV_UP): br-client: link is not ready
[ 23.770000] device br-client entered promiscuous mode
[ 23.780000] IPv6: ADDRCONF(NETDEV_UP): local-node: link is not ready
[ 25.080000] br-wan: port 1(eth1) entered disabled state
[ 26.350000] ath: EEPROM regdomain: 0x8114
[ 26.350000] ath: EEPROM indicates we should expect a country code
[ 26.350000] ath: doing EEPROM country->regdmn map search
[ 26.350000] ath: country maps to regdmn code: 0x37
[ 26.350000] ath: Country alpha2 being used: DE
[ 26.350000] ath: Regpair used: 0x37
[ 26.350000] ath: regdomain 0x8114 dynamically updated by user
[ 26.900000] eth1: link up (100Mbps/Full duplex)
[ 26.900000] br-wan: port 1(eth1) entered forwarding state
[ 26.910000] br-wan: port 1(eth1) entered forwarding state
[ 27.920000] batman_adv: bat0: Adding interface: primary0
[ 27.920000] batman_adv: bat0: Interface activated: primary0
[ 27.930000] 8021q: adding VLAN 0 to HW filter on device bat0
[ 27.940000] batman_adv: bat0: Adding interface: br-wan
[ 27.950000] batman_adv: bat0: The MTU of interface br-wan is too small (1500) to handle the transport of batman-adv packets. Packets going over this interface will be fragmented on layer2 which could impact the performance. Setting the MTU to 1532 would solve the problem.
[ 27.970000] batman_adv: bat0: Interface activated: br-wan
[ 28.040000] device bat0 entered promiscuous mode
[ 28.040000] br-client: port 1(bat0) entered forwarding state
[ 28.050000] br-client: port 1(bat0) entered forwarding state
[ 28.110000] IPv6: ADDRCONF(NETDEV_CHANGE): br-client: link becomes ready
[ 28.110000] IPv6: ADDRCONF(NETDEV_CHANGE): local-node: link becomes ready
[ 28.230000] batman_adv: bat0: Interface deactivated: br-wan
[ 28.240000] batman_adv: bat0: Removing interface: br-wan
[ 28.370000] batman_adv: bat0: Interface deactivated: primary0
[ 28.430000] batman_adv: bat0: Removing interface: primary0
[ 28.500000] br-client: port 1(bat0) entered disabled state
[ 28.560000] device bat0 left promiscuous mode
[ 28.560000] br-client: port 1(bat0) entered disabled state
[ 28.660000] device br-client left promiscuous mode
[ 28.910000] br-wan: port 1(eth1) entered forwarding state
[ 29.800000] IPv6: ADDRCONF(NETDEV_UP): client0: link is not ready
[ 29.860000] device client0 entered promiscuous mode
[ 30.210000] IPv6: ADDRCONF(NETDEV_CHANGE): client0: link becomes ready
[ 30.260000] batman_adv: bat0: Adding interface: primary0
[ 30.270000] batman_adv: bat0: Interface activated: primary0
[ 30.280000] 8021q: adding VLAN 0 to HW filter on device bat0
[ 30.330000] batman_adv: bat0: Adding interface: br-wan
[ 30.330000] batman_adv: bat0: The MTU of interface br-wan is too small (1500) to handle the transport of batman-adv packets. Packets going over this interface will be fragmented on layer2 which could impact the performance. Setting the MTU to 1532 would solve the problem.
[ 30.360000] batman_adv: bat0: Interface activated: br-wan
[ 30.380000] IPv6: ADDRCONF(NETDEV_UP): ibss0: link is not ready
[ 30.440000] ibss0: Created IBSS using preconfigured BSSID b2:ca:ff:ee:ba:be
[ 30.450000] ibss0: Creating new IBSS network, BSSID b2:ca:ff:ee:ba:be
[ 30.460000] IPv6: ADDRCONF(NETDEV_CHANGE): ibss0: link becomes ready
[ 32.460000] batman_adv: bat0: Adding interface: ibss0.2
[ 32.460000] batman_adv: bat0: Interface activated: ibss0.2
[ 33.980000] batman_adv: bat0: Changing gw mode from: off to: client
[ 34.010000] batman_adv: bat0: hop_penalty: Changing from: 30 to: 15
[ 34.010000] batman_adv: bat0: multicast_mode: Changing from: enabled to: disabled
[ 34.040000] batman_adv: bat0: orig_interval: Changing from: 1000 to: 5000
[ 35.110000] random: nonblocking pool is initialized
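The relevant add/remove sequence is easy to pull out of such a log, e.g. with something like:
# extract the batman-adv and bridge port events from the boot log
dmesg | grep -E 'batman_adv: bat0|br-client: port'
# on a running node, logread can be used the same way
logread | grep -E 'batman_adv: bat0|br-client: port'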
Might commit d452a7c2cf1c0da4e034666a50dc0e7aa9ddc592 fix that issue?
@kpanic23 No, that one just caused kernel crashes. I think this issue might be a netifd bug, possibly the same as #905.
Please test with the current master (e45c30330d2eede2cec39c3082ad5e6bdc40a6ba or later), I've refactored the batman-adv interface management to make it much more robust.
@NeoRaider Nailed it! That really seems to have fixed the problem. At the moment we have 24 nodes, which had been constantly locking up before, running a current master (ad9878b), and we have not had a single lockup so far. A few of the nodes, e.g. https://map.freifunk-3laendereck.net/#!v:m;n:60e327e756ce, are rebooting by themselves once or twice a day, but that's waaaay better than locking up. Actually I'm thinking about signing that master as the new stable firmware ;)
We are noticing strange network lockups on our nodes. The nodes somehow lose network connectivity on layer 3: they are not pingable, no SSH, nothing. But they can still be pinged via "batctl ping" on layer 2. There seems to be no pattern to which node is affected; the number of clients, the traffic - everything seems to be quite random. One of my own nodes at home was affected twice, with two days in between. I just cycled the power both times and have never had a problem with it since or before. Strangely, I have seen NTP packets from these nodes going to our NTP servers while they were affected. The affected nodes appear on the map (hopglass) as online, but with no message since X. They get listed in the list of disappeared nodes, but in green. On the map itself, they are shown twice, once online and once offline.
Since there is no chance to connect to those nodes remotely, none of them has a serial console attached, and one can't anticipate which node will be affected next, there is unfortunately no chance of getting any logs. We have noticed this behaviour since January, when we first experimented with hopglass. On our old alfred-based meshviewer map they were simply listed as offline, so it never occurred to us that they were still pingable via batctl. So I suppose this problem has existed for quite a bit longer.