fabianishere / udm-kernel

Custom Linux kernels for the UniFi Dream Machine
https://github.com/fabianishere/udm-kernel-tools

Adoption fails? #26

Open issmirnov opened 1 year ago

issmirnov commented 1 year ago

Hey @fabianishere ,

I installed the kernel and was able to get BGP multipath routing working - yay!

However, overnight my devices all rebooted after an upgrade and entered a repeated failed adoption loop. Tried the usual set-inform and reboots, but nothing helped.

After reverting to the stock kernel and rebooting, adoption worked on the whole network again.

Any thoughts? Ideally I need both multipath and my fleet to stay up 😄

fabianishere commented 1 year ago

Can you SSH into the UDM (Pro)? It would be useful to share the kernel log (dmesg) and UniFi log (cat /var/log/messages).

issmirnov commented 1 year ago

Will do once I get home. This morning I saw a ton of error messages complaining about multipath, and DHCP error logs.

issmirnov commented 1 year ago

This is with the edge kernel:

/var/log/messages

...
Jan 10 16:03:05 UDM-Pro user.warn ubios-udapi-server: netlink: Multipath routes not supported, got 3 nexthops for route 10.5.0.10/32 via 10.3.34.201 dev br0
Jan 10 16:03:05 UDM-Pro user.warn kernel: [   89.072888] [LAN_IN-D-2006] IN=br3 OUT=br0 MAC=78:45:58:86:50:df:7c:d9:5c:2a:04:05:08:00 SRC=192.168.6.39 DST=10.3.33.1 LEN=396 TOS=0x00 PREC=0x00 TTL=63 ID=23536 PROTO=TCP SPT=8009 DPT=44648 WINDOW=4096 RES=0x00 ACK PSH URGP=0 
Jan 10 16:03:05 UDM-Pro user.warn kernel: [   89.141830] [LAN_IN-D-2006] IN=br3 OUT=br0 MAC=78:45:58:86:50:df:7c:d9:5c:2a:04:05:08:00 SRC=192.168.6.39 DST=10.3.33.228 LEN=396 TOS=0x00 PREC=0x00 TTL=63 ID=18633 PROTO=TCP SPT=8009 DPT=54125 WINDOW=4096 RES=0x00 ACK PSH URGP=0 
Jan 10 16:03:05 UDM-Pro user.warn kernel: [   89.283025] [LAN_IN-D-2006] IN=br3 OUT=br0 MAC=78:45:58:86:50:df:1c:f2:9a:0f:27:e9:08:00 SRC=192.168.6.122 DST=10.3.33.1 LEN=397 TOS=0x00 PREC=0x00 TTL=63 ID=6893 PROTO=TCP SPT=8009 DPT=58554 WINDOW=4096 RES=0x00 ACK PSH URGP=0 
Jan 10 16:03:06 UDM-Pro user.warn ubios-udapi-server: netlink: Multipath routes not supported, got 2 nexthops for route 10.50.0.15/32 via 192.168.4.22 dev br5
Jan 10 16:03:06 UDM-Pro user.warn ubios-udapi-server: netlink: Multipath routes not supported, got 2 nexthops for route 10.50.0.11/32 via 192.168.4.22 dev br5
Jan 10 16:03:06 UDM-Pro user.warn ubios-udapi-server: netlink: Multipath routes not supported, got 2 nexthops for route 10.50.0.14/32 via 192.168.4.21 dev br5
Jan 10 16:03:06 UDM-Pro user.warn ubios-udapi-server: netlink: Multipath routes not supported, got 3 nexthops for route 10.50.0.12/32 via 192.168.4.21 dev br5
Jan 10 16:03:06 UDM-Pro user.warn ubios-udapi-server: netlink: Multipath routes not supported, got 2 nexthops for route 10.5.0.13/32 via 10.3.34.202 dev br0
Jan 10 16:03:06 UDM-Pro user.warn ubios-udapi-server: netlink: Multipath routes not supported, got 2 nexthops for route 10.5.0.12/32 via 10.3.34.202 dev br0
Jan 10 16:03:06 UDM-Pro user.warn ubios-udapi-server: netlink: Multipath routes not supported, got 2 nexthops for route 10.5.0.14/32 via 10.3.34.202 dev br0
Jan 10 16:03:06 UDM-Pro user.warn ubios-udapi-server: netlink: Multipath routes not supported, got 3 nexthops for route 10.5.0.10/32 via 10.3.34.201 dev br0
Jan 10 16:03:06 UDM-Pro daemon.warn dnsmasq-dhcp[3156]: no address range available for DHCP request via br0
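The multipath warnings are easy to lose among the firewall log lines. A minimal sketch of a filter that lists each affected route once with its reported nexthop count; the function name `summarize_multipath` is hypothetical, and it assumes the message format shown in the excerpt above:

```shell
# Read syslog-style text on stdin and print "<route> <nexthops>" once per
# distinct pair found in the "Multipath routes not supported" warnings.
# Assumes the message wording seen in the excerpt above.
summarize_multipath() {
  grep 'Multipath routes not supported' |
    sed -n 's/.*got \([0-9][0-9]*\) nexthops for route \([^ ]*\).*/\2 \1/p' |
    sort -u
}
```

On the UDM this could be run as `summarize_multipath < /var/log/messages`.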

dmesg:

[  101.938051] [LAN_IN-RET-2005] IN=br3 OUT=br0 MAC=78:45:58:86:50:df:48:d6:d5:7b:57:84:08:00 SRC=192.168.6.55 DST=10.3.37.71 LEN=1343 TOS=0x00 PREC=0x00 TTL=63 ID=11838 DF PROTO=TCP SPT=32244 DPT=55232 WINDOW=538 RES=0x00 ACK PSH URGP=0 
[  102.110401] [LAN_IN-D-2006] IN=br3 OUT=br0 MAC=78:45:58:86:50:df:1c:f2:9a:0f:27:e9:08:00 SRC=192.168.6.122 DST=10.3.33.228 LEN=397 TOS=0x00 PREC=0x00 TTL=63 ID=46966 PROTO=TCP SPT=8009 DPT=54126 WINDOW=4096 RES=0x00 ACK PSH URGP=0 
[  102.314922] [LAN_IN-RET-2005] IN=br3 OUT=br0 MAC=78:45:58:86:50:df:48:d6:d5:7b:57:84:08:00 SRC=192.168.6.55 DST=10.3.37.71 LEN=422 TOS=0x00 PREC=0x00 TTL=63 ID=11839 DF PROTO=TCP SPT=32244 DPT=55232 WINDOW=538 RES=0x00 ACK PSH URGP=0 
[  106.823440] [LAN_IN-RET-2005] IN=br3 OUT=br0 MAC=78:45:58:86:50:df:7c:d9:5c:2a:04:05:08:00 SRC=192.168.6.39 DST=10.3.37.71 LEN=52 TOS=0x00 PREC=0x00 TTL=63 ID=65085 PROTO=TCP SPT=8009 DPT=33110 WINDOW=4095 RES=0x00 ACK URGP=0 
[  106.825171] [LAN_IN-RET-2005] IN=br3 OUT=br0 MAC=78:45:58:86:50:df:7c:d9:5c:2a:04:05:08:00 SRC=192.168.6.39 DST=10.3.37.71 LEN=162 TOS=0x00 PREC=0x00 TTL=63 ID=65086 PROTO=TCP SPT=8009 DPT=33110 WINDOW=4096 RES=0x00 ACK PSH URGP=0 
[  111.855204] [LAN_IN-RET-2005] IN=br3 OUT=br0 MAC=78:45:58:86:50:df:1c:f2:9a:0f:27:e9:08:00 SRC=192.168.6.122 DST=10.3.37.71 LEN=52 TOS=0x00 PREC=0x00 TTL=63 ID=42073 PROTO=TCP SPT=8009 DPT=55726 WINDOW=4095 RES=0x00 ACK URGP=0 
[  111.856091] [LAN_IN-RET-2005] IN=br3 OUT=br0 MAC=78:45:58:86:50:df:1c:f2:9a:0f:27:e9:08:00 SRC=192.168.6.122 DST=10.3.37.71 LEN=162 TOS=0x00 PREC=0x00 TTL=63 ID=42074 PROTO=TCP SPT=8009 DPT=55726 WINDOW=4096 RES=0x00 ACK PSH URGP=0 
[  111.867595] [LAN_IN-RET-2005] IN=br3 OUT=br0 MAC=78:45:58:86:50:df:48:d6:d5:7b:57:84:08:00 SRC=192.168.6.55 DST=10.3.37.71 LEN=162 TOS=0x00 PREC=0x00 TTL=63 ID=8444 DF PROTO=TCP SPT=8009 DPT=51754 WINDOW=488 RES=0x00 ACK PSH URGP=0 
[  112.330584] [LAN_IN-RET-2005] IN=br3 OUT=br0 MAC=78:45:58:86:50:df:48:d6:d5:7b:57:84:08:00 SRC=192.168.6.55 DST=10.3.37.71 LEN=162 TOS=0x00 PREC=0x00 TTL=63 ID=11840 DF PROTO=TCP SPT=32244 DPT=55232 WINDOW=560 RES=0x00 ACK PSH URGP=0 
[  114.182943] [LAN_IN-D-2006] IN=br3 OUT=br0 MAC=78:45:58:86:50:df:48:d6:d5:7b:57:84:08:00 SRC=192.168.6.55 DST=10.3.37.190 LEN=60 TOS=0x00 PREC=0x00 TTL=63 ID=16460 DF PROTO=TCP SPT=35722 DPT=9000 WINDOW=29200 RES=0x00 SYN URGP=0 
[  114.203181] [LAN_IN-D-2006] IN=br3 OUT=br0 MAC=78:45:58:86:50:df:48:d6:d5:7b:57:84:08:00 SRC=192.168.6.55 DST=10.3.33.228 LEN=1241 TOS=0x00 PREC=0x00 TTL=63 ID=50130 DF PROTO=TCP SPT=32244 DPT=54341 WINDOW=503 RES=0x00 ACK PSH URGP=0 
[  115.172967] [LAN_IN-D-2006] IN=br3 OUT=br0 MAC=78:45:58:86:50:df:48:d6:d5:7b:57:84:08:00 SRC=192.168.6.55 DST=10.3.37.190 LEN=60 TOS=0x00 PREC=0x00 TTL=63 ID=16461 DF PROTO=TCP SPT=35722 DPT=9000 WINDOW=29200 RES=0x00 SYN URGP=0 
[  116.872963] [LAN_IN-RET-2005] IN=br3 OUT=br0 MAC=78:45:58:86:50:df:7c:d9:5c:2a:04:05:08:00 SRC=192.168.6.39 DST=10.3.37.71 LEN=52 TOS=0x00 PREC=0x00 TTL=63 ID=65087 PROTO=TCP SPT=8009 DPT=33110 WINDOW=4095 RES=0x00 ACK URGP=0 
[  116.873628] [LAN_IN-RET-2005] IN=br3 OUT=br0 MAC=78:45:58:86:50:df:7c:d9:5c:2a:04:05:08:00 SRC=192.168.6.39 DST=10.3.37.71 LEN=162 TOS=0x00 PREC=0x00 TTL=63 ID=65088 PROTO=TCP SPT=8009 DPT=33110 WINDOW=4096 RES=0x00 ACK PSH URGP=0 
[  117.179111] [LAN_IN-D-2006] IN=br3 OUT=br0 MAC=78:45:58:86:50:df:48:d6:d5:7b:57:84:08:00 SRC=192.168.6.55 DST=10.3.37.190 LEN=60 TOS=0x00 PREC=0x00 TTL=63 ID=16462 DF PROTO=TCP SPT=35722 DPT=9000 WINDOW=29200 RES=0x00 SYN URGP=0

Immediately after loading the edge kernel, all devices got un-adopted.

fabianishere commented 1 year ago

Are the devices reachable from SSH? I guess the UniFi controller is not liking the multipath routes (and might be misconfiguring your router).

issmirnov commented 1 year ago

Agreed.

No, the devices are not reachable. Interesting to see that on the edge kernel, there's a route added to the top of the routing table:

Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         67.190.92.1     0.0.0.0         UG    0      0        0 eth7

That's quite a strange one, not sure why it's happening.

Here's a full dump of configs: https://gist.github.com/issmirnov/cb467b4aee42442b8b734e77fdf1959f

fabianishere commented 1 year ago

You have no route for 192.168.1.0/24 (I would expect it to be there), so it is going to your default gateway. Also, any idea why your default route is through eth7 and not through the UDM's WAN ports (eth8 or eth9)?
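A minimal sketch of how one could check for a missing prefix without reading the whole table by eye. The helper name `has_route` is hypothetical; it only inspects the first field of each line of `ip route` style output fed to it on stdin:

```shell
# Succeed (exit 0) if the routing-table text on stdin contains an entry
# whose destination field equals the prefix given as $1; fail otherwise.
has_route() {
  awk -v p="$1" '$1 == p { found = 1 } END { exit !found }'
}
```

For example, `ip route show table all | has_route 192.168.1.0/24 || echo "no route for 192.168.1.0/24"` would report the situation described above. It matches the exact prefix only, so a covering route such as 192.168.0.0/16 would not count.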

issmirnov commented 1 year ago

I didn't change the default route settings, so I'm not sure why it's using eth7. My uplink is connected to port 9, with a failover WAN2 uplink on port 8. I'm using the SFP connector (port 11) to connect to my 10 Gbps backbone internally.

I could try adding a simple on-boot patch to the custom kernel to add the route. FWIW, I don't see the route for 192.168.1.0/24 in the stock kernel route -n output, even though adoption works fine.

I am curious though why simply switching the kernel changes the routing table so much.

And by the way, thank you so much for your responsiveness! I really do appreciate it.

fabianishere commented 1 year ago

Could you perhaps share the output of ip route and ip a as well?

issmirnov commented 1 year ago

Here are the latest: https://gist.github.com/issmirnov/3f62343e8221204402d85580b3b9b364

From what I can tell, both on edge and on stock kernel the outputs are identical, although the issue is 100% reproducible.

fabianishere commented 1 year ago

I guess the default route (default via 67.190.92.1 dev eth7 proto dhcp) on the edge kernel is causing issues. Could you try removing it from SSH to see if it resolves the issue:

ip route delete default via 67.190.92.1 dev eth7

issmirnov commented 1 year ago

Here's what it looks like on the edge kernel.

# ip route delete default via 67.190.92.1 dev eth7
# ip route get 192.168.1.20
192.168.1.20 via 192.168.0.1 dev eth8 table 201 src 192.168.0.4 uid 0 
    cache 
# ping 192.168.1.20
PING 192.168.1.20 (192.168.1.20): 56 data bytes
^C
--- 192.168.1.20 ping statistics ---
7 packets transmitted, 0 packets received, 100% packet loss

I tried messing around with various commands like ip route add 192.168.1.0/24 via 0.0.0.0 interface br0, but I'm not sure what the default management bridge interface looks like, so I was never able to get SSH working.
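For reference, iproute2 names the output device with the `dev` keyword rather than `interface`, and a directly connected route needs no `via`. A minimal dry-run sketch follows; the helper name `add_onlink_route` and the `APPLY` switch are hypothetical, and using br0 as the management bridge is an assumption, not something confirmed in this thread:

```shell
# Build an on-link route command for prefix $1 on device $2.
# Prints the command by default; runs it only when APPLY=1 is set.
add_onlink_route() {
  prefix=$1
  dev=$2
  set -- ip route add "$prefix" dev "$dev"
  if [ "${APPLY:-0}" = 1 ]; then
    "$@"
  else
    echo "$*"
  fi
}

# br0 is an assumption here; substitute the actual management bridge.
add_onlink_route 192.168.1.0/24 br0
```

Since the `ip route get` output above resolved through table 201, a route added to the main table may not be the one consulted; `ip rule show` lists which tables are looked up and in what order.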