canonical / lxd

Powerful system container and virtual machine manager
https://canonical.com/lxd
GNU Affero General Public License v3.0

lxd-ci: Debug why `tests/network-ovn` peering test fails on GHA runners but succeeds locally #13069

simondeziel opened this issue 7 months ago

Running PURGE_LXD=1 ./bin/local-run tests/network-ovn latest/edge peering works locally, even when installing the 6.5 Azure kernel in a LXD VM. On GHA runners, however, it consistently fails at this point:

    echo "==> Test that pinging external addresses between networks does worth without peering (goes via uplink)"
    lxc exec ovn2 --project=ovn2 -- ping -nc1 -4 -w5 198.51.100.2
    lxc exec ovn2 --project=ovn2 -- ping -nc1 -6 -w5 2001:db8:1:2::2
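
For anyone reproducing this, the local invocation described above amounts to something like the following (the clone step is an assumption; PURGE_LXD=1 presumably wipes any pre-existing LXD install before the run):

    # Run the single peering test locally against latest/edge
    git clone https://github.com/canonical/lxd-ci
    cd lxd-ci
    PURGE_LXD=1 ./bin/local-run tests/network-ovn latest/edge peering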

I captured some information from a tmate debug session:

root@fv-az520-983:~# lxc ls --all-projects
+---------+------+---------+---------------------+-----------------------------------------------+-----------+-----------+
| PROJECT | NAME |  STATE  |        IPV4         |                     IPV6                      |   TYPE    | SNAPSHOTS |
+---------+------+---------+---------------------+-----------------------------------------------+-----------+-----------+
| ovn1    | ovn1 | RUNNING | 198.51.100.2 (eth0) | fd42:a663:6118:2961:216:3eff:fed0:54b9 (eth0) | CONTAINER | 0         |
|         |      |         | 198.51.100.1 (eth0) | 2001:db8:1:2::2 (eth0)                        |           |           |
|         |      |         | 10.153.20.2 (eth0)  | 2001:db8:1:2::1 (eth0)                        |           |           |
+---------+------+---------+---------------------+-----------------------------------------------+-----------+-----------+
| ovn2    | ovn2 | RUNNING | 10.143.162.2 (eth0) | fd42:489c:e694:94ea:216:3eff:fec1:59b3 (eth0) | CONTAINER | 0         |
+---------+------+---------+---------------------+-----------------------------------------------+-----------+-----------+

root@fv-az520-983:~# lxc network ls --project ovn1
+------+------+---------+----------------+---------------------------+-------------+---------+---------+
| NAME | TYPE | MANAGED |      IPV4      |           IPV6            | DESCRIPTION | USED BY |  STATE  |
+------+------+---------+----------------+---------------------------+-------------+---------+---------+
| ovn1 | ovn  | YES     | 10.153.20.1/24 | fd42:a663:6118:2961::1/64 |             | 1       | CREATED |
+------+------+---------+----------------+---------------------------+-------------+---------+---------+
root@fv-az520-983:~# lxc network ls --project ovn2
+------+------+---------+-----------------+---------------------------+-------------+---------+---------+
| NAME | TYPE | MANAGED |      IPV4       |           IPV6            | DESCRIPTION | USED BY |  STATE  |
+------+------+---------+-----------------+---------------------------+-------------+---------+---------+
| ovn2 | ovn  | YES     | 10.143.162.1/24 | fd42:489c:e694:94ea::1/64 |             | 0       | CREATED |
+------+------+---------+-----------------+---------------------------+-------------+---------+---------+

root@fv-az520-983:~# lxc network peer list ovn1 --project ovn1
+------+-------------+------+-------+
| NAME | DESCRIPTION | PEER | STATE |
+------+-------------+------+-------+
root@fv-az520-983:~# lxc network peer list ovn2 --project ovn2
+---------+-------------+---------+---------+
|  NAME   | DESCRIPTION |  PEER   |  STATE  |
+---------+-------------+---------+---------+
| ovn2foo |             | Unknown | ERRORED |
+---------+-------------+---------+---------+
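
For completeness, the errored peering can be inspected from the LXD side with something like this (a sketch; the network and peer names are taken from the output above):

    # Show the peering object that reports ERRORED above
    lxc network peer show ovn2 ovn2foo --project ovn2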

The firewall rules look OK, except for that unusual one in the security table:

 + nft list ruleset
table inet lxd {
    chain pstrt.lxdbr0 {
        type nat hook postrouting priority srcnat; policy accept;
        ip saddr 10.10.10.0/24 ip daddr != 10.10.10.0/24 masquerade
        ip6 saddr fd42:4242:4242:1010::/64 ip6 daddr != fd42:4242:4242:1010::/64 masquerade
    }

    chain fwd.lxdbr0 {
        type filter hook forward priority filter; policy accept;
        ip version 4 oifname "lxdbr0" accept
        ip version 4 iifname "lxdbr0" accept
        ip6 version 6 oifname "lxdbr0" accept
        ip6 version 6 iifname "lxdbr0" accept
    }

    chain in.lxdbr0 {
        type filter hook input priority filter; policy accept;
        iifname "lxdbr0" tcp dport 53 accept
        iifname "lxdbr0" udp dport 53 accept
        iifname "lxdbr0" icmp type { destination-unreachable, time-exceeded, parameter-problem } accept
        iifname "lxdbr0" udp dport 67 accept
        iifname "lxdbr0" icmpv6 type { destination-unreachable, packet-too-big, time-exceeded, parameter-problem, nd-router-solicit, nd-neighbor-solicit, nd-neighbor-advert, mld2-listener-report } accept
        iifname "lxdbr0" udp dport 547 accept
    }

    chain out.lxdbr0 {
        type filter hook output priority filter; policy accept;
        oifname "lxdbr0" tcp sport 53 accept
        oifname "lxdbr0" udp sport 53 accept
        oifname "lxdbr0" icmp type { destination-unreachable, time-exceeded, parameter-problem } accept
        oifname "lxdbr0" udp sport 67 accept
        oifname "lxdbr0" icmpv6 type { destination-unreachable, packet-too-big, time-exceeded, parameter-problem, echo-request, nd-router-advert, nd-neighbor-solicit, nd-neighbor-advert, mld2-listener-report } accept
        oifname "lxdbr0" udp sport 547 accept
    }
}
table ip security {
    chain OUTPUT {
        type filter hook output priority 150; policy accept;
        meta l4proto tcp ip daddr 168.63.129.16 tcp dport 53 counter packets 0 bytes 0 accept
        meta l4proto tcp ip daddr 168.63.129.16 skuid 0 counter packets 1446 bytes 405667 accept
        meta l4proto tcp ip daddr 168.63.129.16 ct state invalid,new counter packets 0 bytes 0 drop
    }
}

However, deleting it didn't help; the pings still don't go through.
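
(Here, "deleting it" means dropping the whole table, presumably installed by the Azure Linux agent, i.e. something like:)

    # Drop the "security" table shown above (rules for 168.63.129.16)
    sudo nft delete table ip security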

Switching from ping to nc -zv ... 22 doesn't help either; same timeout.
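
(A hypothetical expansion of that elided command, assuming the same destination as the failing ping:)

    # TCP connect check to port 22 instead of ICMP, with a 5s timeout
    lxc exec ovn2 --project=ovn2 -- nc -zv -w5 198.51.100.2 22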

The attached azure.pcap.gz (a gz-compressed pcap) shows that at some point the echo reply just vanishes. This contrasts with a capture from a local VM (with the Azure kernel): local.pcap.gz

simondeziel commented 7 months ago

@tomponline any idea as to what's going on here? Or what I could try next? If not, my next move will be to try on Canonical-hosted runners.

tomponline commented 7 months ago

@simondeziel how do I get the test scripts to stop destroying the env when I run them manually?

I need the env left in the same state it was in when the failure occurred.

simondeziel commented 7 months ago

@tomponline the cleanup handling is reworked in https://github.com/canonical/lxd-ci/pull/94

tomponline commented 7 months ago

Here's something funny: running sudo tcpdump -nn -i lxdbr0 (which switches the bridge into promiscuous mode) makes it work, and exiting tcpdump breaks it again :)
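
A quick way to confirm that promiscuous mode itself is the variable, with tcpdump out of the loop (a sketch):

    # -p observes without enabling promiscuous mode (expect it to still fail)
    sudo tcpdump -p -nn -i lxdbr0
    # or toggle promiscuous mode directly and re-run the ping test
    sudo ip link set lxdbr0 promisc on
    sudo ip link set lxdbr0 promisc off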

tomponline commented 7 months ago

lxc network set lxdbr0 ipv4.nat=false fixes it.
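
(For reference, the workaround and its revert; ipv4.nat defaults to true on the default lxdbr0:)

    # Disable IPv4 source NAT on the bridge
    lxc network set lxdbr0 ipv4.nat=false
    # revert to the default
    lxc network set lxdbr0 ipv4.nat=true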

tomponline commented 7 months ago

Considering whether we should alter the SNAT rule so that it only applies to traffic leaving the bridge via a non-bridge interface, e.g.:

    nft add rule inet lxd pstrt.lxdbr0 ip saddr 10.10.10.0/24 ip daddr != 10.10.10.0/24 oif != lxdbr0 masquerade

This also fixes it, as it avoids performing SNAT on intra-network traffic when the source address doesn't match that of the main network.
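
A sketch for trying that by hand on a runner, reusing the addresses from the ruleset above (oifname is the string-match variant of oif):

    # Replace the masquerade rules with ones guarded by the egress interface
    sudo nft flush chain inet lxd pstrt.lxdbr0
    sudo nft add rule inet lxd pstrt.lxdbr0 \
        ip saddr 10.10.10.0/24 ip daddr != 10.10.10.0/24 oifname != lxdbr0 masquerade
    sudo nft add rule inet lxd pstrt.lxdbr0 \
        ip6 saddr fd42:4242:4242:1010::/64 ip6 daddr != fd42:4242:4242:1010::/64 oifname != lxdbr0 masquerade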

simondeziel commented 21 hours ago

This run, https://github.com/canonical/lxd-ci/actions/runs/11300430475/job/31433290050?pr=311#step:10:3879, includes the rmmod br_netfilter workaround but still failed:

==> Ping internal and external NIC route addresses over peer connection
+ lxc exec ovn2 --project=ovn2 -- ping -nc1 -4 -w5 198.51.100.1
PING 198.51.100.1 (198.51.100.1) 56(84) bytes of data.

--- 198.51.100.1 ping statistics ---
5 packets transmitted, 0 received, 100% packet loss, time 4078ms

That said, tests/network-ovn "feels" more reliable... for what it's worth.
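
For reference, the rmmod br_netfilter workaround mentioned above boils down to the following; the sysctl lines are an alternative with the same intended effect (an assumption, not taken from the run logs):

    # Stop bridged traffic from traversing the IP firewall hooks
    sudo rmmod br_netfilter
    # alternative: keep the module loaded but disable its hooks
    sudo sysctl -w net.bridge.bridge-nf-call-iptables=0
    sudo sysctl -w net.bridge.bridge-nf-call-ip6tables=0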