Open simondeziel opened 7 months ago
@tomponline any idea as to what's going on here? Or what I could try next? If not, my next move will be to try on Canonical-hosted runners.
@simondeziel how do I get the test scripts to stop destroying the env when I run them manually?
I need the env left in the same state it was in when the failure occurred.
@tomponline the cleanup handling is reworked in https://github.com/canonical/lxd-ci/pull/94
Here's something funny: running sudo tcpdump -nn -i lxdbr0
(which switches the bridge into promiscuous mode) makes it work, and exiting tcpdump breaks it again :)
lxc network set lxdbr0 ipv4.nat=false
fixes it.
Considering whether we should alter the SNAT rule so that it only applies to traffic leaving the bridge via a non-bridge interface, e.g.
nft add rule inet lxd pstrt.lxdbr0 ip saddr 10.10.10.0/24 ip daddr != 10.10.10.0/24 oif != lxdbr0 masquerade
This also fixes it, by not performing SNAT on intra-network traffic when the source address doesn't match that of the main network.
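To make the proposed rule easier to review for other bridges, here is a minimal sketch that only builds the rule text. The table and chain names (inet lxd, pstrt.lxdbr0) and the subnet are copied from the comment above; build_snat_rule is a hypothetical helper and not part of LXD. It does not remove the masquerade rule that LXD already installed, so that step would still have to be handled separately.

```shell
#!/bin/sh
# Sketch only: prints the narrowed masquerade rule, it does not apply it.
# build_snat_rule is a hypothetical helper name, not an LXD or nft command.
# $1 = bridge name, $2 = bridge IPv4 subnet in CIDR notation.
build_snat_rule() {
    bridge="$1"
    subnet="$2"
    # Masquerade only when the packet leaves the subnet AND exits through an
    # interface other than the bridge itself.
    printf 'add rule inet lxd pstrt.%s ip saddr %s ip daddr != %s oif != %s masquerade\n' \
        "$bridge" "$subnet" "$subnet" "$bridge"
}

# Print the rule for review; piping the output into `nft -f -` as root
# would apply it (assumption: the inet lxd table and chain already exist).
build_snat_rule lxdbr0 10.10.10.0/24
```

Printing rather than applying keeps the sketch safe to run on a machine where the failure is still being investigated.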
https://github.com/canonical/lxd-ci/actions/runs/11300430475/job/31433290050?pr=311#step:10:3879 has the rmmod br_netfilter
workaround but failed nevertheless:
==> Ping internal and external NIC route addresses over peer connection
+ lxc exec ovn2 --project=ovn2 -- ping -nc1 -4 -w5 198.51.100.1
PING 198.51.100.1 (198.51.100.1) 56(84) bytes of data.
--- 198.51.100.1 ping statistics ---
5 packets transmitted, 0 received, 100% packet loss, time 4078ms
That said, tests/network-ovn
"feels" more reliable... for what it's worth.
PURGE_LXD=1 ./bin/local-run tests/network-ovn latest/edge peering
works locally even when installing the 6.5 Azure kernel in a LXD VM. However, it consistently fails on GHA runners at this point:
I captured some information from a tmate debug session:
The firewall rules look OK except for that unusual
security
table:
However, deleting it didn't help; the pings still don't go through.
Switching from ping to
nc -zv ... 22
doesn't help either: same timeout.
The gzip-compressed capture azure.pcap.gz shows that at some point the echo reply just vanishes. This contrasts with a capture from a local VM (with the Azure kernel): local.pcap.gz