antrea-io / antrea

Kubernetes networking based on Open vSwitch
https://antrea.io

Flaky OVS related e2e tests on the dedicated flexible IPAM testbed #6458

Open · luolanzone opened this issue 1 month ago

luolanzone commented 1 month ago

The following test cases failed on the dedicated flexible IPAM testbed for patch release 2.0.1, in two different builds:

    --- FAIL: TestConnectivity/testOVSFlowReplay (13.23s)
    --- FAIL: TestAntreaIPAM/testAntreaIPAMOVSRestartSameNode (57.44s)

The output is shown below; we need to check whether there is a way to improve the robustness of these e2e tests (a possible retry-based mitigation is sketched after the logs).

=== RUN   TestConnectivity/testOVSFlowReplay
    connectivity_test.go:401: Creating 2 toolbox test Pods on 'antrea-ipam-ds-0-1'
    connectivity_test.go:76: Waiting for Pods to be ready and retrieving IPs
    connectivity_test.go:90: Retrieved all Pod IPs: map[test-pod-0-xtevxzsj:IPv4(192.168.249.214),IPstrings(192.168.249.214) test-pod-1-8hl08o1o:IPv4(192.168.249.215),IPstrings(192.168.249.215)]
    connectivity_test.go:101: Ping mesh test between all Pods
    connectivity_test.go:118: Ping 'testconnectivity-vxj2py28/test-pod-0-xtevxzsj' -> 'testconnectivity-vxj2py28/test-pod-1-8hl08o1o': OK
    connectivity_test.go:118: Ping 'testconnectivity-vxj2py28/test-pod-1-8hl08o1o' -> 'testconnectivity-vxj2py28/test-pod-0-xtevxzsj': OK
    connectivity_test.go:417: The Antrea Pod for Node 'antrea-ipam-ds-0-1' is 'antrea-agent-49ccx'
    connectivity_test.go:434: Counted 153 flow in OVS bridge 'br-int' for Node 'antrea-ipam-ds-0-1'
    connectivity_test.go:453: Counted 13 group in OVS bridge 'br-int' for Node 'antrea-ipam-ds-0-1'
    connectivity_test.go:462: Deleting flows / groups and restarting OVS daemons on Node 'antrea-ipam-ds-0-1'
    connectivity_test.go:484: Error when restarting OVS with ovs-ctl: command terminated with exit code 137 - stdout:  * Saving flows
         * Exiting ovsdb-server (46)
         - stderr: 
    fixtures.go:531: Deleting Pod 'test-pod-1-8hl08o1o'
    fixtures.go:531: Deleting Pod 'test-pod-0-xtevxzsj'
=== RUN   TestAntreaIPAM/testAntreaIPAMOVSRestartSameNode
    connectivity_test.go:337: Creating two toolbox test Pods on 'antrea-ipam-ds-0-1'
    fixtures.go:579: Creating a test Pod 'test-pod-cbshe5lp' and waiting for IP
    fixtures.go:579: Creating a test Pod 'test-pod-tx1u5na4' and waiting for IP
    connectivity_test.go:378: Restarting antrea-agent on Node 'antrea-ipam-ds-0-1'
    connectivity_test.go:359: Arping loss rate: 12.000000%
    connectivity_test.go:366: ARPING 192.168.240.101
        42 bytes from fe:3d:79:b0:98:c6 (192.168.240.101): index=0 time=435.896 usec
        42 bytes from fe:3d:79:b0:98:c6 (192.168.240.101): index=1 time=5.171 usec
        42 bytes from fe:3d:79:b0:98:c6 (192.168.240.101): index=2 time=5.139 usec
        42 bytes from fe:3d:79:b0:98:c6 (192.168.240.101): index=3 time=4.372 usec
        42 bytes from fe:3d:79:b0:98:c6 (192.168.240.101): index=4 time=11.172 usec
        42 bytes from fe:3d:79:b0:98:c6 (192.168.240.101): index=5 time=5.443 usec
        42 bytes from fe:3d:79:b0:98:c6 (192.168.240.101): index=6 time=5.019 usec
        42 bytes from fe:3d:79:b0:98:c6 (192.168.240.101): index=7 time=5.078 usec
        Timeout
        Timeout
        Timeout
        42 bytes from fe:3d:79:b0:98:c6 (192.168.240.101): index=8 time=246.665 usec
        42 bytes from fe:3d:79:b0:98:c6 (192.168.240.101): index=9 time=4.870 usec
        42 bytes from fe:3d:79:b0:98:c6 (192.168.240.101): index=10 time=5.219 usec
        42 bytes from fe:3d:79:b0:98:c6 (192.168.240.101): index=11 time=5.511 usec
        42 bytes from fe:3d:79:b0:98:c6 (192.168.240.101): index=12 time=4.585 usec
        42 bytes from fe:3d:79:b0:98:c6 (192.168.240.101): index=13 time=4.923 usec
        42 bytes from fe:3d:79:b0:98:c6 (192.168.240.101): index=14 time=4.654 usec
        42 bytes from fe:3d:79:b0:98:c6 (192.168.240.101): index=15 time=4.562 usec
        42 bytes from fe:3d:79:b0:98:c6 (192.168.240.101): index=16 time=4.628 usec
        42 bytes from fe:3d:79:b0:98:c6 (192.168.240.101): index=17 time=4.466 usec
        42 bytes from fe:3d:79:b0:98:c6 (192.168.240.101): index=18 time=5.012 usec
        42 bytes from fe:3d:79:b0:98:c6 (192.168.240.101): index=19 time=5.201 usec
        42 bytes from fe:3d:79:b0:98:c6 (192.168.240.101): index=20 time=5.826 usec
        42 bytes from fe:3d:79:b0:98:c6 (192.168.240.101): index=21 time=4.985 usec

        --- 192.168.240.101 statistics ---
        25 packets transmitted, 22 packets received,  12%!u(MISSING)nanswered (0 extra)
        rtt min/avg/max/std-dev = 0.004/0.036/0.436/0.101 ms
    connectivity_test.go:384: Arping test failed: arping loss rate is 12.000000%
    fixtures.go:531: Deleting Pod 'test-pod-tx1u5na4'
    fixtures.go:531: Deleting Pod 'test-pod-cbshe5lp'
    connectivity_test.go:337: Creating two toolbox test Pods on 'antrea-ipam-ds-0-1'
    fixtures.go:579: Creating a test Pod 'test-pod-pqk2v325' and waiting for IP
    fixtures.go:579: Creating a test Pod 'test-pod-rm7xh2oj' and waiting for IP
    connectivity_test.go:378: Restarting antrea-agent on Node 'antrea-ipam-ds-0-1'
    connectivity_test.go:359: Arping loss rate: 8.000000%
    fixtures.go:531: Deleting Pod 'test-pod-pqk2v325'
    fixtures.go:531: Deleting Pod 'test-pod-rm7xh2oj'
=== RUN   TestAntreaIPAM/testAntreaIPAMOVSFlowReplay
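
For reference, one way to make the restart step more tolerant of transient failures could look roughly like the sketch below. This is only a sketch under assumptions: the helper type and signature are hypothetical, not the actual `test/e2e` code, and it assumes that exit code 137 means the `ovs-ctl` command was killed (SIGKILL), e.g. by a command timeout, so a retry has a chance of succeeding.

```go
// Hypothetical sketch only: a retry wrapper around the OVS restart step of
// the e2e test. The runRestartOnce type is an assumption, not an existing
// helper in test/e2e.
package e2e

import (
	"fmt"
	"time"
)

// runRestartOnce stands in for executing the ovs-ctl restart command inside
// the antrea-agent Pod and returning its output.
type runRestartOnce func() (stdout, stderr string, err error)

// restartOVSWithRetry retries the restart a few times with a fixed backoff so
// that a single transient kill of ovs-ctl does not fail the whole test.
func restartOVSWithRetry(run runRestartOnce, attempts int, backoff time.Duration) error {
	var lastErr error
	for i := 0; i < attempts; i++ {
		stdout, stderr, err := run()
		if err == nil {
			return nil
		}
		lastErr = fmt.Errorf("restart attempt %d failed: %w - stdout: %s - stderr: %s", i+1, err, stdout, stderr)
		time.Sleep(backoff)
	}
	return lastErr
}
```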
luolanzone commented 1 month ago

@gran-vmv @antoninbas let me know if I missed any fix that I should backport to v2.0. I have a vague impression that there is a fix related to OVS restart, but I couldn't find any clue.

antoninbas commented 1 month ago

@luolanzone I think you can ignore these failures for the patch release. AFAIK, we didn't backport anything related to this to the release-2.0 branch. We did have some similar failures for the main branch (not specific to FlexibleIPAM), with the following related changes:

1) #5777 delayed realization of the Pod network on Agent start, without changing the logic for removing flow-restore-wait. At the time we didn't observe any failure, because all e2e testing was using the coverage image, which was flawed in a way that hid the issue. This change is part of release-2.0.
2) #6090 improved code coverage collection and removed the flaw that existed in the coverage image. After merging this change, we started observing failures for testOVSRestartSameNode.
3) #6342 resolved the issue by correctly delaying the removal of flow-restore-wait from the bridge configuration.
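
For illustration only, the ordering that the fix enforces can be sketched roughly as below. This is not the actual Antrea agent code; it assumes flow-restore-wait is cleared via `ovs-vsctl` on the Open_vSwitch table, which is how the OVS documentation describes the mechanism.

```go
// Rough sketch of the intended ordering, not the actual Antrea code paths:
// flow-restore-wait is only cleared after the agent has finished installing
// its flows, so ovs-vswitchd never forwards traffic with an empty flow table.
package ovsrestore

import (
	"fmt"
	"os/exec"
)

// removeFlowRestoreWait clears other_config:flow-restore-wait on the
// Open_vSwitch table; ovs-vswitchd resumes normal packet processing once the
// key is removed.
func removeFlowRestoreWait() error {
	out, err := exec.Command("ovs-vsctl", "--timeout=10",
		"remove", "Open_vSwitch", ".", "other_config", "flow-restore-wait").CombinedOutput()
	if err != nil {
		return fmt.Errorf("failed to remove flow-restore-wait: %v (output: %s)", err, out)
	}
	return nil
}

// agentStartSequence installs (or replays) all flows first, and only then
// removes flow-restore-wait.
func agentStartSequence(installInitialFlows func() error) error {
	if err := installInitialFlows(); err != nil {
		return err
	}
	return removeFlowRestoreWait()
}
```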

While the issue caused by #5777 should affect testing for the release-2.0 branch, in practice we should not observe test failures as long as we are using a coverage-enabled image built with bincover. So maybe you could double-check that the FlexibleIPAM e2e tests are using the correct image - or rather images, since we have one for the Agent and one for the Controller?
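
If it helps, a quick way to confirm which images the testbed is actually running could look like the sketch below. It assumes the default kube-system namespace and the standard antrea-agent DaemonSet / antrea-controller Deployment names; if I recall correctly, the coverage-enabled images include "coverage" in their name, so that should be visible in the output.

```go
// Hypothetical helper to print the images used by the Antrea Agent and
// Controller on the testbed, so we can check whether the coverage-enabled
// images are deployed.
package main

import (
	"context"
	"fmt"
	"log"
	"os"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", os.Getenv("KUBECONFIG"))
	if err != nil {
		log.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		log.Fatal(err)
	}
	ctx := context.Background()

	// The Agent runs as a DaemonSet and the Controller as a Deployment, both
	// in kube-system by default.
	ds, err := client.AppsV1().DaemonSets("kube-system").Get(ctx, "antrea-agent", metav1.GetOptions{})
	if err != nil {
		log.Fatal(err)
	}
	for _, c := range ds.Spec.Template.Spec.Containers {
		fmt.Printf("antrea-agent container %q uses image %s\n", c.Name, c.Image)
	}

	deploy, err := client.AppsV1().Deployments("kube-system").Get(ctx, "antrea-controller", metav1.GetOptions{})
	if err != nil {
		log.Fatal(err)
	}
	for _, c := range deploy.Spec.Template.Spec.Containers {
		fmt.Printf("antrea-controller container %q uses image %s\n", c.Name, c.Image)
	}
}
```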

We cannot backport #6090 and #6342 to release-2.0 as these are pretty significant changes, which should not go into a patch release. If you do observe similar test failures for the main branch, then we would have to look into it. BTW, these failures should also exist for release-1.15 and the upcoming 1.15.2 patch release, because #5777 is also part of the release-1.15 branch.