confidential-containers / cloud-api-adaptor

Ability to create Kata pods using cloud provider APIs aka the peer-pods approach
Apache License 2.0

securecomms: TestSshProxyReverse test is flaky #2153

Closed stevenhorsman closed 1 day ago

stevenhorsman commented 1 week ago

In the last few weeks I have often seen the CAA build job fail, e.g. https://github.com/confidential-containers/cloud-api-adaptor/actions/runs/11819549925/job/32929722455?pr=2146, due to an issue in the TestSshProxyReverse test:

=== RUN   TestSshProxyReverse
2024/11/13 14:44:54 [secure-comms] Attestation phase: peer reported phase Attestation
2024/11/13 14:44:54 [secure-comms] Attestation phase: peer reported phase Attestation
2024/11/13 14:44:54 [secure-comms] Inbound listening to port 7011 in namespace 
2024/11/13 14:44:54 [secure-comms] Attestation phase: AddInbound: XYZ
HttpClient start : http://127.0.0.1:7011/
HttpClient sending req: http://127.0.0.1:7011/
2024/11/13 14:44:54 [secure-comms] Attestation phase: Inbound accept: XYZ
2024/11/13 14:44:54 [secure-comms] Attestation phase: NewSshPeer - peer requested a tunnel channel for XYZ
2024/11/13 14:44:54 [secure-comms] Attestation phase: NewInboundInstance OpenChannel opening tunnel for: XYZ
2024/11/13 14:44:54 [secure-comms] Outbound XYZ accept dial address 127.0.0.1:7001 err: dial tcp 127.0.0.1:7001: connect: connection refused - closing channel
2024/11/13 14:44:54 [secure-comms] Attestation phase: Inbound XYZ channelReqs closed
HttpClient http://127.0.0.1:7011/ Error Get "http://127.0.0.1:7011/": EOF
    sshproxy_test.go:230: Failed - not successful
2024/11/13 14:44:54 [secure-comms] Attestation phase: peer reported it is upgrading to Kubernetes phase
2024/11/13 14:44:54 [secure-comms] Attestation phase: peer done by >>> Test Finish <<<
2024/11/13 14:44:54 [secure-comms] Attestation phase: peer done by >>> chans closed <<<
--- FAIL: TestSshProxyReverse (1.63s)

and a similar failure in TestSshProxyWithNamespace, in https://github.com/confidential-containers/cloud-api-adaptor/actions/runs/11819549925/job/32930561772?pr=2146

=== RUN   TestSshProxyWithNamespace
2024/11/13 14:58:28 [secure-comms] Attestation phase: peer reported phase Attestation
2024/11/13 14:58:28 [secure-comms] Attestation phase: peer reported phase Attestation
2024/11/13 14:58:28 [secure-comms] Inbound listening to port 7010 in namespace a64934bc-7bb7-4e83-8ad7-794edd3a1f55
2024/11/13 14:58:28 [secure-comms] Attestation phase: AddInbound: ABC
HttpClient start : http://127.0.0.1:7010/ in namepspace: /run/netns/a64934bc-7bb7-4e83-8ad7-794edd3a1f55
HttpClient dialing req: 127.0.0.1:7010 in namepspace: /run/netns/a64934bc-7bb7-4e83-8ad7-794edd3a1f55
2024/11/13 14:58:28 [secure-comms] Attestation phase: Inbound accept: ABC
2024/11/13 14:58:28 [secure-comms] Attestation phase: NewInboundInstance OpenChannel opening tunnel for: ABC
2024/11/13 14:58:28 [secure-comms] Attestation phase: NewSshPeer - peer requested a tunnel channel for ABC
2024/11/13 14:58:28 [secure-comms] Outbound ABC accept dial address 127.0.0.1:7020 err: dial tcp 127.0.0.1:7020: connect: connection refused - closing channel
2024/11/13 14:58:28 [secure-comms] Attestation phase: Inbound ABC channelReqs closed
HttpClient http://127.0.0.1:7010/ Get Error Get "http://127.0.0.1:7010/": EOF
    sshproxy_test.go:187: Failed - not successful
--- FAIL: TestSshProxyWithNamespace (2.23s)

These failures are starting to slow down PRs, as multiple re-tests are sometimes needed, so we need to get these tests fixed.