antrea-io / antrea

Kubernetes networking based on Open vSwitch
https://antrea.io
Apache License 2.0

[sig-apps] Conformance tests being skipped. #785

Closed jayunit100 closed 4 years ago

jayunit100 commented 4 years ago

Describe the bug

It looks like the upstream Conformance test suites will need some looking into - some tests are failing (most pass) when running the full suite.

The reason these failures might be new is that, I guess, we are skipping sig-apps tests in CI...?

[Fail] [sig-apps] StatefulSet [k8s.io] Basic StatefulSet functionality [StatefulSetBasic] [It] should perform rolling updates and roll backs of template modifications [Conformance]
/workspace/anago-v1.18.2-beta.0.14+a78cd082e8c913/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/framework/statefulset/wait.go:74
[Fail] [sig-apps] StatefulSet [k8s.io] Basic StatefulSet functionality [StatefulSetBasic] [It] should perform canary updates and phased rolling updates of template modifications [Conformance]
/workspace/anago-v1.18.2-beta.0.14+a78cd082e8c913/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/framework/statefulset/wait.go:58
[Fail] [k8s.io] Container Runtime blackbox test when running a container with a new image [It] should be able to pull from private registry with secret [NodeConformance]
/workspace/anago-v1.18.2-beta.0.14+a78cd082e8c913/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/common/runtime.go:366

This one seems like it might be flaky in certain Antrea clusters, but I'm not sure.

[Fail] [sig-cli] Kubectl client Guestbook application [It] should create and stop a working application  [Conformance]
/workspace/anago-v1.18.2-beta.0.14+a78cd082e8c913/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/kubectl/kubectl.go:1863

To Reproduce

sonobuoy run --e2e-focus "Conformance" --wait=600 --plugin e2e

Expected

Conformance should pass :)

Actual behavior

The above 4 tests fail.

Versions: 0.7.0

rosskukulinski commented 4 years ago

https://github.com/cncf/k8s-conformance/blob/master/instructions.md

Deploy a Sonobuoy pod to your cluster with:

$ sonobuoy run --mode=certified-conformance

NOTE: The --mode=certified-conformance flag is required for certification runs since Kubernetes v1.16 (and Sonobuoy v0.16). Without this flag, tests which may be disruptive to your other workloads may be skipped. A valid certification run may not skip any conformance tests. If you're setting the test focus/skip values manually, certification runs require E2E_FOCUS=[Conformance] and no value for E2E_SKIP.
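
For reference, a certification run versus a manually focused run would look roughly like this (a sketch only; the --e2e-skip flag and its empty value are assumed from the Sonobuoy docs linked above):

$ sonobuoy run --mode=certified-conformance --wait
$ sonobuoy run --e2e-focus='\[Conformance\]' --e2e-skip='' --wait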

jayunit100 commented 4 years ago

So now these tests are passing for me :) - it might have been flakiness. [EDIT: see below; 3 of the 4 actually still fail.]

rosskukulinski commented 4 years ago

Regardless, I think the antrea e2e test suite should be validating against a complete certified-conformance test run. @McCodeman

jayunit100 commented 4 years ago

Yup, I'll be working with @antoninbas to figure out why we skip the others and get this sorted. I'll spend more time characterizing these failures and confirming whether they are related to my infra, Antrea, or both... I've kicked off a new full test run with new infra, new CIDRs, and so on, and will see what the results are.

antoninbas commented 4 years ago

We run a select subset of conformance tests for every PR, and we try to keep the time it takes to run the tests to around 20 minutes. If running the full suite takes much longer on our infrastructure, we will probably not run it for every PR, but we can add a separate daily Jenkins job.

jayunit100 commented 4 years ago

Hi folks. OK... I confirmed that 3 out of the 4 tests I originally reported break pretty consistently. Here are results from a Kind cluster that I also ran these on.

Antonin, I think a fast path forward would be:

(1) As a first priority, add a --ginkgo.focus="Basic StatefulSet functionality" run to the Antrea job; it would give you rapid signal as you mentioned, and it could run for every PR (see the example after this list).

(2) Longer term, we can also add a full conformance suite job that runs nightly.
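
For (1), a focused run could look something like the following (the Sonobuoy form matches the invocation used elsewhere in this thread; the direct e2e.test form is a sketch and its flags are assumed):

sonobuoy run --e2e-focus "Basic StatefulSet functionality" --wait=600 --plugin e2e

or, with the upstream e2e.test binary directly:

./e2e.test --ginkgo.focus="Basic StatefulSet functionality" --kubeconfig=$KUBECONFIG --provider=local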

[Fail] [sig-apps] StatefulSet [k8s.io] Basic StatefulSet functionality [StatefulSetBasic] [It] should perform canary updates and phased rolling updates of template modifications [Conformance] 
/workspace/anago-v1.17.0-rc.2.10+70132b0f130acc/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/framework/statefulset/wait.go:113

[Fail] [sig-apps] Daemon set [Serial] [It] should rollback without unnecessary restarts [Conformance] 
/workspace/anago-v1.17.0-rc.2.10+70132b0f130acc/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/apps/daemon_set.go:417

[Fail] [sig-apps] StatefulSet [k8s.io] Basic StatefulSet functionality [StatefulSetBasic] [It] should perform rolling updates and roll backs of template modifications [Conformance] 
/workspace/anago-v1.17.0-rc.2.10+70132b0f130acc/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/framework/statefulset/wait.go:129

jayunit100 commented 4 years ago

To go further into the details here, it seems like what's happening is that these StatefulSet health checks, which I think involve Node->Pod connectivity, don't turn green even after certain image updates. I'm not sure how this could be Antrea's fault, but I am digging more now.


Jun  4 10:28:35.179: INFO: Updating stateful set ss

STEP: Rolling back update in reverse ordinal order
Jun  4 10:28:45.198: INFO: Running '/usr/local/bin/kubectl --kubeconfig=/tmp/kubeconfig-756192692 exec --namespace=statefulset-372 ss-1 -- /bin/sh -x -c mv -v /tmp/index.html /usr/local/apache2/htdocs/ || true'
Jun  4 10:28:45.338: INFO: stderr: "+ mv -v /tmp/index.html /usr/local/apache2/htdocs/\n"
Jun  4 10:28:45.338: INFO: stdout: "'/tmp/index.html' -> '/usr/local/apache2/htdocs/index.html'\n"
Jun  4 10:28:45.338: INFO: stdout of mv -v /tmp/index.html /usr/local/apache2/htdocs/ || true on ss-1: '/tmp/index.html' -> '/usr/local/apache2/htdocs/index.html'

Jun  4 10:38:45.350: FAIL: Failed waiting for state update: timed out waiting for the condition

tnqn commented 4 years ago

@jayunit100 thanks for finding this and digging into it! I tried to reproduce this and found error logs in antrea-agent:

interface_configuration_linux.go:112] Failed to find container interface eth0 in ns /host/proc/3387/ns/net

This error indicates that GARPs were not sent. I think it may explain why the tests were flaky: only when enough tests had run in a cluster that some IPs were reused before the Node's ARP cache entries expired would the Node fail to reach those Pods.

antrea-agent is supposed to send the GARPs, but I think the routine was broken when refactoring those methods for Windows support in 0.7.0: https://github.com/vmware-tanzu/antrea/commit/8bd2df81d2729dff3b88641a7f529f3c7920fb75. I'm pretty sure this is an issue, but I'm not sure whether it's the only one causing these failures. Anyway, I have created #796 to fix the GARP problem first; I will do more rounds of tests to check whether the tests can succeed consistently with that fix.
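
For anyone trying to confirm the stale-ARP theory on a Node, one way is to inspect the ARP cache on the Antrea gateway interface while a failing StatefulSet test is running (a sketch; the gateway interface name gw0 is assumed for this release):

ip neigh show dev gw0

A reused Pod IP that still resolves to the previous Pod's MAC address, and is not refreshed when the new Pod comes up, would match the missing-GARP explanation above.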

jayunit100 commented 4 years ago

Great, awesome, thanks tnqn. Keep me posted and let me know how I can help!

jayunit100 commented 4 years ago

@tnqn on a local cluster, can you run:

sonobuoy run --e2e-focus "Basic StatefulSet functionality" --wait=600 --plugin e2e
watch kubectl get pods -n sonobuoy

That should quickly give you an indication - or alternatively, just point me at a Docker Hub image to test and I'll swap it out in one of my clusters.

tnqn commented 4 years ago

@jayunit100 Appreciate your help! I cannot reproduce the failures locally when focusing on "Basic StatefulSet functionality"; maybe that's because the Pod CIDR of my Node is a /24, so no IPs were reused with only a few tests. I'm running with "--mode=certified-conformance", but it will take time to confirm that the success is stable. Since you can reproduce it in your cluster consistently, could you help verify that the failures are gone with this image: qtian/antrea-ubuntu:0.7.0-460ff4d?
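
One possible way to swap in that test image without editing the whole manifest, assuming the default antrea-agent DaemonSet and container names from the standard deployment YAML (adjust to your manifest as needed):

kubectl -n kube-system set image daemonset/antrea-agent antrea-agent=qtian/antrea-ubuntu:0.7.0-460ff4d antrea-ovs=qtian/antrea-ubuntu:0.7.0-460ff4d
kubectl -n kube-system rollout status daemonset/antrea-agent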

tnqn commented 4 years ago

I have run certified-conformance mode in two local clusters.

Perhaps it's due to a kernel difference that the newer one is more robust in handling the ARP cache.

@jayunit100 if you get time to run the tests in your cluster, please apply the latest YAML directly; the "latest" image has the fix:

kubectl apply -f https://raw.githubusercontent.com/vmware-tanzu/antrea/master/build/yamls/antrea.yml

jayunit100 commented 4 years ago
Summarizing 1 Failure:

[Fail] [sig-apps] StatefulSet [k8s.io] Basic StatefulSet functionality [StatefulSetBasic] [It] should perform canary updates and phased rolling updates of template modifications [Conformance]
/workspace/anago-v1.18.1-beta.0.38+49aac775931dd1/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/framework/statefulset/wait.go:74

Ran 1 of 4992 Specs in 1851.695 seconds
FAIL! -- 0 Passed | 1 Failed | 0 Pending | 4991 Skipped
--- FAIL: TestE2E (1851.77s)
FAIL

0.7.1 definitely still fails for me; I'll try applying the master YAML.

jayunit100 commented 4 years ago

Moved detailed comments to the other issue above; there definitely seems to be a real bug here around StatefulSet IPs and restarts.

tnqn commented 4 years ago

The failure only appeared when using containerd as the CRI, not Docker; that's why @antoninbas and I couldn't reproduce it locally. I think this is the root cause of the failure: the test created a StatefulSet and deleted a Pod with a 0-second graceful period. Since kubelet deletes Pods asynchronously, and the StatefulSet controller creates a new Pod with the same name as soon as it sees the previous Pod removed from the Kubernetes API, there can be two Pods with the same name but different UIDs being handled by kubelet simultaneously because of the 0-second graceful period.

With Docker as the CRI, the previous Pod was deleted quickly and both CNI Del calls happened before the new Pod's CNI Add call, so everything was fine.

Jun 10 03:16:08 k8s-01 kubelet[23894]: I0610 03:16:08.253367   23894 kubelet.go:1923] SyncLoop (DELETE, "api"): "ss2-0_statefulset-6387(7788acc0-7932-4a54-8831-93efd56e79ea)"
Jun 10 03:16:08 k8s-01 kubelet[23894]: I0610 03:16:08.271116   23894 kubelet.go:1917] SyncLoop (REMOVE, "api"): "ss2-0_statefulset-6387(7788acc0-7932-4a54-8831-93efd56e79ea)"
Jun 10 03:16:08 k8s-01 kubelet[23894]: I0610 03:16:08.271181   23894 kubelet_pods.go:1117] Killing unwanted pod "ss2-0"

Jun 10 03:16:08 k8s-01 kubelet[23894]: I0610 03:16:08.292156   23894 config.go:412] Receiving a new pod "ss2-0_statefulset-6387(a3cf65d7-d001-499b-9b3b-1a80200ce552)"
Jun 10 03:16:08 k8s-01 kubelet[23894]: I0610 03:16:08.292672   23894 kubelet.go:1907] SyncLoop (ADD, "api"): "ss2-0_statefulset-6387(a3cf65d7-d001-499b-9b3b-1a80200ce552)"
Jun 10 03:16:08 k8s-01 kubelet[23894]: I0610 03:16:08.316159   23894 kubelet.go:1920] SyncLoop (RECONCILE, "api"): "ss2-0_statefulset-6387(a3cf65d7-d001-499b-9b3b-1a80200ce552)"

Jun 10 03:16:08 k8s-01 kubelet[23894]: I0610 03:16:08.420122   23894 plugins.go:420] Calling network plugin cni to tear down pod "ss2-0_statefulset-6387"
Jun 10 03:16:08 k8s-01 kubelet[23894]: I0610 03:16:08.423441   23894 cni.go:380] Deleting statefulset-6387_ss2-0/138ee1517bd604ace90997d46f3d0bec4e3e0daa5f6242e18fb4058727068e50 from network antrea/antrea netns "/proc/27057/ns/net
Jun 10 03:16:08 k8s-01 kubelet[23894]: I0610 03:16:08.460943   23894 cni.go:388] Deleted statefulset-6387_ss2-0/138ee1517bd604ace90997d46f3d0bec4e3e0daa5f6242e18fb4058727068e50 from network antrea/antrea
Jun 10 03:16:08 k8s-01 kubelet[23894]: I0610 03:16:08.461050   23894 plugins.go:420] Calling network plugin cni to tear down pod "ss2-0_statefulset-6387"
Jun 10 03:16:08 k8s-01 kubelet[23894]: I0610 03:16:08.475793   23894 cni.go:380] Deleting statefulset-6387_ss2-0/138ee1517bd604ace90997d46f3d0bec4e3e0daa5f6242e18fb4058727068e50 from network antrea/antrea netns "/proc/27057/ns/net"
Jun 10 03:16:08 k8s-01 kubelet[23894]: I0610 03:16:08.492881   23894 cni.go:388] Deleted statefulset-6387_ss2-0/138ee1517bd604ace90997d46f3d0bec4e3e0daa5f6242e18fb4058727068e50 from network antrea/antrea

Jun 10 03:16:09 k8s-01 kubelet[23894]: I0610 03:16:09.264040   23894 plugins.go:406] Calling network plugin cni to set up pod "ss2-0_statefulset-6387"
Jun 10 03:16:09 k8s-01 kubelet[23894]: I0610 03:16:09.266223   23894 cni.go:361] Adding statefulset-6387_ss2-0/dddb03ad84cb986e81be41b82aa0feb78f46eeaa8abe6625fb52bb720748cebe to network antrea/antrea netns "/proc/28953/ns/net"

With containerd as the CRI, there were multiple CNI Del calls, and the second of them could come after the new Pod's CNI Add call; antrea-agent then deleted the network interface, whose name was computed from the Pod namespace + Pod name, and that caused the networking issue.

Jun 10 05:20:31 k8s-01 kubelet[25124]: I0610 05:20:31.538278   25124 kubelet.go:1923] SyncLoop (DELETE, "api"): "ss2-0_statefulset-1624(3e9792d1-da6b-4190-9b64-923518be5b0c)"
Jun 10 05:20:31 k8s-01 kubelet[25124]: I0610 05:20:31.549780   25124 kubelet.go:1917] SyncLoop (REMOVE, "api"): "ss2-0_statefulset-1624(3e9792d1-da6b-4190-9b64-923518be5b0c)"
Jun 10 05:20:31 k8s-01 kubelet[25124]: I0610 05:20:31.549856   25124 kubelet_pods.go:1117] Killing unwanted pod "ss2-0"

Jun 10 05:20:31 k8s-01 kubelet[25124]: I0610 05:20:31.574788   25124 config.go:412] Receiving a new pod "ss2-0_statefulset-1624(5a185481-be38-4768-82a9-7db405e48ca0)"
Jun 10 05:20:31 k8s-01 kubelet[25124]: I0610 05:20:31.575517   25124 config.go:303] Setting pods for source api
Jun 10 05:20:31 k8s-01 kubelet[25124]: I0610 05:20:31.575599   25124 kubelet.go:1907] SyncLoop (ADD, "api"): "ss2-0_statefulset-1624(5a185481-be38-4768-82a9-7db405e48ca0)"
Jun 10 05:20:31 k8s-01 kubelet[25124]: I0610 05:20:31.624911   25124 kubelet.go:1920] SyncLoop (RECONCILE, "api"): "ss2-0_statefulset-1624(5a185481-be38-4768-82a9-7db405e48ca0)"

Jun 10 05:20:31 k8s-01 kernel: [450796.979885] device ss2-0-2cff28 left promiscuous mode
Jun 10 05:20:31 k8s-01 containerd[21612]: time="2020-06-10T05:20:31.708232740-07:00" level=info msg="TearDown network for sandbox "11befc0731961911f65e487c1258b10a40b269c1b51a5b7c09cae98ead31d6a9" successfully"
Jun 10 05:20:31 k8s-01 containerd[21612]: time="2020-06-10T05:20:31.799667058-07:00" level=info msg="StopPodSandbox for "11befc0731961911f65e487c1258b10a40b269c1b51a5b7c09cae98ead31d6a9" returns successfully"

Jun 10 05:20:31 k8s-01 containerd[21612]: time="2020-06-10T05:20:31.886234108-07:00" level=info msg="RunPodsandbox for &PodSandboxMetadata{Name:ss2-0,Uid:5a185481-be38-4768-82
a9-7db405e48ca0,Namespace:statefulset-1624,Attempt:0,}"
Jun 10 05:20:31 k8s-01 kernel: [450797.209545] IPVS: Creating netns size=2192 id=614
Jun 10 05:20:32 k8s-01 containerd[21612]: time="2020-06-10T05:20:32.827598301-07:00" level=info msg="RunPodSandbox for &PodSandboxMetadata{Name:ss2-0,Uid:5a185481-be38-4768-82a9-7db405e48ca0,Namespace:statefulset-1624,Attempt:0,} returns sandbox id "8ca331e5b4687d2bada2124737503160b179c2ea3d7aab754bbbb1ee2442d0cc""

Jun 10 05:20:33 k8s-01 containerd[21612]: time="2020-06-10T05:20:33.437023987-07:00" level=info msg="TearDown network for sandbox "11befc0731961911f65e487c1258b10a40b269c1b51a5b7c09cae98ead31d6a9" successfully"
Jun 10 05:20:33 k8s-01 containerd[21612]: time="2020-06-10T05:20:33.437104796-07:00" level=info msg="StopPodSandbox for "11befc0731961911f65e487c1258b10a40b269c1b51a5b7c09cae98ead31d6a9" returns successfully"

This might not be a new issue in 0.7; it might apply to previous versions as well when running with containerd.
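
For illustration only (a simplified sketch in Go, not the actual Antrea code), the collision can be seen by deriving a host-side interface name from the Pod namespace + name alone, versus also including the unique sandbox/container ID; the latter (or checking the container ID before deleting) is one way to disambiguate the old and recreated Pods:

package main

import (
	"crypto/sha1"
	"encoding/hex"
	"fmt"
)

// hostIfaceName hashes the given parts and keeps a short prefix so the
// resulting name fits within the Linux interface-name length limit.
func hostIfaceName(parts ...string) string {
	h := sha1.New()
	for _, p := range parts {
		h.Write([]byte(p))
	}
	return hex.EncodeToString(h.Sum(nil))[:8]
}

func main() {
	// Derived only from namespace + name: identical for the old and the
	// recreated Pod, so a late CNI Del for the old sandbox removes the
	// new Pod's interface.
	fmt.Println(hostIfaceName("statefulset-1624", "ss2-0"))
	fmt.Println(hostIfaceName("statefulset-1624", "ss2-0"))

	// Including the (unique) sandbox/container ID disambiguates the two.
	fmt.Println(hostIfaceName("statefulset-1624", "ss2-0", "11befc0731961911"))
	fmt.Println(hostIfaceName("statefulset-1624", "ss2-0", "8ca331e5b4687d2b"))
}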

jayunit100 commented 4 years ago

OK, gotcha. Is there a quick fix for this? Maybe some duct tape we can put in for 0.7.2, which can be more elegantly fixed later?

tnqn commented 4 years ago

@jayunit100 I'm working on it; I will update you once we have a proper fix.

jayunit100 commented 4 years ago

It looks like https://github.com/vmware-tanzu/antrea/pull/827 will fix the bug underlying the motivation for this issue, but I guess we should leave this issue open until conformance is running nightly.

tnqn commented 4 years ago

I have updated "Fixes" to "For" so it won't close this one.

antoninbas commented 4 years ago

We have another issue to track the addition of daily full conformance testsuite jobs (#819), so feel free to close this one when #827 is merged.