antrea-io / antrea

Kubernetes networking based on Open vSwitch
https://antrea.io
Apache License 2.0

cloud-antrea-gke-conformance-net-policy test failed consistently after bumping up test image to 1.22.8 #3762

Closed tnqn closed 1 year ago

tnqn commented 2 years ago

Describe the bug

The test started failing after https://github.com/antrea-io/antrea/pull/3700 was merged, which made the test use the same image version as the cluster's. The cluster version in GKE is 1.22.8-gke.200, so the test image was updated from 1.18.5 to 1.22.8. Looking at the failures, all of them occurred because creating a Pod failed:

[sig-network] Netpol
/workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/network/common/framework.go:23
  NetworkPolicy between server and client
  /workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/network/netpol/network_policy.go:124
    should deny egress from pods based on PodSelector [Feature:NetworkPolicy]  [BeforeEach]
    /workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/network/netpol/network_policy.go:653

    May 10 10:33:14.733: unable to initialize resources
    Unexpected error:
        <*fmt.wrapError | 0xc004370ce0>: {
            msg: "unable to update pod netpol-4938-x/a: pods \"a\" is forbidden: error looking up service account netpol-4938-x/default: serviceaccount \"default\" not found",
            err: {
                ErrStatus: {
                    TypeMeta: {Kind: "", APIVersion: ""},
                    ListMeta: {
                        SelfLink: "",
                        ResourceVersion: "",
                        Continue: "",
                        RemainingItemCount: nil,
                    },
                    Status: "Failure",
                    Message: "pods \"a\" is forbidden: error looking up service account netpol-4938-x/default: serviceaccount \"default\" not found",
                    Reason: "Forbidden",
                    Details: {Name: "a", Group: "", Kind: "pods", UID: "", Causes: nil, RetryAfterSeconds: 0},
                    Code: 403,
                },
            },
        }
        unable to update pod netpol-4938-x/a: pods "a" is forbidden: error looking up service account netpol-4938-x/default: serviceaccount "default" not found
    occurred

    /workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/network/netpol/network_policy.go:1369

Each build failed on different cases, but all of the failing cases were added by the new NetworkPolicy test framework: https://github.com/kubernetes/kubernetes/pull/91592.

Failed tests:
[sig-network] Netpol NetworkPolicy between server and client should work with Ingress, Egress specified together [Feature:NetworkPolicy]
[sig-network] Netpol [Feature:SCTPConnectivity][LinuxOnly][Disruptive] NetworkPolicy between server and client using SCTP should enforce policy to allow traffic only from a pod in a different namespace based on PodSelector and NamespaceSelector [Feature:NetworkPolicy]
[sig-network] Netpol NetworkPolicy between server and client should allow ingress access on one named port [Feature:NetworkPolicy]
[sig-network] Netpol NetworkPolicy between server and client should enforce policy based on PodSelector or NamespaceSelector [Feature:NetworkPolicy]
[sig-network] Netpol [Feature:SCTPConnectivity][LinuxOnly][Disruptive] NetworkPolicy between server and client using SCTP should enforce policy based on Ports [Feature:NetworkPolicy]
[sig-network] Netpol NetworkPolicy between server and client should enforce multiple egress policies with egress allow-all policy taking precedence [Feature:NetworkPolicy]
[sig-network] Netpol [LinuxOnly] NetworkPolicy between server and client using UDP should support a 'default-deny-ingress' policy [Feature:NetworkPolicy]
[sig-network] Netpol NetworkPolicy between server and client should enforce policy to allow traffic only from a different namespace, based on NamespaceSelector [Feature:NetworkPolicy]
[sig-network] Netpol NetworkPolicy between server and client should support denying of egress traffic on the client side (even if the server explicitly allows this traffic) [Feature:NetworkPolicy]
[sig-network] Netpol NetworkPolicy between server and client should properly isolate pods that are selected by a policy allowing SCTP, even if the plugin doesn't support SCTP [Feature:NetworkPolicy]
[sig-network] Netpol NetworkPolicy between server and client should ensure an IP overlapping both IPBlock.CIDR and IPBlock.Except is allowed [Feature:NetworkPolicy]
[sig-network] Netpol NetworkPolicy between server and client should enforce ingress policy allowing any port traffic to a server on a specific protocol [Feature:NetworkPolicy] [Feature:UDP]
[sig-network] Netpol NetworkPolicy between server and client should support allow-all policy [Feature:NetworkPolicy]
Failed tests (another build):
[sig-network] Netpol NetworkPolicy between server and client should allow egress access to server in CIDR block [Feature:NetworkPolicy]
[sig-network] Netpol NetworkPolicy between server and client should enforce policy based on NamespaceSelector with MatchExpressions [Feature:NetworkPolicy]
[sig-network] Netpol NetworkPolicy between server and client should not mistakenly treat 'protocol: SCTP' as 'protocol: TCP', even if the plugin doesn't support SCTP [Feature:NetworkPolicy]
[sig-network] Netpol NetworkPolicy between server and client should support a 'default-deny-ingress' policy [Feature:NetworkPolicy]
[sig-network] Netpol NetworkPolicy between server and client should allow egress access on one named port [Feature:NetworkPolicy]
[sig-network] Netpol NetworkPolicy between server and client should enforce policy to allow ingress traffic from pods in all namespaces [Feature:NetworkPolicy]

It doesn't seem related to Antrea, because Pod creation was rejected by kube-apiserver when it couldn't find the default service account, which should have been created by kube-controller-manager. But I cannot get the kube-controller-manager logs via kubectl, because the K8s control plane components don't run as Pods in GKE, so I'm not sure what happened.

I tried running the same upstream version, 1.22.8, in my local environment, and the test passed. I'm not sure which customizations were made in 1.22.8-gke.200.

I'm now testing conformance images 1.20.15 (which doesn't have the new NetworkPolicy tests) and 1.21.0 (which does) to see whether the failure is tied to the test image alone.

tnqn commented 2 years ago

I can confirm that test image 1.20.15 passes while 1.21.0 does not, so the failure is tied to the test image.

And while I was testing, GKE happened to update its default K8s version, which is now 1.23.5-gke.2400, and node images with the Docker runtime can no longer be used:

Fetching server config for us-west1-a
+ K8S_VERSION=1.23.5-gke.2400
...
=== Creating a cluster in GKE ===
+ /home/ubuntu/google-cloud-sdk/bin/gcloud container --project antrea clusters create antrea-gke-415 --image-type UBUNTU --machine-type e2-standard-4 --cluster-version 1.23.5-gke.2400 --zone us-west1-a --enable-ip-alias --services-ipv4-cidr 10.94.0.0/16
WARNING: Modifications on the boot disks of node VMs do not persist across node recreations. Nodes are recreated during manual-upgrade, auto-upgrade, auto-repair, and auto-scaling. To preserve modifications across node recreation, use a DaemonSet.
WARNING: Starting with version 1.18, clusters will have shielded GKE nodes by default.
WARNING: The Pod address range limits the maximum size of the cluster. Please refer to https://cloud.google.com/kubernetes-engine/docs/how-to/flexible-pod-cidr to learn how to optimize IP address allocation.
ERROR: (gcloud.container.clusters.create) ResponseError: code=400, message=Creation of node pools using node images based on Docker container runtimes is not supported in GKE v1.23. This is to prepare for the removal of Dockershim in Kubernetes v1.24. We recommend that you migrate to image types based on Containerd (examples). For more information, contact Cloud Support.
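One way to address the error above (a sketch only; I haven't verified whether this is what #3771 does) is to switch the node pool to a Containerd-based image type, e.g. UBUNTU_CONTAINERD, which is GKE's Containerd counterpart to UBUNTU:

```shell
# Same cluster-creation command as in the log above, with the Docker-based
# UBUNTU image type replaced by its Containerd counterpart. Cluster name,
# zone, and CIDR are taken from the failing invocation.
gcloud container --project antrea clusters create antrea-gke-415 \
  --image-type UBUNTU_CONTAINERD \
  --machine-type e2-standard-4 \
  --cluster-version 1.23.5-gke.2400 \
  --zone us-west1-a \
  --enable-ip-alias \
  --services-ipv4-cidr 10.94.0.0/16
```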

Created #3771 as a temporary fix.

tnqn commented 2 years ago

Since about July 1, GKE's default cluster version has been v1.24.1, which no longer creates a Secret for each service account automatically, so the verification in the old test cases fails: https://github.com/kubernetes/kubernetes/blob/8f1e5bf0b9729a899b8df86249b56e2c74aebc55/test/e2e/framework/util.go#L294 https://jenkins.antrea-ci.rocks/view/cloud/job/cloud-antrea-gke-conformance-net-policy/449/console
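For context on the v1.24 change: the long-lived token Secret is no longer auto-generated, but one can still be requested explicitly by creating a Secret of type `kubernetes.io/service-account-token`, which is what anything relying on the old behavior now has to do itself. A minimal manifest sketch (the Secret name and namespace here are hypothetical examples):

```yaml
# Manually request a token Secret for the "default" ServiceAccount,
# replacing the Secret that K8s <= v1.23 auto-created.
apiVersion: v1
kind: Secret
metadata:
  name: default-token      # hypothetical name
  namespace: netpol-test   # hypothetical namespace
  annotations:
    kubernetes.io/service-account.name: default
type: kubernetes.io/service-account-token
```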

We would need to either pin the K8s version to < v1.24.0, or figure out why the new netpol test suite fails to create Pods on GKE because of the missing serviceaccount.

antoninbas commented 2 years ago

@tnqn it seems to me that the issue is with the netpol tests themselves

The netpol tests use their own createNamespace function for Namespace creation: https://github.com/kubernetes/kubernetes/blob/5ac563c507cd75c9382a2a23a3c8e3452138a021/test/e2e/network/netpol/kubemanager.go#L177-L185

While the K8s e2e test framework already provides a robust Namespace creation function: https://github.com/kubernetes/kubernetes/blob/4569e646ef161c0262d433aed324fec97a525572/test/e2e/framework/util.go#L352-L400

The function comment on CreateTestingNS states that:

CreateTestingNS should be used by every test

and the function includes a check to wait for default ServiceAccount creation:

    if TestContext.VerifyServiceAccount {
        if err := WaitForDefaultServiceAccountInNamespace(c, got.Name); err != nil {
            // Even if we fail to create serviceAccount in the namespace,
            // we have successfully create a namespace.
            // So, return the created namespace.
            return got, err
        }
    }

IMO, the netpol tests should be updated to use this function or, if that is impractical, the above wait should be duplicated in the netpol-specific createNamespace function. Let me know what you think.

tnqn commented 2 years ago

@antoninbas your analysis makes a lot of sense to me; now I understand why it failed in the netpol tests only. Would you create a fix upstream? If so, I think we could pin the cluster version in the GKE test for now and unset both the conformance and cluster versions once GKE's default cluster version includes the fix.

antoninbas commented 2 years ago

@tnqn I searched through the K8s repo, and realized that there was an issue for this already: https://github.com/kubernetes/kubernetes/issues/108298

It also has a corresponding PR. Let me see if I can get a status update.

tnqn commented 2 years ago

@antoninbas since we know the issue is in the NetPol tests, I guess for now we could skip them for GKE and set KUBE_CONFORMANCE_IMAGE_VERSION to auto, so that we test the default GKE cluster version with the corresponding conformance version?
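For reference, skipping the netpol suite typically comes down to a Ginkgo skip regex on the conformance run; a sketch (how the flag is plumbed through our Jenkins job's conformance wrapper is an assumption):

```
# Hypothetical invocation: exclude all NetworkPolicy-tagged specs.
# --ginkgo.skip is a standard flag of the upstream e2e.test binary.
e2e.test --ginkgo.skip='\[Feature:NetworkPolicy\]'
```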

antoninbas commented 2 years ago

@tnqn yes sounds good to me. Meanwhile I will work on the upstream PR, but it could take some time

antoninbas commented 2 years ago

Submitted a PR upstream for this: https://github.com/kubernetes/kubernetes/pull/111789

antoninbas commented 2 years ago

The PR was merged upstream and will be in K8s v1.26

github-actions[bot] commented 1 year ago

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment, or this will be closed in 90 days

antoninbas commented 1 year ago

Keeping this open until K8s v1.26 is the default version in GKE, at which point we will no longer need to skip the Netpol tests
