Closed tnqn closed 1 year ago
I can confirm that test image 1.20.15 passes while 1.21.0 does not, so it is related to the test image.
And while I was testing, GKE happened to update the default K8s version: it is now 1.23.5-gke.2400, and node images with the Docker runtime can no longer be used:
Fetching server config for us-west1-a
+ K8S_VERSION=1.23.5-gke.2400
...
=== Creating a cluster in GKE ===
+ /home/ubuntu/google-cloud-sdk/bin/gcloud container --project antrea clusters create antrea-gke-415 --image-type UBUNTU --machine-type e2-standard-4 --cluster-version 1.23.5-gke.2400 --zone us-west1-a --enable-ip-alias --services-ipv4-cidr 10.94.0.0/16
WARNING: Modifications on the boot disks of node VMs do not persist across node recreations. Nodes are recreated during manual-upgrade, auto-upgrade, auto-repair, and auto-scaling. To preserve modifications across node recreation, use a DaemonSet.
WARNING: Starting with version 1.18, clusters will have shielded GKE nodes by default.
WARNING: The Pod address range limits the maximum size of the cluster. Please refer to https://cloud.google.com/kubernetes-engine/docs/how-to/flexible-pod-cidr to learn how to optimize IP address allocation.
ERROR: (gcloud.container.clusters.create) ResponseError: code=400, message=Creation of node pools using node images based on Docker container runtimes is not supported in GKE v1.23. This is to prepare for the removal of Dockershim in Kubernetes v1.24. We recommend that you migrate to image types based on Containerd (examples). For more information, contact Cloud Support.
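A likely workaround, assuming the same cluster parameters as above, is to switch to a containerd-based image type; UBUNTU_CONTAINERD is GKE's containerd variant of the Ubuntu image type (sketch only, not the exact command used in CI):

```shell
# Same cluster parameters as the failing invocation, but with a
# containerd-based node image instead of the Docker-based UBUNTU type.
gcloud container clusters create antrea-gke-415 \
  --project antrea \
  --image-type UBUNTU_CONTAINERD \
  --machine-type e2-standard-4 \
  --cluster-version 1.23.5-gke.2400 \
  --zone us-west1-a \
  --enable-ip-alias \
  --services-ipv4-cidr 10.94.0.0/16
```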
Created #3771 as a temporary fix.
Since about July 1, GKE's default cluster version has been v1.24.1, which no longer creates a Secret for each ServiceAccount automatically, so the verification in old test cases fails: https://github.com/kubernetes/kubernetes/blob/8f1e5bf0b9729a899b8df86249b56e2c74aebc55/test/e2e/framework/util.go#L294 https://jenkins.antrea-ci.rocks/view/cloud/job/cloud-antrea-gke-conformance-net-policy/449/console
We would need to either pin the K8s version to < v1.24.0, or figure out why the new netpol test suite on GKE fails to create Pods because of a missing ServiceAccount.
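For context, since v1.24 a ServiceAccount token Secret is no longer auto-generated and has to be requested or created explicitly; the namespace and Secret names below are illustrative:

```shell
# Short-lived token via the TokenRequest API (kubectl v1.24+):
kubectl create token default -n my-namespace

# Or a long-lived token Secret, which the control plane populates
# for the annotated ServiceAccount:
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Secret
metadata:
  name: default-token
  namespace: my-namespace
  annotations:
    kubernetes.io/service-account.name: default
type: kubernetes.io/service-account-token
EOF
```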
@tnqn it seems to me that the issue is with the netpol tests themselves.
The netpol tests use their own createNamespace function for Namespace creation:
https://github.com/kubernetes/kubernetes/blob/5ac563c507cd75c9382a2a23a3c8e3452138a021/test/e2e/network/netpol/kubemanager.go#L177-L185
While the K8s e2e test framework already provides a robust Namespace creation function: https://github.com/kubernetes/kubernetes/blob/4569e646ef161c0262d433aed324fec97a525572/test/e2e/framework/util.go#L352-L400
The comment on CreateTestingNS states that "CreateTestingNS should be used by every test", and the function includes a check to wait for creation of the default ServiceAccount:
if TestContext.VerifyServiceAccount {
    if err := WaitForDefaultServiceAccountInNamespace(c, got.Name); err != nil {
        // Even if we fail to create serviceAccount in the namespace,
        // we have successfully create a namespace.
        // So, return the created namespace.
        return got, err
    }
}
IMO, the netpol tests should be updated to use this function or, if this is impractical, the above wait should be duplicated in the netpol-specific createNamespace function. Let me know what you think.
@antoninbas your analysis makes a lot of sense to me. Now I can understand why it failed in netpol tests only. Would you create a fix upstream? If so, I think we could set cluster version in GKT test for now and unset both conformance and cluster versions after GKE's default cluster version bumps up to a release including the fix.
@tnqn I searched through the K8s repo, and realized that there was an issue for this already: https://github.com/kubernetes/kubernetes/issues/108298
It also has a corresponding PR. Let me see if I can get a status update.
@antoninbas since we know the issue is in the NetPol tests, I guess for now we could skip them for GKE and set KUBE_CONFORMANCE_IMAGE_VERSION to auto, to test the default GKE cluster version with the corresponding conformance version?
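As a sketch, skipping the NetPol suite in a conformance run typically comes down to a ginkgo skip regex; the exact flag plumbing depends on how the CI invokes the upstream e2e.test binary:

```shell
# Illustrative only: keep the Conformance suite but skip
# NetworkPolicy cases; flag names are those of upstream e2e.test.
e2e.test --ginkgo.focus='\[Conformance\]' --ginkgo.skip='NetworkPolicy'
```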
@tnqn yes, sounds good to me. Meanwhile I will work on the upstream PR, but it could take some time.
Submitted a PR upstream for this: https://github.com/kubernetes/kubernetes/pull/111789
The PR was merged upstream and will be in K8s v1.26.
This issue is stale because it has been open 90 days with no activity. Remove stale label or comment, or this will be closed in 90 days
Keeping this open until K8s v1.26 is the default version in GKE, at which point we will no longer need to skip the NetPol tests.
Describe the bug
The test started failing after merging https://github.com/antrea-io/antrea/pull/3700, which made the test use the same image version as the cluster's version. The cluster in GKE is 1.22.8-gke.200, so the test image was updated from 1.18.5 to 1.22.8. Looking at the failures, all of them were caused by Pod creation failing:
Each build failed on different cases, but all of them were added by the new NetworkPolicy framework: https://github.com/kubernetes/kubernetes/pull/91592.
It doesn't seem related to Antrea, because Pod creation was rejected by kube-apiserver when it couldn't find the ServiceAccount, which should be created by kube-controller-manager. But I cannot get kube-controller-manager logs via kubectl because K8s control plane components don't run as Pods in GKE, so I'm not sure what happened.
I tried running the same upstream version, 1.22.8, in my local env, and the test passed. Not sure which customization was made in 1.22.8-gke.200.
I'm trying to test conformance images 1.20.15 (which doesn't have the new NetworkPolicy tests) and 1.21.0 (which does) to see if it's just related to the test image.