Issue-Label Bot is automatically applying the labels:
Label | Probability
---|---
kind/bug | 0.93
area/engprod | 0.89
platform/gcp | 0.68
Here's the error:
INFO|2020-10-29T09:40:52|/workspace/testing-repo/py/kubeflow/testing/util.py|72| Error from server (InternalError): error when applying patch:
INFO|2020-10-29T09:40:52|/workspace/testing-repo/py/kubeflow/testing/util.py|72| {"metadata":{"annotations":{"kubectl.kubernetes.io/last-applied-configuration":"{\"apiVersion\":\"serviceusage.cnrm.cloud.google.com/v1beta1\",\"kind\":\"Service\",\"metadata\":{\"annotations\":{\"cnrm.cloud.google.com/deletion-policy\":\"abandon\",\"cnrm.cloud.google.com/disable-dependent-services\":\"false\"},\"labels\":{\"app.kubernetes.io/managed-by\":\"configmanagement.gke.io\",\"auto-deploy-base-name\":\"kf-ci\",\"auto-deploy-group\":\"gcp-blueprint-master\",\"kf-name\":\"kf-vbp-1029-99d\",\"tekton.dev/pipeline\":\"deploy-gcp-blueprint\",\"tekton.dev/pipelineRun\":\"deploy-kf-master-txrwf\",\"tekton.dev/pipelineTask\":\"deploy-gcp\",\"tekton.dev/task\":\"deploy-gcp-blueprint\",\"tekton.dev/taskRun\":\"deploy-kf-master-txrwf-deploy-gcp-ng9wf\"},\"name\":\"anthos.googleapis.com\",\"namespace\":\"kubeflow-ci-deployment\"}}\n"},"labels":{"auto-deploy-base-name":"kf-ci","blueprint-repo-commit":null,"kf-name":"kf-vbp-1029-99d","tekton.dev/pipelineRun":"deploy-kf-master-txrwf","tekton.dev/taskRun":"deploy-kf-master-txrwf-deploy-gcp-ng9wf"}}}
INFO|2020-10-29T09:40:52|/workspace/testing-repo/py/kubeflow/testing/util.py|72| to:
INFO|2020-10-29T09:40:52|/workspace/testing-repo/py/kubeflow/testing/util.py|72| Resource: "serviceusage.cnrm.cloud.google.com/v1beta1, Resource=services", GroupVersionKind: "serviceusage.cnrm.cloud.google.com/v1beta1, Kind=Service"
INFO|2020-10-29T09:40:52|/workspace/testing-repo/py/kubeflow/testing/util.py|72| Name: "anthos.googleapis.com", Namespace: "kubeflow-ci-deployment"
Based on the test logs, the management cluster is gke_kubeflow-ci_us-central1_kf-ci-management.
The cnrm-webhook-manager pod is reported as unhealthy:
kubectl --context=kubeflow-ci-management -n cnrm-system describe pods cnrm-webhook-manager-544bfccb5d-ts5fk
Name: cnrm-webhook-manager-544bfccb5d-ts5fk
Namespace: cnrm-system
Priority: 0
Node: gke-kf-ci-management-kf-ci-management-d6b2e6ad-jgkd/10.128.0.49
Start Time: Mon, 19 Oct 2020 01:32:50 -0700
Labels: cnrm.cloud.google.com/component=cnrm-webhook-manager
cnrm.cloud.google.com/system=true
pod-template-hash=544bfccb5d
Annotations: cnrm.cloud.google.com/version: 1.9.1
Status: Running
IP: 10.20.1.8
Controlled By: ReplicaSet/cnrm-webhook-manager-544bfccb5d
Containers:
webhook:
Container ID: docker://87d89a1ddfc88138500f220c3103c0f145b149f334cdafee11a1bef8e98feec8
Image: gcr.io/cnrm-eap/webhook:97b6128
Image ID: docker-pullable://gcr.io/cnrm-eap/webhook@sha256:4d727354e9cde8efeafbaba16ef79766b1b1a92df3787df59f933b59b0eb1d61
Port: <none>
Host Port: <none>
Command:
/configconnector/webhook
Args:
--stderrthreshold=INFO
State: Running
Started: Mon, 19 Oct 2020 01:33:07 -0700
Ready: False
Restart Count: 0
Limits:
cpu: 40m
memory: 64Mi
Requests:
cpu: 20m
memory: 32Mi
Readiness: exec [cat /tmp/ready] delay=3s timeout=1s period=3s #success=1 #failure=3
Environment:
NAMESPACE: cnrm-system (v1:metadata.namespace)
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from cnrm-webhook-manager-token-b6f7k (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
cnrm-webhook-manager-token-b6f7k:
Type: Secret (a volume populated by a Secret)
SecretName: cnrm-webhook-manager-token-b6f7k
Optional: false
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning Unhealthy 103s (x41111 over 9d) kubelet, gke-kf-ci-management-kf-ci-management-d6b2e6ad-jgkd Readiness probe failed: OCI runtime exec failed: exec failed: cannot exec a container that has stopped: unknown
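Before recycling the pod, the webhook container's logs might show why the probe's exec target stopped. Something like this should work (a sketch; the container name `webhook` comes from the describe output above):

```
# Tail the webhook container's recent logs to look for a crash or shutdown message
kubectl --context=kubeflow-ci-management -n cnrm-system \
  logs cnrm-webhook-manager-544bfccb5d-ts5fk -c webhook --tail=50
```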
I deleted the pod and the ReplicaSet brought up a new one.
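For reference, deleting it looks like this (pod name from the describe output above; the ReplicaSet recreates the pod automatically):

```
# Delete the unhealthy webhook pod; the ReplicaSet will schedule a replacement
kubectl --context=kubeflow-ci-management -n cnrm-system \
  delete pod cnrm-webhook-manager-544bfccb5d-ts5fk
```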
kubectl --context=kubeflow-ci-management -n cnrm-system get pods
NAME READY STATUS RESTARTS AGE
cnrm-controller-manager-0 1/1 Running 0 10d
cnrm-deletiondefender-0 1/1 Running 0 10d
cnrm-resource-stats-recorder-64c9b6d496-7xsfc 1/1 Running 0 10d
cnrm-webhook-manager-544bfccb5d-ts5fk 0/1 Terminating 0 10d
cnrm-webhook-manager-544bfccb5d-w4qrq 1/1 Running 0 83s
The latest pipelinerun now appears to be stuck getting started; there don't appear to be any taskruns:
kubectl --context=kf-ci-v1 -n auto-deploy describe pipelineruns deploy-kf-master-5djt4
Name: deploy-kf-master-5djt4
Namespace: auto-deploy
Labels: auto-deploy-base-name=kf-ci
auto-deploy-group=gcp-blueprint-master
Annotations: <none>
API Version: tekton.dev/v1alpha1
Kind: PipelineRun
Metadata:
Creation Timestamp: 2020-10-29T14:40:34Z
Generate Name: deploy-kf-master-
Generation: 1
Resource Version: 226255709
Self Link: /apis/tekton.dev/v1alpha1/namespaces/auto-deploy/pipelineruns/deploy-kf-master-5djt4
UID: b3a2e05b-19f4-11eb-9df6-42010a8e00b4
Spec:
Params:
Name: artifacts-gcs
Value: gs://kubernetes-jenkins/pr-logs/pull/kubeflow_gcp-blueprints/137/kubeflow-gcp-blueprints-presubmit/1321824194224197632
Name: junit-path
Value: artifacts/junit_deploy-kf-master-
Name: test-target-name
Value: deploy
Pipeline Ref:
Name: deploy-gcp-blueprint
Resources:
Name: blueprint-repo
Resource Spec:
Params:
Name: url
Value: https://github.com/kubeflow/gcp-blueprints.git
Name: revision
Value: refs/pull/137/head
Type: git
Name: testing-repo
Resource Spec:
Params:
Name: revision
Value: master
Name: url
Value: https://github.com/kubeflow/testing.git
Type: git
Service Account Name: default-editor
Timeout: 1h0m0s
Events: <none>
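A quick way to confirm that no taskruns were created for this pipelinerun is to query by the tekton.dev/pipelineRun label that Tekton stamps on its taskruns (a sketch, reusing the label seen on the resources earlier in this issue):

```
# List taskruns belonging to this pipelinerun; an empty result means none started
kubectl --context=kf-ci-v1 -n auto-deploy get taskruns \
  -l tekton.dev/pipelineRun=deploy-kf-master-5djt4
```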
I think we also need to upgrade our current version of Config Connector (1.9.1) to the latest one, as mentioned in https://github.com/GoogleCloudPlatform/k8s-config-connector/issues/252#issuecomment-696133051
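For reference, the running Config Connector version can be read off the controller pods' annotations (a sketch; the cnrm.cloud.google.com/version annotation appears in the describe output above):

```
# Print the Config Connector version annotation for each cnrm-system pod
kubectl --context=kubeflow-ci-management -n cnrm-system get pods \
  -o jsonpath='{.items[*].metadata.annotations.cnrm\.cloud\.google\.com/version}'
```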
It looks like there are a huge number of Tekton PipelineRuns accumulating in the kf-ci namespace of the kf-ci-v1 cluster.
Looks like we have 49749 pipelineruns. I'm wondering if that is affecting the Tekton pipelineruns controller and could explain why the latest runs aren't starting.
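A sketch of how to get that count (assuming the runs live in the kf-ci namespace targeted by the delete below):

```
# Count pipelineruns in the namespace
kubectl --context=kf-ci-v1 -n kf-ci get pipelineruns --no-headers | wc -l
```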
Let's try deleting all the pipelineruns:
kubectl --context=kf-ci-v1 -n kf-ci delete pipelineruns --all=true
I'm running the program in kubeflow/testing#767 to try to GC all the finished taskruns and see if that fixes things.
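As a rough shell equivalent of that GC (a sketch, not the actual program in kubeflow/testing#767; it assumes jq is installed and relies on finished Tekton runs carrying a Succeeded condition with status True or False, while in-flight runs have status Unknown):

```
# Delete only taskruns that have finished (Succeeded condition resolved to True/False)
kubectl --context=kf-ci-v1 -n kf-ci get taskruns -o json \
  | jq -r '.items[]
      | select(any(.status.conditions[]?; .type == "Succeeded" and .status != "Unknown"))
      | .metadata.name' \
  | xargs -r -n 20 kubectl --context=kf-ci-v1 -n kf-ci delete taskruns
```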
Tekton pipelines appear to be running again.
/test all
The auto deployments are unhealthy. It looks to me like we are running short on quota in project kubeflow-ci-deployment:
gcloud --project=kubeflow-ci-deployment container clusters list --format="table(name, zone, createTime)" --sort-by=createTime
NAME LOCATION CREATE_TIME
deployapp us-east1-d 2019-04-26T22:35:53+00:00
apps us-central1-a 2019-09-18T23:04:32+00:00
myapp2 us-central1-a 2019-10-03T04:18:04+00:00
kfctl-3263 us-central1-a 2020-10-22T11:03:24+00:00
kfctl-121b us-central1-a 2020-10-22T14:08:04+00:00
kfctl-c19d us-central1-a 2020-10-22T18:58:12+00:00
kfctl-4dda us-central1-a 2020-10-27T22:26:54+00:00
kfctl-b3cf us-central1-a 2020-10-27T23:28:22+00:00
kfctl-796e us-central1-a 2020-10-27T23:33:47+00:00
kfctl-9ad9 us-central1-a 2020-10-28T03:33:27+00:00
kfctl-b485 us-central1-a 2020-10-28T07:32:36+00:00
kfctl-fc27 us-central1-a 2020-10-28T07:34:23+00:00
kfctl-297d us-central1-a 2020-10-28T07:35:04+00:00
kfctl-9df7 us-central1-a 2020-10-28T11:41:12+00:00
kfctl-cbcc us-central1-a 2020-10-28T19:16:14+00:00
kfctl-5f95 us-central1-a 2020-10-28T19:35:51+00:00
kfctl-18dc us-central1-a 2020-10-28T19:36:08+00:00
kfctl-d3f9 us-central1-a 2020-10-28T23:43:09+00:00
kfctl-07dc us-central1-a 2020-10-29T03:37:24+00:00
kfctl-987d us-central1-a 2020-10-29T03:44:42+00:00
kfctl-a11d us-central1-a 2020-10-29T07:40:16+00:00
kfctl-1c1c us-central1-a 2020-10-29T11:11:20+00:00
kfctl-fd8d us-central1-a 2020-10-29T14:32:29+00:00
kf-v1-1029-49d us-east1-c 2020-10-29T20:04:58+00:00
I'm not sure where all of the kfctl clusters are coming from.
My conjecture is that these are coming from the kfctl repo: https://github.com/kubeflow/kfctl/blob/v1.0-branch/prow_config.yaml
In particular, I think they are coming from the previous release branches, which are still running tests.
I selected all of the clusters except the "kf-v1" cluster for deletion. I'm not sure what all the apps clusters are for; I suspect at least some of them were for click-to-deploy, which we are no longer using.
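Scripted, the bulk deletion looks roughly like this (a sketch; per the listing above all the kfctl-* clusters are in us-central1-a, and --async avoids waiting on each deletion):

```
# Delete every cluster whose name starts with kfctl-
for c in $(gcloud --project=kubeflow-ci-deployment container clusters list \
    --filter="name~^kfctl-" --format="value(name)"); do
  gcloud --project=kubeflow-ci-deployment container clusters delete "$c" \
    --zone=us-central1-a --quiet --async
done
```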
I also deleted a bunch of service accounts and IP addresses. This should hopefully free up some quota.
Judging by the cluster timestamps, it looks like resources are still being GC'd, as we don't have any kfctl clusters older than 10/22. But we will need to find and shut down the processes that keep creating them.
Hopefully the PRs referenced above will do this.
Issue-Label Bot is automatically applying the labels:
Label | Probability
---|---
area/kfctl | 0.53
So it looks like the blueprint clusters were created successfully. The endpoints aren't accessible, though.
gcloud --project=kubeflow-ci-deployment container clusters list --format="table(name, zone, createTime)" --sort-by=createTime
NAME LOCATION CREATE_TIME
kf-v1-1-1030-d8b us-central1-c 2020-10-30T02:55:18+00:00
kf-vbp-1030-b1f us-central1-c 2020-10-30T02:55:33+00:00
kf-v1-1030-72e us-east1-c 2020-10-30T08:08:28+00:00
Looking at the logs, it looks like there was a timeout waiting for the ContainerCluster to become ready:
error: timed out waiting for the condition on containerclusters/kf-vbp-1030-b1f
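That message looks like the output of a kubectl wait that hit its timeout. Inspecting the ContainerCluster's status conditions directly should say why it never became ready (a sketch, assuming the resource lives in the kubeflow-ci-deployment namespace of the management cluster, like the Service resource in the patch error at the top of this issue):

```
# Dump the ContainerCluster's conditions to see why it never reached Ready
kubectl --context=kubeflow-ci-management -n kubeflow-ci-deployment \
  get containercluster kf-vbp-1030-b1f -o jsonpath='{.status.conditions}'
```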
It looks like the last blueprint cluster might have been getting deployed around the same time that I was cleaning up kubeflow-ci-deployment.
Another auto-deployment just started, and it looks like the cluster was created correctly and progressed to deploying the applications.
The latest autodeployment is up and healthy. https://kf-vbp-1030-3b5.endpoints.kubeflow-ci-deployment.cloud.goog/
Thank you @jlewi!
See https://prow.k8s.io/view/gcs/kubernetes-jenkins/pr-logs/pull/kubeflow_gcp-blueprints/137/kubeflow-gcp-blueprints-presubmit/1321719992365879296/