GoogleCloudPlatform / kubeflow-distribution

Blueprints for Deploying Kubeflow on Google Cloud Platform and Anthos
Apache License 2.0

presubmit test broken - 10.29 #143

Closed Bobgy closed 3 years ago

Bobgy commented 3 years ago

See https://prow.k8s.io/view/gcs/kubernetes-jenkins/pr-logs/pull/kubeflow_gcp-blueprints/137/kubeflow-gcp-blueprints-presubmit/1321719992365879296/

issue-label-bot[bot] commented 3 years ago

Issue-Label Bot is automatically applying the labels:

Label Probability
kind/bug 0.93
area/engprod 0.89
platform/gcp 0.68


jlewi commented 3 years ago

Here's the error

INFO|2020-10-29T09:40:52|/workspace/testing-repo/py/kubeflow/testing/util.py|72| Error from server (InternalError): error when applying patch:
INFO|2020-10-29T09:40:52|/workspace/testing-repo/py/kubeflow/testing/util.py|72| {"metadata":{"annotations":{"kubectl.kubernetes.io/last-applied-configuration":"{\"apiVersion\":\"serviceusage.cnrm.cloud.google.com/v1beta1\",\"kind\":\"Service\",\"metadata\":{\"annotations\":{\"cnrm.cloud.google.com/deletion-policy\":\"abandon\",\"cnrm.cloud.google.com/disable-dependent-services\":\"false\"},\"labels\":{\"app.kubernetes.io/managed-by\":\"configmanagement.gke.io\",\"auto-deploy-base-name\":\"kf-ci\",\"auto-deploy-group\":\"gcp-blueprint-master\",\"kf-name\":\"kf-vbp-1029-99d\",\"tekton.dev/pipeline\":\"deploy-gcp-blueprint\",\"tekton.dev/pipelineRun\":\"deploy-kf-master-txrwf\",\"tekton.dev/pipelineTask\":\"deploy-gcp\",\"tekton.dev/task\":\"deploy-gcp-blueprint\",\"tekton.dev/taskRun\":\"deploy-kf-master-txrwf-deploy-gcp-ng9wf\"},\"name\":\"anthos.googleapis.com\",\"namespace\":\"kubeflow-ci-deployment\"}}\n"},"labels":{"auto-deploy-base-name":"kf-ci","blueprint-repo-commit":null,"kf-name":"kf-vbp-1029-99d","tekton.dev/pipelineRun":"deploy-kf-master-txrwf","tekton.dev/taskRun":"deploy-kf-master-txrwf-deploy-gcp-ng9wf"}}}
INFO|2020-10-29T09:40:52|/workspace/testing-repo/py/kubeflow/testing/util.py|72| to:
INFO|2020-10-29T09:40:52|/workspace/testing-repo/py/kubeflow/testing/util.py|72| Resource: "serviceusage.cnrm.cloud.google.com/v1beta1, Resource=services", GroupVersionKind: "serviceusage.cnrm.cloud.google.com/v1beta1, Kind=Service"
INFO|2020-10-29T09:40:52|/workspace/testing-repo/py/kubeflow/testing/util.py|72| Name: "anthos.googleapis.com", Namespace: "kubeflow-ci-deployment"
jlewi commented 3 years ago

Based on the test logs, the management cluster is gke_kubeflow-ci_us-central1_kf-ci-management. The cnrm-webhook-manager pod is reported as unhealthy.
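
For reference, a minimal way to point kubectl at that cluster and look at the Config Connector pods (project, location, and cluster name inferred from the context string above):

gcloud container clusters get-credentials kf-ci-management --region=us-central1 --project=kubeflow-ci
kubectl -n cnrm-system get pods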

jlewi commented 3 years ago
kubectl --context=kubeflow-ci-management -n cnrm-system describe pods cnrm-webhook-manager-544bfccb5d-ts5fk
Name:           cnrm-webhook-manager-544bfccb5d-ts5fk
Namespace:      cnrm-system
Priority:       0
Node:           gke-kf-ci-management-kf-ci-management-d6b2e6ad-jgkd/10.128.0.49
Start Time:     Mon, 19 Oct 2020 01:32:50 -0700
Labels:         cnrm.cloud.google.com/component=cnrm-webhook-manager
                cnrm.cloud.google.com/system=true
                pod-template-hash=544bfccb5d
Annotations:    cnrm.cloud.google.com/version: 1.9.1
Status:         Running
IP:             10.20.1.8
Controlled By:  ReplicaSet/cnrm-webhook-manager-544bfccb5d
Containers:
  webhook:
    Container ID:  docker://87d89a1ddfc88138500f220c3103c0f145b149f334cdafee11a1bef8e98feec8
    Image:         gcr.io/cnrm-eap/webhook:97b6128
    Image ID:      docker-pullable://gcr.io/cnrm-eap/webhook@sha256:4d727354e9cde8efeafbaba16ef79766b1b1a92df3787df59f933b59b0eb1d61
    Port:          <none>
    Host Port:     <none>
    Command:
      /configconnector/webhook
    Args:
      --stderrthreshold=INFO
    State:          Running
      Started:      Mon, 19 Oct 2020 01:33:07 -0700
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:     40m
      memory:  64Mi
    Requests:
      cpu:      20m
      memory:   32Mi
    Readiness:  exec [cat /tmp/ready] delay=3s timeout=1s period=3s #success=1 #failure=3
    Environment:
      NAMESPACE:  cnrm-system (v1:metadata.namespace)
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from cnrm-webhook-manager-token-b6f7k (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  cnrm-webhook-manager-token-b6f7k:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  cnrm-webhook-manager-token-b6f7k
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason     Age                    From                                                          Message
  ----     ------     ----                   ----                                                          -------
  Warning  Unhealthy  103s (x41111 over 9d)  kubelet, gke-kf-ci-management-kf-ci-management-d6b2e6ad-jgkd  Readiness probe failed: OCI runtime exec failed: exec failed: cannot exec a container that has stopped: unknown
jlewi commented 3 years ago

I deleted the pod and a new one restarted.
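
(The delete itself was presumably something like the following, with the pod name taken from the describe output above.)

kubectl --context=kubeflow-ci-management -n cnrm-system delete pod cnrm-webhook-manager-544bfccb5d-ts5fk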

kubectl --context=kubeflow-ci-management -n cnrm-system get pods
NAME                                            READY   STATUS        RESTARTS   AGE
cnrm-controller-manager-0                       1/1     Running       0          10d
cnrm-deletiondefender-0                         1/1     Running       0          10d
cnrm-resource-stats-recorder-64c9b6d496-7xsfc   1/1     Running       0          10d
cnrm-webhook-manager-544bfccb5d-ts5fk           0/1     Terminating   0          10d
cnrm-webhook-manager-544bfccb5d-w4qrq           1/1     Running       0          83s
jlewi commented 3 years ago

The latest pipelinerun now appears to be stuck getting started; there don't appear to be any taskruns.

kubectl --context=kf-ci-v1 -n auto-deploy describe pipelineruns deploy-kf-master-5djt4
Name:         deploy-kf-master-5djt4
Namespace:    auto-deploy
Labels:       auto-deploy-base-name=kf-ci
              auto-deploy-group=gcp-blueprint-master
Annotations:  <none>
API Version:  tekton.dev/v1alpha1
Kind:         PipelineRun
Metadata:
  Creation Timestamp:  2020-10-29T14:40:34Z
  Generate Name:       deploy-kf-master-
  Generation:          1
  Resource Version:    226255709
  Self Link:           /apis/tekton.dev/v1alpha1/namespaces/auto-deploy/pipelineruns/deploy-kf-master-5djt4
  UID:                 b3a2e05b-19f4-11eb-9df6-42010a8e00b4
Spec:
  Params:
    Name:   artifacts-gcs
    Value:  gs://kubernetes-jenkins/pr-logs/pull/kubeflow_gcp-blueprints/137/kubeflow-gcp-blueprints-presubmit/1321824194224197632
    Name:   junit-path
    Value:  artifacts/junit_deploy-kf-master-
    Name:   test-target-name
    Value:  deploy
  Pipeline Ref:
    Name:  deploy-gcp-blueprint
  Resources:
    Name:  blueprint-repo
    Resource Spec:
      Params:
        Name:   url
        Value:  https://github.com/kubeflow/gcp-blueprints.git
        Name:   revision
        Value:  refs/pull/137/head
      Type:     git
    Name:       testing-repo
    Resource Spec:
      Params:
        Name:            revision
        Value:           master
        Name:            url
        Value:           https://github.com/kubeflow/testing.git
      Type:              git
  Service Account Name:  default-editor
  Timeout:               1h0m0s
Events:                  <none>
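
One way to confirm that no TaskRuns were created for this run (assuming the standard tekton.dev/pipelineRun label on child TaskRuns, as seen in the error log above):

kubectl --context=kf-ci-v1 -n auto-deploy get taskruns -l tekton.dev/pipelineRun=deploy-kf-master-5djt4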
subodh101 commented 3 years ago

I think we also need to upgrade our current version of Config Connector (1.9.1) to the latest one, as mentioned in https://github.com/GoogleCloudPlatform/k8s-config-connector/issues/252#issuecomment-696133051
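
For reference, the installed Config Connector version can be read from the cnrm.cloud.google.com/version annotation shown in the describe output above, e.g.:

kubectl --context=kubeflow-ci-management -n cnrm-system get pods -o jsonpath='{.items[*].metadata.annotations.cnrm\.cloud\.google\.com/version}'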

jlewi commented 3 years ago

It looks like there are a huge number of Tekton PipelineRuns accumulating in the kf-ci-v1 cluster.

Looks like we have 49749 pipelineruns. I'm wondering if that is affecting the Tekton pipelineruns controller, which would explain why the latest runs aren't starting.

pipelineruns.txt
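
That count can be reproduced with something like the following (assuming the runs live in the kf-ci namespace targeted below):

kubectl --context=kf-ci-v1 -n kf-ci get pipelineruns --no-headers | wc -l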

Let's try deleting all the pipelineruns:

kubectl --context=kf-ci-v1 -n kf-ci delete pipelineruns --all=true
jlewi commented 3 years ago

I'm running the program in kubeflow/testing#767 to try to GC all the finished taskruns and see if that fixes things.
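
A rough kubectl-only equivalent of that GC logic (not the actual kubeflow/testing#767 tool, and assuming both resource types live in the kf-ci namespace used above) is to delete every run that already has a completionTime set:

for kind in taskruns pipelineruns; do
  kubectl --context=kf-ci-v1 -n kf-ci get "$kind" -o json \
    | jq -r '.items[] | select(.status.completionTime != null) | .metadata.name' \
    | xargs -r kubectl --context=kf-ci-v1 -n kf-ci delete "$kind"
done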

jlewi commented 3 years ago

Tekton pipelines appear to be running again.

/test all

jlewi commented 3 years ago

The auto deployments are unhealthy. It looks to me like we are running short on quota in project kubeflow-ci-deployment.

gcloud --project=kubeflow-ci-deployment container clusters list --format="table(name, zone, createTime)" --sort-by=createTime
NAME            LOCATION       CREATE_TIME
deployapp       us-east1-d     2019-04-26T22:35:53+00:00
apps            us-central1-a  2019-09-18T23:04:32+00:00
myapp2          us-central1-a  2019-10-03T04:18:04+00:00
kfctl-3263      us-central1-a  2020-10-22T11:03:24+00:00
kfctl-121b      us-central1-a  2020-10-22T14:08:04+00:00
kfctl-c19d      us-central1-a  2020-10-22T18:58:12+00:00
kfctl-4dda      us-central1-a  2020-10-27T22:26:54+00:00
kfctl-b3cf      us-central1-a  2020-10-27T23:28:22+00:00
kfctl-796e      us-central1-a  2020-10-27T23:33:47+00:00
kfctl-9ad9      us-central1-a  2020-10-28T03:33:27+00:00
kfctl-b485      us-central1-a  2020-10-28T07:32:36+00:00
kfctl-fc27      us-central1-a  2020-10-28T07:34:23+00:00
kfctl-297d      us-central1-a  2020-10-28T07:35:04+00:00
kfctl-9df7      us-central1-a  2020-10-28T11:41:12+00:00
kfctl-cbcc      us-central1-a  2020-10-28T19:16:14+00:00
kfctl-5f95      us-central1-a  2020-10-28T19:35:51+00:00
kfctl-18dc      us-central1-a  2020-10-28T19:36:08+00:00
kfctl-d3f9      us-central1-a  2020-10-28T23:43:09+00:00
kfctl-07dc      us-central1-a  2020-10-29T03:37:24+00:00
kfctl-987d      us-central1-a  2020-10-29T03:44:42+00:00
kfctl-a11d      us-central1-a  2020-10-29T07:40:16+00:00
kfctl-1c1c      us-central1-a  2020-10-29T11:11:20+00:00
kfctl-fd8d      us-central1-a  2020-10-29T14:32:29+00:00
kf-v1-1029-49d  us-east1-c     2020-10-29T20:04:58+00:00

I'm not sure where all of the kfctl clusters are coming from.

jlewi commented 3 years ago

My conjecture is that these are coming from the kfctl repo: https://github.com/kubeflow/kfctl/blob/v1.0-branch/prow_config.yaml

In particular I think they are coming from the previous release branches which are still running tests.

jlewi commented 3 years ago

I selected all of the clusters except the "kf-v1" cluster for deletion. I'm not sure what all the apps clusters are for; I suspect at least some of them were for click-to-deploy, which we are no longer using.
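
A sketch of that kind of cleanup, restricted to the kfctl-* clusters in the listing above (--quiet skips the confirmation prompt, --async avoids waiting on each delete):

gcloud --project=kubeflow-ci-deployment container clusters list \
    --filter="name~^kfctl-" --format="value(name,zone)" \
  | while read -r name zone; do
      gcloud --project=kubeflow-ci-deployment container clusters delete "$name" --zone="$zone" --quiet --async
    done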

I also deleted a bunch of service accounts and IP addresses. This should hopefully free up some quota.

Judging by the cluster timestamps, it looks like resources are still being GC'd, as we don't have any older than 10/22. But we will need to find and shut down the processes that keep creating them.

Hopefully the PRs referenced above will do this.

issue-label-bot[bot] commented 3 years ago

Issue-Label Bot is automatically applying the labels:

Label Probability
area/kfctl 0.53


jlewi commented 3 years ago

So it looks like the blueprint clusters were able to get created. The endpoints aren't accessible though.
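
A quick reachability check (the endpoint naming pattern is assumed from the healthy deployment linked further down):

curl -s -o /dev/null -w '%{http_code}\n' https://kf-vbp-1030-b1f.endpoints.kubeflow-ci-deployment.cloud.goog/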

gcloud --project=kubeflow-ci-deployment container clusters list --format="table(name, zone, createTime)" --sort-by=createTime
Reauthentication required.
Please insert and touch your security key
Timed out while waiting for security key touch.
Please insert and touch your security key
NAME              LOCATION       CREATE_TIME
kf-v1-1-1030-d8b  us-central1-c  2020-10-30T02:55:18+00:00
kf-vbp-1030-b1f   us-central1-c  2020-10-30T02:55:33+00:00
kf-v1-1030-72e    us-east1-c     2020-10-30T08:08:28+00:00
jlewi commented 3 years ago

Looking at the logs, it looks like there was a timeout waiting on the cluster:

error: timed out waiting for the condition on containerclusters/kf-vbp-1030-b1f

It looks like the last blueprint cluster might have been getting deployed around the same time I was cleaning up kubeflow-ci-deployment.
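
That message is the standard kubectl wait timeout output; the deploy step presumably runs something along these lines against the management cluster (the context and timeout value here are guesses):

kubectl --context=kubeflow-ci-management -n kubeflow-ci-deployment wait --for=condition=Ready containercluster/kf-vbp-1030-b1f --timeout=30m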

jlewi commented 3 years ago

Another auto-deployment just started and it looks like the cluster was correctly created and it progressed to deploying the applications.

jlewi commented 3 years ago

The latest autodeployment is up and healthy. https://kf-vbp-1030-3b5.endpoints.kubeflow-ci-deployment.cloud.goog/

Bobgy commented 3 years ago

Thank you @jlewi!