canonical / microk8s

MicroK8s is a small, fast, single-package Kubernetes for datacenters and the edge.
https://microk8s.io
Apache License 2.0

microk8s kubeflow: hang on enable with various pod errors #1698

Closed didier-durand closed 3 years ago

didier-durand commented 3 years ago

Hello,

I am trying to enable Kubeflow on MicroK8s (1.19, classic) on GCE (n2-standard-8 instance: 8 cores, 32 GB RAM). It hangs indefinitely on this message:

Waited 615s for operator pods to come up, 18 remaining.
Waited 630s for operator pods to come up, 18 remaining.
Waited 645s for operator pods to come up, 18 remaining.
Waited 660s for operator pods to come up, 18 remaining.

After running for 24 minutes, I get the pod list below. (I have read #1071; RBAC is disabled, so that does not seem to be the cause.)

How can I get it to work? Let me know if additional info is required.

Thanks

Didier

microk8s kubectl get pods -n kubeflow
NAME                                           READY   STATUS                 RESTARTS   AGE
argo-controller-6fc8f85d44-ljxv4               0/1     Evicted                0          25m
oidc-gatekeeper-778cc55547-qb9sc               0/1     Evicted                0          25m
pipelines-api-7d67b7f44f-d6sjf                 0/1     Evicted                0          25m
oidc-gatekeeper-778cc55547-gb74j               0/1     Evicted                0          25m
pipelines-api-7d67b7f44f-g67jz                 0/1     Evicted                0          25m
oidc-gatekeeper-778cc55547-h8fxv               0/1     Evicted                0          25m
pipelines-api-7d67b7f44f-cvf6p                 0/1     Evicted                0          25m
pipelines-api-7d67b7f44f-cwwss                 0/1     Evicted                0          25m
pipelines-api-7d67b7f44f-7967z                 0/1     Evicted                0          25m
pipelines-api-7d67b7f44f-rdb2n                 0/1     Evicted                0          25m
ambassador-85b668dcc4-68m92                    0/1     Evicted                0          28m
metadata-envoy-5cd4f47775-f66s9                0/1     Evicted                0          26m
argo-controller-6fc8f85d44-kg8jn               0/1     Evicted                0          25m
dex-auth-86d9765856-2mt27                      0/1     Init:0/1               0          25m
minio-operator-0                               0/1     Unknown                0          26m
pytorch-operator-operator-0                    0/1     Unknown                0          24m
metadata-grpc-operator-0                       0/1     Unknown                0          27m
argo-controller-6fc8f85d44-85c6g               0/1     Init:0/1               0          17m
katib-db-manager-operator-0                    0/1     ContainerCreating      0          27m
dex-auth-6c7bd6d48d-54278                      0/1     Evicted                0          24m
kubeflow-profiles-operator-0                   0/1     Pending                0          8m10s
dex-auth-6c7bd6d48d-rszjz                      0/1     Pending                0          8m7s
dex-auth-operator-0                            0/1     Unknown                1          24m
katib-db-operator-0                            0/1     Unknown                0          25m
metadata-db-operator-0                         0/1     Unknown                0          24m
jupyter-web-operator-0                         0/1     Unknown                0          24m
metacontroller-operator-0                      0/1     Unknown                0          27m
oidc-gatekeeper-operator-0                     0/1     Unknown                0          26m
ambassador-operator-0                          0/1     Unknown                1          29m
jupyter-controller-operator-0                  0/1     Unknown                0          27m
ambassador-85b668dcc4-q5dlm                    0/1     Init:Unknown           0          24m
metadata-envoy-operator-0                      0/1     Unknown                0          27m
pipelines-viewer-5df646f87d-sntgq              0/1     Init:Unknown           0          17m
metacontroller-8f65dd64-jgwdc                  0/1     Init:Unknown           0          27m
pipelines-scheduledworkflow-595cff68b7-sf25x   0/1     Init:Unknown           0          26m
pipelines-db-0                                 0/1     Init:Unknown           0          24m
kubeflow-dashboard-operator-0                  0/1     Unknown                0          27m
oidc-gatekeeper-778cc55547-7j6ds               0/1     Init:Unknown           0          25m
metadata-api-operator-0                        0/1     Unknown                1          24m
jupyter-controller-d4c6989fd-fqrls             0/1     Init:Unknown           1          27m
tf-job-operator-operator-0                     0/1     Unknown                0          24m
pipelines-scheduledworkflow-operator-0         0/1     Unknown                0          27m
argo-ui-operator-0                             1/1     Running                1          28m
katib-ui-operator-0                            0/1     Unknown                0          24m
argo-controller-operator-0                     1/1     Running                1          28m
pipelines-ui-646f785cf6-gkbsg                  0/1     Init:Unknown           0          25m
seldon-core-operator-0                         0/1     Unknown                0          24m
pipelines-viewer-operator-0                    0/1     Unknown                0          24m
pipelines-ui-operator-0                        0/1     Pending                0          31s
pipelines-visualization-operator-0             0/1     Unknown                1          25m
minio-0                                        0/1     Init:Unknown           0          25m
metadata-ui-operator-0                         1/1     Running                1          24m
argo-ui-868cc7c496-6q5f8                       0/1     Init:0/1               2          27m
katib-controller-operator-0                    1/1     Running                1          24m
pipelines-api-7d67b7f44f-kwbqk                 0/1     Init:0/1               0          25m
pipelines-api-operator-0                       0/1     Unknown                0          27m
pipelines-db-operator-0                        0/1     Unknown                0          27m
metadata-envoy-5cd4f47775-pz5gj                0/1     Init:0/1               0          24m
pipelines-persistence-operator-0               0/1     CreateContainerError   1          24m

ktsakalozos commented 3 years ago

Hi @didier-durand

I see that some pods were evicted. This usually happens when there are not enough resources (disk or memory). The inspection tarball might tell us more about what is happening.
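To see at a glance how many pods are in each state, the pod listing can be tallied by its STATUS column. This is only a minimal sketch, assuming the default `kubectl get pods` column layout (NAME READY STATUS RESTARTS AGE); `count_by_status` is a hypothetical helper name, not part of MicroK8s.

```shell
# Tally pods by the STATUS column (3rd field) of `kubectl get pods` output.
# Reads the listing on stdin, skips the header row, and prints "<count> <status>"
# sorted by descending count, then by status name.
count_by_status() {
  awk 'NR > 1 && NF >= 3 { n[$3]++ } END { for (s in n) print n[s], s }' \
    | sort -k1,1nr -k2,2
}

# Usage on a live cluster (assumes microk8s is installed):
#   microk8s kubectl get pods -n kubeflow | count_by_status
```

A large "Evicted" count in that tally is the hint that the node itself is short on a resource, rather than a single workload being broken.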

didier-durand commented 3 years ago

Attached is the inspection report taken after approx 6 minutes of install:

inspection-report-20201030_142237.tar.gz

The machine is a Google Cloud GCE instance: n2-standard-8: 8 cores - 32 GB.

Didier

knkski commented 3 years ago

@didier-durand: I see in the inspection report that the node has disk pressure. You might need to increase the disk space; I would recommend a minimum of 80 GB to be sure that it works.
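For anyone wanting to check this before enabling Kubeflow, here is a rough sketch. `has_free_gb` is a hypothetical helper, and both the `/var/snap/microk8s` path and the use of free space (rather than total disk size) as the yardstick are assumptions for illustration.

```shell
# Return success if the filesystem holding PATH has at least MIN_GB gigabytes free.
has_free_gb() {
  path="$1"
  min_gb="$2"
  # df -P forces one line per filesystem; with -k, field 4 is available KiB.
  avail_kb="$(df -Pk "$path" | awk 'NR == 2 { print $4 }')"
  [ "$avail_kb" -ge $((min_gb * 1024 * 1024)) ]
}

# Usage (assumes microk8s is installed; the path is an assumption):
#   microk8s kubectl describe nodes | grep -i pressure
#   has_free_gb /var/snap/microk8s 80 || echo "less than 80 GB free"
```

The `describe nodes` line shows the node conditions (including DiskPressure) as the kubelet reports them, which is the same signal that appeared in the inspection report.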

didier-durand commented 3 years ago

@knkski :

That was the issue: I raised my boot disk size to 100 GB and the operator pods came up. Thanks for that! But now it stops on a new issue after all 30 operator pods have started (full trace below):

Waiting for service pods to become ready.
Kubeflow could not be enabled:
Error from server (NotFound): mutatingwebhookconfigurations.admissionregistration.k8s.io "katib-mutating-webhook-config" not found
Error from server (NotFound): validatingwebhookconfigurations.admissionregistration.k8s.io "katib-validating-webhook-config" not found

Can you please help further? Thanks.

Didier

Enabling dns...
Enabling storage...
Enabling dashboard...
Enabling ingress...
Enabling metallb:10.64.140.43-10.64.140.49...
Waiting for DNS and storage plugins to finish setting up
Deploying Kubeflow...
Kubeflow deployed.
Waiting for operator pods to become ready.
Waited 0s for operator pods to come up, 30 remaining.
Waited 15s for operator pods to come up, 29 remaining.
Waited 30s for operator pods to come up, 29 remaining.
Waited 45s for operator pods to come up, 28 remaining.
Waited 60s for operator pods to come up, 28 remaining.
Waited 75s for operator pods to come up, 26 remaining.
Waited 90s for operator pods to come up, 24 remaining.
Waited 105s for operator pods to come up, 20 remaining.
Waited 120s for operator pods to come up, 20 remaining.
Waited 135s for operator pods to come up, 17 remaining.
Waited 150s for operator pods to come up, 16 remaining.
Waited 165s for operator pods to come up, 16 remaining.
Waited 180s for operator pods to come up, 15 remaining.
Waited 195s for operator pods to come up, 13 remaining.
Waited 210s for operator pods to come up, 11 remaining.
Waited 225s for operator pods to come up, 11 remaining.
Waited 240s for operator pods to come up, 9 remaining.
Waited 255s for operator pods to come up, 7 remaining.
Waited 270s for operator pods to come up, 7 remaining.
Waited 285s for operator pods to come up, 7 remaining.
Waited 300s for operator pods to come up, 7 remaining.
Waited 315s for operator pods to come up, 7 remaining.
Waited 330s for operator pods to come up, 6 remaining.
Waited 345s for operator pods to come up, 5 remaining.
Waited 360s for operator pods to come up, 4 remaining.
Waited 375s for operator pods to come up, 3 remaining.
Waited 390s for operator pods to come up, 1 remaining.
Waited 405s for operator pods to come up, 1 remaining.
Waited 420s for operator pods to come up, 1 remaining.
Operator pods ready.
Waiting for service pods to become ready.
Kubeflow could not be enabled:
Error from server (NotFound): mutatingwebhookconfigurations.admissionregistration.k8s.io "katib-mutating-webhook-config" not found
Error from server (NotFound): validatingwebhookconfigurations.admissionregistration.k8s.io "katib-validating-webhook-config" not found

Command '('microk8s-kubectl.wrapper', 'delete', 'mutatingwebhookconfigurations/katib-mutating-webhook-config', 'validatingwebhookconfigurations/katib-validating-webhook-config')' returned non-zero exit status 1
Failed to enable kubeflow
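The trace shows a `kubectl delete` exiting non-zero because the two webhook configurations did not exist. As an illustration only (this is not necessarily how the enable script was later changed), `kubectl delete` accepts `--ignore-not-found`, which treats a missing resource as a successful delete. `delete_katib_webhooks` and the `DRY_RUN` switch are hypothetical names for this sketch.

```shell
# Delete the katib webhook configurations, treating "already absent" as success.
# Set DRY_RUN=1 to print the command instead of running it.
delete_katib_webhooks() {
  ${DRY_RUN:+echo} microk8s kubectl delete --ignore-not-found \
    mutatingwebhookconfigurations/katib-mutating-webhook-config \
    validatingwebhookconfigurations/katib-validating-webhook-config
}

# Usage:
#   DRY_RUN=1 delete_katib_webhooks   # show what would be run
#   delete_katib_webhooks             # run it (assumes microk8s is installed)
```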

shrinidhisuresha commented 3 years ago

Has a solution been found for this error?

Error from server (NotFound): mutatingwebhookconfigurations.admissionregistration.k8s.io "katib-mutating-webhook-config" not found

I am facing the same issue.

didier-durand commented 3 years ago

@knkski: please let me know if I can supply additional information to help your analysis of the cause. Thanks! Didier

knkski commented 3 years ago

@didier-durand: sorry about the slow response. #1635 should fix this issue and ensure that it doesn't happen again. It will be included in 1.20/stable, or you can try it out before then by switching microk8s to the latest/edge (1.20) or latest/beta (1.19.5) channel.
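Switching an installed snap to another channel is done with `snap refresh`. A small sketch follows; `switch_microk8s_channel` and the `DRY_RUN` switch are hypothetical names, and only the `snap refresh ... --channel=` command itself is standard snap usage.

```shell
# Switch the installed microk8s snap to another channel.
# Set DRY_RUN=1 to print the command instead of running it.
switch_microk8s_channel() {
  channel="$1"
  ${DRY_RUN:+echo} sudo snap refresh microk8s --channel="$channel"
}

# Usage:
#   switch_microk8s_channel latest/edge   # or latest/beta
#   snap list microk8s                    # the Tracking column shows the channel
```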

knkski commented 3 years ago

I'm going to close this, since it should now be fixed, but feel free to reopen if you encounter the issue again.

didier-durand commented 3 years ago

@knkski: Thanks. No problem about the delay. I'll test in my own GitHub workflow and report back on whether it is fixed. Didier

didier-durand commented 3 years ago

Hi there,

After testing, I can confirm that microk8s enable kubeflow now runs successfully on a fresh Ubuntu install on a Google Cloud GCE instance (n2-standard-8) with a 250 GB hard disk.

It just takes some time, about 12 minutes in total: 7 min 30 s for the 30+ operator pods (listed below) to come up and become ready, then another 4 min 30 s to reach "Congratulations, Kubeflow is now available."

snap is installed from latest/edge: see below.

GCE image: ubuntu-2004-focal-v20201111 - image family: ubuntu-2004-lts - image project: ubuntu-os-cloud

@ktsakalozos , @knkski : thanks for your support.

Didier

ddurand@microk8s-kubeflow:~$ snap list
Name              Version    Rev    Tracking         Publisher          Notes
core              16-2.47.1  10185  latest/stable    canonical✓         core
core18            20200929   1932   latest/stable    canonical✓         base
google-cloud-sdk  318.0.0    159    latest/stable/…  google-cloud-sdk✓  classic
lxd               4.0.4      18150  4.0/stable/…     canonical✓         -
microk8s          v1.19.4    1826   latest/edge      canonical✓         classic
snapd             2.47.1     9721   latest/stable    canonical✓         snapd
ddurand@microk8s-kubeflow:~$ microk8s enable kubeflow
Enabling dns...
Enabling storage...
Enabling dashboard...
Enabling ingress...
Enabling metallb:10.64.140.43-10.64.140.49...
Waiting for DNS and storage plugins to finish setting up
Bootstrapping...
Bootstrap complete.
Successfully bootstrapped, deploying...
Kubeflow deployed.
Waiting for operator pods to become ready.
Waited 0s for operator pods to come up, 31 remaining.
Waited 15s for operator pods to come up, 31 remaining.
Waited 30s for operator pods to come up, 31 remaining.
Waited 45s for operator pods to come up, 31 remaining.
Waited 60s for operator pods to come up, 30 remaining.
Waited 75s for operator pods to come up, 29 remaining.
Waited 90s for operator pods to come up, 28 remaining.
Waited 105s for operator pods to come up, 28 remaining.
Waited 120s for operator pods to come up, 27 remaining.
Waited 135s for operator pods to come up, 27 remaining.
Waited 150s for operator pods to come up, 27 remaining.
Waited 165s for operator pods to come up, 25 remaining.
Waited 180s for operator pods to come up, 21 remaining.
Waited 195s for operator pods to come up, 20 remaining.
Waited 210s for operator pods to come up, 20 remaining.
Waited 225s for operator pods to come up, 19 remaining.
Waited 240s for operator pods to come up, 18 remaining.
Waited 255s for operator pods to come up, 17 remaining.
Waited 270s for operator pods to come up, 14 remaining.
Waited 285s for operator pods to come up, 14 remaining.
Waited 300s for operator pods to come up, 14 remaining.
Waited 315s for operator pods to come up, 14 remaining.
Waited 330s for operator pods to come up, 14 remaining.
Waited 345s for operator pods to come up, 14 remaining.
Waited 360s for operator pods to come up, 14 remaining.
Waited 375s for operator pods to come up, 13 remaining.
Waited 390s for operator pods to come up, 11 remaining.
Waited 405s for operator pods to come up, 9 remaining.
Waited 420s for operator pods to come up, 3 remaining.
Waited 435s for operator pods to come up, 3 remaining.
Waited 450s for operator pods to come up, 2 remaining.
Operator pods ready.
Waiting for service pods to become ready.
Congratulations, Kubeflow is now available.

The dashboard is available at http://localhost

    Username: admin
    Password: 2CDOKXARFGPIGKP9GI1UZKN1GRI2KR

To see these values again, run:

    microk8s juju config dex-auth static-username
    microk8s juju config dex-auth static-password

To tear down Kubeflow and associated infrastructure, run:

    microk8s disable kubeflow
ddurand@microk8s-kubeflow:~$ microk8s kubectl get pods --all-namespaces
NAMESPACE         NAME                                           READY   STATUS    RESTARTS   AGE
kube-system       calico-node-r8vrx                              1/1     Running   1          19m
kube-system       coredns-86f78bb79c-7czsv                       1/1     Running   0          17m
kube-system       hostpath-provisioner-5c65fbdb4f-2sr75          1/1     Running   0          17m
kube-system       calico-kube-controllers-847c8c99d-pk4tg        1/1     Running   0          19m
kube-system       metrics-server-8bbfb4bdb-ks6h5                 1/1     Running   0          17m
kube-system       dashboard-metrics-scraper-6c4568dc68-gj74f     1/1     Running   0          17m
kube-system       kubernetes-dashboard-7ffd448895-925vr          1/1     Running   0          17m
metallb-system    controller-559b68bfd8-lgkdv                    1/1     Running   0          17m
metallb-system    speaker-p878g                                  1/1     Running   0          17m
ingress           nginx-ingress-microk8s-controller-jxmxz        1/1     Running   0          17m
controller-uk8s   controller-0                                   2/2     Running   2          16m
controller-uk8s   modeloperator-65c978c8b4-lpzhc                 1/1     Running   0          15m
kubeflow          modeloperator-68f4bcd86f-s7nz8                 1/1     Running   0          15m
kubeflow          argo-controller-operator-0                     1/1     Running   0          15m
kubeflow          argo-ui-operator-0                             1/1     Running   0          14m
kubeflow          dex-auth-operator-0                            1/1     Running   0          14m
kubeflow          jupyter-controller-operator-0                  1/1     Running   0          14m
kubeflow          jupyter-web-operator-0                         1/1     Running   0          13m
kubeflow          argo-ui-7dbc7569d5-rph55                       1/1     Running   0          14m
kubeflow          istio-ingressgateway-operator-0                1/1     Running   0          13m
kubeflow          istio-pilot-operator-0                         1/1     Running   0          13m
kubeflow          jupyter-controller-66fd84d549-pvb8k            1/1     Running   0          13m
kubeflow          istio-pilot-54db58d6ff-cs4m5                   1/1     Running   0          13m
kubeflow          katib-controller-operator-0                    1/1     Running   0          12m
kubeflow          kubeflow-profiles-operator-0                   1/1     Running   0          12m
kubeflow          pipelines-db-operator-0                        1/1     Running   0          12m
kubeflow          pipelines-visualization-operator-0             1/1     Running   0          12m
kubeflow          tf-job-operator-operator-0                     1/1     Running   0          12m
kubeflow          pytorch-operator-operator-0                    1/1     Running   0          12m
kubeflow          katib-manager-operator-0                       1/1     Running   0          12m
kubeflow          katib-controller-758c6b7b55-nwjbg              1/1     Running   0          12m
kubeflow          katib-ui-operator-0                            1/1     Running   0          11m
kubeflow          kubeflow-dashboard-operator-0                  1/1     Running   0          11m
kubeflow          metadata-api-operator-0                        1/1     Running   0          11m
kubeflow          pipelines-scheduledworkflow-operator-0         1/1     Running   0          11m
kubeflow          pipelines-ui-operator-0                        1/1     Running   0          11m
kubeflow          pipelines-viewer-operator-0                    1/1     Running   0          11m
kubeflow          seldon-core-operator-0                         1/1     Running   0          11m
kubeflow          metadata-db-operator-0                         1/1     Running   0          11m
kubeflow          metadata-envoy-operator-0                      1/1     Running   0          11m
kubeflow          metadata-ui-operator-0                         1/1     Running   0          10m
kubeflow          minio-operator-0                               1/1     Running   0          10m
kubeflow          oidc-gatekeeper-operator-0                     1/1     Running   0          10m
kubeflow          katib-db-operator-0                            1/1     Running   0          10m
kubeflow          metacontroller-operator-0                      1/1     Running   0          10m
kubeflow          metadata-grpc-operator-0                       1/1     Running   0          10m
kubeflow          pipelines-api-operator-0                       1/1     Running   0          10m
kubeflow          pipelines-persistence-operator-0               1/1     Running   0          10m
kubeflow          pipelines-visualization-9dfbbf684-mwm65        1/1     Running   0          12m
kubeflow          kubeflow-profiles-559799b56b-8hln6             2/2     Running   1          12m
kubeflow          tf-job-operator-6789f578b5-wdzhn               1/1     Running   0          12m
kubeflow          pipelines-db-0                                 1/1     Running   0          12m
kubeflow          pytorch-operator-d5d55685b-76rgl               1/1     Running   0          12m
kubeflow          katib-ui-7fd6f78898-ngs68                      1/1     Running   0          11m
kubeflow          pipelines-scheduledworkflow-7c7bb5c5fb-wbrhb   1/1     Running   0          11m
kubeflow          pipelines-viewer-9688dfbb9-5twnq               1/1     Running   0          11m
kubeflow          seldon-core-7799f4dcc4-x8q65                   1/1     Running   0          10m
kubeflow          kubeflow-dashboard-58f586fbb4-c6ckn            1/1     Running   0          10m
kubeflow          jupyter-web-85675688cd-4z62p                   2/2     Running   0          12m
kubeflow          metadata-db-0                                  1/1     Running   0          10m
kubeflow          katib-db-0                                     1/1     Running   0          8m55s
kubeflow          metadata-api-788886b5cd-ml8mq                  1/1     Running   0          9m21s
kubeflow          minio-0                                        1/1     Running   0          10m
kubeflow          metadata-grpc-85776d69d4-4qcw5                 1/1     Running   0          9m5s
kubeflow          metadata-ui-5658db6c4f-hlvpj                   1/1     Running   0          8m42s
kubeflow          metacontroller-7676b7895f-7whvz                1/1     Running   0          9m
kubeflow          argo-controller-587658cd67-nqwk7               1/1     Running   0          9m4s
kubeflow          pipelines-persistence-7fc85bb56-st5zz          1/1     Running   0          8m51s
kubeflow          metadata-envoy-758b684754-fkm4r                1/1     Running   0          8m36s
kubeflow          pipelines-ui-6dd6c5cf59-8rmlr                  2/2     Running   0          8m23s
kubeflow          pipelines-api-7cc457dbcc-rzmpc                 1/1     Running   0          8m1s
kubeflow          istio-ingressgateway-59c958ddf6-drz6z          1/1     Running   0          7m44s
kubeflow          katib-manager-65dfb98fcb-8thwf                 1/1     Running   0          7m35s
kubeflow          dex-auth-8687b86488-wzm8t                      2/2     Running   2          6m8s
kubeflow          oidc-gatekeeper-7566d4f667-j67nv               2/2     Running   0          6m27s

danudeep90 commented 3 years ago

@didier-durand: I installed and enabled Kubeflow on a cloud virtual machine using:

    sudo snap install microk8s --classic --channel=latest/edge
    microk8s.enable dns dashboard storage
    microk8s.enable kubeflow

I get a success message saying the Kubeflow dashboard is available at http://localhost

I set up a SOCKS proxy on port 9999 and am able to open the Kubeflow page using the ClusterIP listed in the services, but I am unable to access the pipelines and notebook server pages.

Any idea how we can get the notebook server and pipelines pages working?

didier-durand commented 3 years ago

@danudeep90: I do not use SOCKS but regular port forwarding via kubectl, which works. Have a look at https://github.com/didier-durand/microk8s-akri to see how I use it (toward the end of the .sh script).
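As a generic illustration of kubectl port forwarding (not copied from the script linked above), something along these lines could be used. `forward_kubeflow_port`, the `DRY_RUN` switch, the service name, and the port numbers are all placeholders; the actual service has to be looked up in the cluster.

```shell
# Forward a local port to a service in the kubeflow namespace.
# The service name is a placeholder; list the real ones with:
#   microk8s kubectl get services -n kubeflow
# Set DRY_RUN=1 to print the command instead of running it.
forward_kubeflow_port() {
  service="$1"
  local_port="$2"
  remote_port="$3"
  ${DRY_RUN:+echo} microk8s kubectl port-forward -n kubeflow \
    "service/$service" "$local_port:$remote_port"
}

# Usage (hypothetical service name and ports):
#   forward_kubeflow_port my-service 8080 80
#   then browse to http://localhost:8080 on the machine running the command
```

When the cluster runs on a remote VM, the forwarded port is local to that VM, so it still has to be reached from the workstation, for example through an SSH tunnel.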