canonical / bundle-kubeflow

Charmed Kubeflow

Multiple pods in crashloopbackoff #158

Closed. camille-rodriguez closed this issue 4 years ago.

camille-rodriguez commented 4 years ago

Hello, I deployed an Ubuntu 18.04 VM with 8 GB of RAM and 40 GB of disk space. On it I installed the juju and microk8s snaps and deployed the latest bundle-kubeflow. Multiple pods are not coming online and are showing various errors. It seems to be a recurring issue. (In the output below, m is an alias for microk8s.kubectl.)
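
Roughly, the deployment looked like this (a sketch; exact channels and bundle revision may differ, the repo README has the canonical steps):

sudo snap install microk8s --classic
sudo snap install juju --classic
microk8s.enable dns storage
juju bootstrap microk8s
juju add-model kubeflow
juju deploy kubeflow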

ubuntu@ai-demo:~$ m get pods -n kubeflow
NAME                                           READY   STATUS             RESTARTS   AGE
ambassador-79574bd65b-654hx                    1/1     Running            0          57m
ambassador-operator-0                          1/1     Running            0          58m
argo-controller-64cc95f77b-nl6s2               1/1     Running            0          50m
argo-controller-operator-0                     1/1     Running            0          58m
argo-ui-84d8b568d8-krm7n                       1/1     Running            0          57m
argo-ui-operator-0                             1/1     Running            0          57m
jupyter-controller-7bccb55f46-sjl97            1/1     Running            0          57m
jupyter-controller-operator-0                  1/1     Running            0          57m
jupyter-web-9dc84c45b-mx9fr                    1/1     Running            0          57m
jupyter-web-operator-0                         1/1     Running            0          57m
katib-controller-56dd5bf95b-s45s5              1/1     Running            0          55m
katib-controller-operator-0                    1/1     Running            0          57m
katib-db-0                                     1/1     Running            1          56m
katib-db-operator-0                            1/1     Running            0          57m
katib-manager-5d6cc65b8c-vmhjc                 0/1     CrashLoopBackOff   13         48m
katib-manager-operator-0                       1/1     Running            0          57m
katib-ui-76974795f9-7r85z                      1/1     Running            0          56m
katib-ui-operator-0                            1/1     Running            0          57m
kubeflow-dashboard-757c877956-jclqd            1/1     Running            0          50m
kubeflow-dashboard-operator-0                  1/1     Running            0          57m
kubeflow-gatekeeper-6f9fcf8c55-gcdfw           1/1     Running            0          54m
kubeflow-gatekeeper-operator-0                 1/1     Running            0          57m
kubeflow-login-97d55d69f-9vzhg                 1/1     Running            0          55m
kubeflow-login-operator-0                      1/1     Running            0          56m
kubeflow-profiles-57fd5c6d78-6fzqm             2/2     Running            0          54m
kubeflow-profiles-operator-0                   1/1     Running            0          56m
metacontroller-5ccc9b744d-sw49n                1/1     Running            0          54m
metacontroller-operator-0                      1/1     Running            0          56m
metadata-controller-7f94875696-s24tr           0/1     CrashLoopBackOff   5          47m
metadata-controller-operator-0                 1/1     Running            0          56m
metadata-db-0                                  1/1     Running            1          54m
metadata-db-operator-0                         1/1     Running            0          56m
metadata-ui-58bdd9b6bc-ntzjk                   1/1     Running            0          50m
metadata-ui-operator-0                         1/1     Running            0          56m
minio-0                                        1/1     Running            0          54m
minio-operator-0                               1/1     Running            0          55m
modeldb-backend-797f77c488-9vrkz               1/2     Error              5          44m
modeldb-backend-operator-0                     1/1     Running            0          55m
modeldb-db-0                                   1/1     Running            1          54m
modeldb-db-operator-0                          1/1     Running            0          55m
modeldb-store-fbf49bdf8-7mlqh                  1/1     Running            0          54m
modeldb-store-operator-0                       1/1     Running            0          55m
modeldb-ui-78b6dd66b8-zwpdf                    1/1     Running            0          45m
modeldb-ui-operator-0                          1/1     Running            0          55m
pipelines-api-6c6f459c98-q2grc                 0/1     CrashLoopBackOff   9          44m
pipelines-api-operator-0                       1/1     Running            0          54m
pipelines-db-0                                 1/1     Running            1          53m
pipelines-db-operator-0                        1/1     Running            0          53m
pipelines-persistence-664c75f577-95x7d         0/1     CrashLoopBackOff   7          48m
pipelines-persistence-operator-0               1/1     Running            0          53m
pipelines-scheduledworkflow-79cc64c7c4-n4nwm   0/1     Init:0/1           0          50m
pipelines-scheduledworkflow-operator-0         1/1     Running            0          53m
pipelines-ui-867bd9ccf4-7fkwv                  1/1     Running            0          44m
pipelines-ui-operator-0                        1/1     Running            0          53m
pipelines-viewer-74c4f8bcd-rzxml               1/1     Running            0          50m
pipelines-viewer-operator-0                    1/1     Running            0          52m
pytorch-operator-79b6bf8d4c-bz2bt              1/1     Running            0          49m
pytorch-operator-operator-0                    1/1     Running            0          52m
redis-5b9c9c4b45-ctcc7                         1/1     Running            0          48m
redis-operator-0                               1/1     Running            0          52m
seldon-api-frontend-74cbb778cc-t4w4g           1/1     Running            0          45m
seldon-api-frontend-operator-0                 1/1     Running            0          52m
seldon-cluster-manager-747565f949-47k9z        1/1     Running            1          45m
seldon-cluster-manager-operator-0              1/1     Running            0          52m
tensorboard-85b6fc699f-vj9fg                   1/1     Running            0          49m
tensorboard-operator-0                         1/1     Running            0          52m
tf-job-dashboard-9b5b659bb-x8cqm               1/1     Running            0          49m
tf-job-dashboard-operator-0                    1/1     Running            0          52m
tf-job-operator-6bc8cb454c-gwxls               1/1     Running            0          48m
tf-job-operator-operator-0                     1/1     Running            0          52m
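
To focus on just the unhealthy pods, a plain grep over the listing is enough here (a field selector would work too):

m get pods -n kubeflow | grep -vE 'Running|Completed'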

The first problematic pod is katib-manager. It looks like a readiness issue: the gRPC health port (6789) isn't responding.

ubuntu@ai-demo:~/bundle-kubeflow$ m describe pod katib-manager-5d6cc65b8c-vmhjc -n kubeflow
Name:         katib-manager-5d6cc65b8c-vmhjc
Namespace:    kubeflow
Priority:     0
Node:         ai-demo/10.180.213.139
Start Time:   Tue, 03 Dec 2019 16:29:50 -0600
Labels:       juju-app=katib-manager
              pod-template-hash=5d6cc65b8c
Annotations:  apparmor.security.beta.kubernetes.io/pod: runtime/default
              juju.io/controller: e1a8ad96-a251-47a5-8c33-3b8a8fe66bbd
              juju.io/model: b0ee3e19-b70e-49a4-81c4-fb9e4242b426
              juju.io/unit: katib-manager/0
              seccomp.security.beta.kubernetes.io/pod: docker/default
Status:       Running
IP:           10.1.44.71
IPs:
  IP:           10.1.44.71
Controlled By:  ReplicaSet/katib-manager-5d6cc65b8c
Init Containers:
  juju-pod-init:
    Container ID:  containerd://5f015946a8c76d221fba8ac787b67f2dac9581f9f79105469118ed2403a22c1b
    Image:         jujusolutions/jujud-operator:2.7.0
    Image ID:      docker.io/jujusolutions/jujud-operator@sha256:375eee66a4a7af6128cb84c32a94a1abeffa4f4872e063ba935296701776b5e5
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/sh
    Args:
      -c
      export JUJU_DATA_DIR=/var/lib/juju
      export JUJU_TOOLS_DIR=$JUJU_DATA_DIR/tools

      mkdir -p $JUJU_TOOLS_DIR
      cp /opt/jujud $JUJU_TOOLS_DIR/jujud
      initCmd=$($JUJU_TOOLS_DIR/jujud help commands | grep caas-unit-init)
      if test -n "$initCmd"; then
      $JUJU_TOOLS_DIR/jujud caas-unit-init --debug --wait;
      else
      exit 0
      fi

    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Tue, 03 Dec 2019 16:30:15 -0600
      Finished:     Tue, 03 Dec 2019 16:31:38 -0600
    Ready:          True
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /var/lib/juju from juju-data-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-j9wx5 (ro)
Containers:
  katib-manager:
    Container ID:  containerd://0e46f446f24c1c64d109308fd4b64fcbd348462993ceb8493be4bc7c4d2ca1af
    Image:         registry.jujucharms.com/kubeflow-charmers/katib-manager/oci-image@sha256:28dddef61f71a8e8de0999c67ec60c38d2c1a91d6e24b96ec1e5ba4401add07e
    Image ID:      registry.jujucharms.com/kubeflow-charmers/katib-manager/oci-image@sha256:28dddef61f71a8e8de0999c67ec60c38d2c1a91d6e24b96ec1e5ba4401add07e
    Port:          6789/TCP
    Host Port:     0/TCP
    Command:
      ./katib-manager
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    2
      Started:      Tue, 03 Dec 2019 17:20:48 -0600
      Finished:     Tue, 03 Dec 2019 17:21:27 -0600
    Ready:          False
    Restart Count:  15
    Liveness:       exec [/bin/grpc_health_probe -addr=:6789] delay=10s timeout=1s period=10s #success=1 #failure=3
    Readiness:      exec [/bin/grpc_health_probe -addr=:6789] delay=5s timeout=1s period=60s #success=1 #failure=5
    Environment:
      DB_NAME:      mysql
      DB_PASSWORD:  TW6VQALVDQ41NSQLF4EMG32D0F568T
      MYSQL_HOST:   10.152.183.67
      MYSQL_PORT:   3306
    Mounts:
      /usr/bin/juju-run from juju-data-dir (rw,path="tools/jujud")
      /var/lib/juju from juju-data-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-j9wx5 (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  juju-data-dir:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  default-token-j9wx5:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-j9wx5
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason     Age                   From               Message
  ----     ------     ----                  ----               -------
  Normal   Scheduled  <unknown>             default-scheduler  Successfully assigned kubeflow/katib-manager-5d6cc65b8c-vmhjc to ai-demo
  Normal   Pulled     54m                   kubelet, ai-demo   Container image "jujusolutions/jujud-operator:2.7.0" already present on machine
  Normal   Created    54m                   kubelet, ai-demo   Created container juju-pod-init
  Normal   Started    53m                   kubelet, ai-demo   Started container juju-pod-init
  Normal   Pulling    52m                   kubelet, ai-demo   Pulling image "registry.jujucharms.com/kubeflow-charmers/katib-manager/oci-image@sha256:28dddef61f71a8e8de0999c67ec60c38d2c1a91d6e24b96ec1e5ba4401add07e"
  Normal   Pulled     38m                   kubelet, ai-demo   Successfully pulled image "registry.jujucharms.com/kubeflow-charmers/katib-manager/oci-image@sha256:28dddef61f71a8e8de0999c67ec60c38d2c1a91d6e24b96ec1e5ba4401add07e"
  Warning  Unhealthy  36m (x2 over 37m)     kubelet, ai-demo   Readiness probe failed: timeout: failed to connect service ":6789" within 1s
  Normal   Created    36m (x3 over 38m)     kubelet, ai-demo   Created container katib-manager
  Normal   Started    36m (x3 over 38m)     kubelet, ai-demo   Started container katib-manager
  Warning  Unhealthy  34m (x17 over 37m)    kubelet, ai-demo   Liveness probe failed: timeout: failed to connect service ":6789" within 1s
  Warning  BackOff    13m (x78 over 33m)    kubelet, ai-demo   Back-off restarting failed container
  Normal   Killing    9m1s (x14 over 37m)   kubelet, ai-demo   Container katib-manager failed liveness probe, will be restarted
  Normal   Pulled     3m58s (x14 over 37m)  kubelet, ai-demo   Container image "registry.jujucharms.com/kubeflow-charmers/katib-manager/oci-image@sha256:28dddef61f71a8e8de0999c67ec60c38d2c1a91d6e24b96ec1e5ba4401add07e" already present on machine
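
Running the same health check the kubelet uses by hand should confirm whether the port is serving at all (the probe command and port are copied from the pod spec above):

m exec katib-manager-5d6cc65b8c-vmhjc -n kubeflow -- /bin/grpc_health_probe -addr=:6789

If this times out as well, the manager process itself isn't serving, rather than the probe being misconfigured.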

The second is the metadata-controller pod, which fails with an error when connecting to the MariaDB server:

ubuntu@ai-demo:~/bundle-kubeflow$ m describe pod metadata-controller-7f94875696-s24tr -n kubeflow
Name:         metadata-controller-7f94875696-s24tr
Namespace:    kubeflow
Priority:     0
Node:         ai-demo/10.180.213.139
Start Time:   Tue, 03 Dec 2019 16:30:26 -0600
Labels:       juju-app=metadata-controller
              pod-template-hash=7f94875696
Annotations:  apparmor.security.beta.kubernetes.io/pod: runtime/default
              juju.io/controller: e1a8ad96-a251-47a5-8c33-3b8a8fe66bbd
              juju.io/model: b0ee3e19-b70e-49a4-81c4-fb9e4242b426
              juju.io/unit: metadata-controller/0
              seccomp.security.beta.kubernetes.io/pod: docker/default
Status:       Running
IP:           10.1.44.72
IPs:
  IP:           10.1.44.72
Controlled By:  ReplicaSet/metadata-controller-7f94875696
Init Containers:
  juju-pod-init:
    Container ID:  containerd://13c5bfd0b5ce2d9d45b281a70f11aa5eb5c99aabe4a6360c629ac901de7861ef
    Image:         jujusolutions/jujud-operator:2.7.0
    Image ID:      docker.io/jujusolutions/jujud-operator@sha256:375eee66a4a7af6128cb84c32a94a1abeffa4f4872e063ba935296701776b5e5
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/sh
    Args:
      -c
      export JUJU_DATA_DIR=/var/lib/juju
      export JUJU_TOOLS_DIR=$JUJU_DATA_DIR/tools

      mkdir -p $JUJU_TOOLS_DIR
      cp /opt/jujud $JUJU_TOOLS_DIR/jujud
      initCmd=$($JUJU_TOOLS_DIR/jujud help commands | grep caas-unit-init)
      if test -n "$initCmd"; then
      $JUJU_TOOLS_DIR/jujud caas-unit-init --debug --wait;
      else
      exit 0
      fi

    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Tue, 03 Dec 2019 16:30:44 -0600
      Finished:     Tue, 03 Dec 2019 16:32:55 -0600
    Ready:          True
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /var/lib/juju from juju-data-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-j9wx5 (ro)
Containers:
  metadata:
    Container ID:  containerd://a2fc9241a7dae1bf22bb3285e71541eca020c58fec928c8aa24a39073e07635e
    Image:         registry.jujucharms.com/kubeflow-charmers/metadata-controller/oci-image@sha256:f2a0756e9c41f10cbd178e420e37ef0aaa5d60bbed34300a66b1c99745838d36
    Image ID:      registry.jujucharms.com/kubeflow-charmers/metadata-controller/oci-image@sha256:f2a0756e9c41f10cbd178e420e37ef0aaa5d60bbed34300a66b1c99745838d36
    Port:          8080/TCP
    Host Port:     0/TCP
    Command:
      ./server/server
      --http_port=8080
      --mysql_service_host=10.152.183.188
      --mysql_service_port=3306
      --mysql_service_user=root
      --mysql_service_password=root
      --mlmd_db_name=metadb
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    255
      Started:      Wed, 04 Dec 2019 08:51:20 -0600
      Finished:     Wed, 04 Dec 2019 08:51:20 -0600
    Ready:          False
    Restart Count:  188
    Environment:
      MYSQL_ROOT_PASSWORD:  root
    Mounts:
      /usr/bin/juju-run from juju-data-dir (rw,path="tools/jujud")
      /var/lib/juju from juju-data-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-j9wx5 (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  juju-data-dir:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  default-token-j9wx5:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-j9wx5
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason   Age                   From              Message
  ----     ------   ----                  ----              -------
  Normal   Pulled   21m (x184 over 15h)   kubelet, ai-demo  Container image "registry.jujucharms.com/kubeflow-charmers/metadata-controller/oci-image@sha256:f2a0756e9c41f10cbd178e420e37ef0aaa5d60bbed34300a66b1c99745838d36" already present on machine
  Warning  BackOff  85s (x4287 over 15h)  kubelet, ai-demo  Back-off restarting failed container
ubuntu@ai-demo:~/bundle-kubeflow$
ubuntu@ai-demo:~/bundle-kubeflow$ m logs metadata-controller-7f94875696-s24tr -n kubeflow
F1204 14:51:20.776472       1 main.go:90] Failed to create ML Metadata Store with config mysql:<host:"10.152.183.188" port:3306 database:"metadb" user:"root" password:"root" > : mysql_real_connect failed: errno: 1130, error: Host '10.1.44.1' is not allowed to connect to this MariaDB server.
goroutine 1 [running]:
github.com/golang/glog.stacks(0xc000135100, 0xc0001b8000, 0x124, 0x17a)
    external/com_github_golang_glog/glog.go:769 +0xb1
github.com/golang/glog.(*loggingT).output(0x1633360, 0xc000000003, 0xc0001af2d0, 0x14eadd3, 0x7, 0x5a, 0x0)
    external/com_github_golang_glog/glog.go:720 +0x2f6
github.com/golang/glog.(*loggingT).printf(0x1633360, 0x3, 0xf6fee1, 0x37, 0xc00019be30, 0x2, 0x2)
    external/com_github_golang_glog/glog.go:655 +0x14e
github.com/golang/glog.Fatalf(...)
    external/com_github_golang_glog/glog.go:1148
main.mlmdStoreOrDie(0x0)
    server/main.go:90 +0x1c3
main.main()
    server/main.go:101 +0xe0
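
Errno 1130 means MariaDB is rejecting the client host (10.1.44.1, likely the host's CNI bridge address), not the credentials. As a manual check and workaround, the user table can be inspected and access granted from inside the DB pod (a sketch: it assumes the workload container is named metadata-db and the root password is root, per the pod spec above; the proper fix belongs in the charm):

m exec -it metadata-db-0 -n kubeflow -c metadata-db -- mysql -u root -proot -e "SELECT host, user FROM mysql.user;"
m exec -it metadata-db-0 -n kubeflow -c metadata-db -- mysql -u root -proot -e "GRANT ALL ON metadb.* TO 'root'@'%' IDENTIFIED BY 'root'; FLUSH PRIVILEGES;"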

The third is the modeldb-backend pod:

ubuntu@ai-demo:~/bundle-kubeflow$ m describe pod modeldb-backend-797f77c488-9vrkz -n kubeflow
Name:         modeldb-backend-797f77c488-9vrkz
Namespace:    kubeflow
Priority:     0
Node:         ai-demo/10.180.213.139
Start Time:   Tue, 03 Dec 2019 16:33:37 -0600
Labels:       juju-app=modeldb-backend
              pod-template-hash=797f77c488
Annotations:  apparmor.security.beta.kubernetes.io/pod: runtime/default
              juju.io/controller: e1a8ad96-a251-47a5-8c33-3b8a8fe66bbd
              juju.io/model: b0ee3e19-b70e-49a4-81c4-fb9e4242b426
              juju.io/unit: modeldb-backend/0
              seccomp.security.beta.kubernetes.io/pod: docker/default
Status:       Running
IP:           10.1.44.77
IPs:
  IP:           10.1.44.77
Controlled By:  ReplicaSet/modeldb-backend-797f77c488
Init Containers:
  juju-pod-init:
    Container ID:  containerd://548ecc057d06bcf349b64d920e905676696137fd44bcc43cc152ead0c69f1f16
    Image:         jujusolutions/jujud-operator:2.7.0
    Image ID:      docker.io/jujusolutions/jujud-operator@sha256:375eee66a4a7af6128cb84c32a94a1abeffa4f4872e063ba935296701776b5e5
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/sh
    Args:
      -c
      export JUJU_DATA_DIR=/var/lib/juju
      export JUJU_TOOLS_DIR=$JUJU_DATA_DIR/tools

      mkdir -p $JUJU_TOOLS_DIR
      cp /opt/jujud $JUJU_TOOLS_DIR/jujud
      initCmd=$($JUJU_TOOLS_DIR/jujud help commands | grep caas-unit-init)
      if test -n "$initCmd"; then
      $JUJU_TOOLS_DIR/jujud caas-unit-init --debug --wait;
      else
      exit 0
      fi

    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Tue, 03 Dec 2019 16:33:40 -0600
      Finished:     Tue, 03 Dec 2019 16:34:42 -0600
    Ready:          True
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /var/lib/juju from juju-data-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-j9wx5 (ro)
Containers:
  modeldb-backend:
    Container ID:  containerd://3a4251c4c8357c53159935efd7346a982500c6e4c05596cd28f65956687d7e99
    Image:         registry.jujucharms.com/kubeflow-charmers/modeldb-backend/oci-image@sha256:67e70b991598fe8fca12058e2cee1abc342ab26a0047ec4779cb6d8483d87161
    Image ID:      registry.jujucharms.com/kubeflow-charmers/modeldb-backend/oci-image@sha256:67e70b991598fe8fca12058e2cee1abc342ab26a0047ec4779cb6d8483d87161
    Port:          8085/TCP
    Host Port:     0/TCP
    Command:
      bash
    Args:
      -c
      ./wait-for-it.sh 10.152.183.242:3306 --timeout=10 && java -jar modeldb-1.0-SNAPSHOT-client-build.jar 
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Wed, 04 Dec 2019 08:52:30 -0600
      Finished:     Wed, 04 Dec 2019 08:52:36 -0600
    Ready:          False
    Restart Count:  182
    Environment:
      VERTA_MODELDB_CONFIG:  /config-backend/config.yaml
    Mounts:
      /config-backend/ from modeldb-backend-config-config (rw)
      /usr/bin/juju-run from juju-data-dir (rw,path="tools/jujud")
      /var/lib/juju from juju-data-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-j9wx5 (ro)
  modeldb-backend-proxy:
    Container ID:  containerd://bbe3449a4f54917085882a13b587f6446d9fe639a917b2639b352a750f3664af
    Image:         vertaaiofficial/modeldb-backend-proxy:kubeflow
    Image ID:      docker.io/vertaaiofficial/modeldb-backend-proxy@sha256:5e21c2f82df9b05f7309772dd2be946a8ef24ba43bd2579aa6af22c4827c9205
    Port:          8080/TCP
    Host Port:     0/TCP
    Command:
      /go/bin/proxy
    Args:
      -project_endpoint
      localhost:8085
      -experiment_endpoint
      localhost:8085
      -experiment_run_endpoint
      localhost:8085
    State:          Running
      Started:      Tue, 03 Dec 2019 17:14:40 -0600
    Ready:          True
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /usr/bin/juju-run from juju-data-dir (rw,path="tools/jujud")
      /var/lib/juju from juju-data-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-j9wx5 (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  juju-data-dir:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  modeldb-backend-config-config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      modeldb-backend-config-config
    Optional:  false
  default-token-j9wx5:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-j9wx5
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason   Age                     From              Message
  ----     ------   ----                    ----              -------
  Warning  BackOff  4m42s (x4143 over 15h)  kubelet, ai-demo  Back-off restarting failed container

ubuntu@ai-demo:~/bundle-kubeflow$ m logs modeldb-backend-797f77c488-9vrkz -n kubeflow
Error from server (BadRequest): a container name must be specified for pod modeldb-backend-797f77c488-9vrkz, choose one of: [modeldb-backend modeldb-backend-proxy] or one of the init containers: [juju-pod-init]
ubuntu@ai-demo:~/bundle-kubeflow$
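
That error is standard kubectl behaviour for multi-container pods; passing the container name with -c retrieves the backend's own logs:

m logs modeldb-backend-797f77c488-9vrkz -n kubeflow -c modeldb-backend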

And the last one is the pipelines-api pod:

ubuntu@ai-demo:~/bundle-kubeflow$ m describe pod pipelines-api-6c6f459c98-q2grc -n kubeflow
Name:         pipelines-api-6c6f459c98-q2grc
Namespace:    kubeflow
Priority:     0
Node:         ai-demo/10.180.213.139
Start Time:   Tue, 03 Dec 2019 16:33:54 -0600
Labels:       juju-app=pipelines-api
              pod-template-hash=6c6f459c98
Annotations:  apparmor.security.beta.kubernetes.io/pod: runtime/default
              juju.io/controller: e1a8ad96-a251-47a5-8c33-3b8a8fe66bbd
              juju.io/model: b0ee3e19-b70e-49a4-81c4-fb9e4242b426
              juju.io/unit: pipelines-api/0
              seccomp.security.beta.kubernetes.io/pod: docker/default
Status:       Running
IP:           10.1.44.78
IPs:
  IP:           10.1.44.78
Controlled By:  ReplicaSet/pipelines-api-6c6f459c98
Init Containers:
  juju-pod-init:
    Container ID:  containerd://22633e8e541e6bfcfe78f6de032dac424267c956275f0d2748b4a816a9e99266
    Image:         jujusolutions/jujud-operator:2.7.0
    Image ID:      docker.io/jujusolutions/jujud-operator@sha256:375eee66a4a7af6128cb84c32a94a1abeffa4f4872e063ba935296701776b5e5
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/sh
    Args:
      -c
      export JUJU_DATA_DIR=/var/lib/juju
      export JUJU_TOOLS_DIR=$JUJU_DATA_DIR/tools

      mkdir -p $JUJU_TOOLS_DIR
      cp /opt/jujud $JUJU_TOOLS_DIR/jujud
      initCmd=$($JUJU_TOOLS_DIR/jujud help commands | grep caas-unit-init)
      if test -n "$initCmd"; then
      $JUJU_TOOLS_DIR/jujud caas-unit-init --debug --wait;
      else
      exit 0
      fi

    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Tue, 03 Dec 2019 16:33:56 -0600
      Finished:     Tue, 03 Dec 2019 16:34:43 -0600
    Ready:          True
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /var/lib/juju from juju-data-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from pipelines-api-token-tzmx5 (ro)
Containers:
  pipelines-api:
    Container ID:   containerd://6b215518d4010e935e62b580d49d774dc59a95437f490fe4b7c5ecb1e0c30eb9
    Image:          registry.jujucharms.com/kubeflow-charmers/pipelines-api/oci-image@sha256:9ce417ed6e5a4c2ba2804d4a2694542b8a0cfc50e7a2cc9a0e08053cd06a41d8
    Image ID:       registry.jujucharms.com/kubeflow-charmers/pipelines-api/oci-image@sha256:9ce417ed6e5a4c2ba2804d4a2694542b8a0cfc50e7a2cc9a0e08053cd06a41d8
    Ports:          8887/TCP, 8888/TCP
    Host Ports:     0/TCP, 0/TCP
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    255
      Started:      Wed, 04 Dec 2019 08:53:44 -0600
      Finished:     Wed, 04 Dec 2019 08:53:52 -0600
    Ready:          False
    Restart Count:  188
    Environment:
      MINIO_SERVICE_SERVICE_HOST:  minio
      MINIO_SERVICE_SERVICE_PORT:  9000
      MYSQL_SERVICE_HOST:          10.152.183.128
      MYSQL_SERVICE_PORT:          3306
      POD_NAMESPACE:               kubeflow
    Mounts:
      /config from pipelines-api-config-config (rw)
      /samples from pipelines-api-samples-config (rw)
      /usr/bin/juju-run from juju-data-dir (rw,path="tools/jujud")
      /var/lib/juju from juju-data-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from pipelines-api-token-tzmx5 (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  juju-data-dir:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  pipelines-api-config-config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      pipelines-api-config-config
    Optional:  false
  pipelines-api-samples-config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      pipelines-api-samples-config
    Optional:  false
  pipelines-api-token-tzmx5:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  pipelines-api-token-tzmx5
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason   Age                   From              Message
  ----     ------   ----                  ----              -------
  Warning  BackOff  22s (x4287 over 16h)  kubelet, ai-demo  Back-off restarting failed container

ubuntu@ai-demo:~/bundle-kubeflow$ m logs pipelines-api-6c6f459c98-q2grc -n kubeflow
I1204 14:53:44.834007       6 client_manager.go:123] Initializing client manager
F1204 14:53:52.940333       6 error.go:296] commands out of sync. Did you run multiple statements at once?
goroutine 1 [running]:
github.com/golang/glog.stacks(0xc000199000, 0xc0001dcc80, 0x6b, 0x9b)
    external/com_github_golang_glog/glog.go:769 +0xd4
github.com/golang/glog.(*loggingT).output(0x2962840, 0xc000000003, 0xc0001ef130, 0x2844815, 0x8, 0x128, 0x0)
    external/com_github_golang_glog/glog.go:720 +0x329
github.com/golang/glog.(*loggingT).printf(0x2962840, 0x3, 0x1a7dc3d, 0x2, 0xc0006139b0, 0x1, 0x1)
    external/com_github_golang_glog/glog.go:655 +0x14b
github.com/golang/glog.Fatalf(0x1a7dc3d, 0x2, 0xc0006139b0, 0x1, 0x1)
    external/com_github_golang_glog/glog.go:1148 +0x67
github.com/kubeflow/pipelines/backend/src/common/util.TerminateIfError(0x1c2fb80, 0xc000442830)
    backend/src/common/util/error.go:296 +0x79
main.initMysql(0xc00035f64a, 0x5, 0x12a05f200, 0x0, 0x0)
    backend/src/apiserver/client_manager.go:260 +0x37d
main.initDBClient(0x12a05f200, 0x15)
    backend/src/apiserver/client_manager.go:190 +0x5c0
main.(*ClientManager).init(0xc000613cd8)
    backend/src/apiserver/client_manager.go:125 +0x80
main.newClientManager(0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
    backend/src/apiserver/client_manager.go:300 +0x7b
main.main()
    backend/src/apiserver/main.go:56 +0x5e
ubuntu@ai-demo:~/bundle-kubeflow$ 
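
The "commands out of sync" failure comes from the MySQL client layer during API-server start-up. To check whether the backing database is at least reachable and intact, it can be queried directly (a sketch; it assumes the workload container is named pipelines-db, and it will prompt for the root password):

m exec -it pipelines-db-0 -n kubeflow -c pipelines-db -- mysql -u root -p -e "SHOW DATABASES;"
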
knkski commented 4 years ago

It looks like this is related to running inside of Multipass. I'm able to reproduce it and am investigating why.

rlig commented 4 years ago

It still does not work, even with 20 GB of RAM.

knkski commented 4 years ago

@rlig: If this is on microk8s, can you run microk8s.inspect and open an issue over at https://github.com/ubuntu/microk8s/?