canonical / bundle-kubeflow

Charmed Kubeflow

pipelines-api stuck in CrashLoopBackOff #185

Closed genet022 closed 4 years ago

genet022 commented 4 years ago

We're bootstrapping and deploying to a k8s cluster on AWS using Juju 2.7.4. I then deploy Kubeflow with juju deploy cs:bundle/kubeflow. Juju reports all units as 'active', but the pipelines-api application stays in 'waiting', and kubectl shows the pipelines-api pod in CrashLoopBackOff. Specifically, it fails to create the Minio bucket.

I'm using 8CPU, 16GB memory, 100GB storage for my 3 kubernetes-workers.

Here's some output from kubectl:

ubuntu@aws-cpe:~$ kubectl describe pod pipelines-api-5bd7b89ff8-6b4x6 --namespace controller-foundations-k8s
Name:         pipelines-api-5bd7b89ff8-6b4x6
Namespace:    controller-foundations-k8s
Priority:     0
Node:         ip-172-31-46-57.ec2.internal/172.31.46.57
Start Time:   Thu, 19 Mar 2020 17:00:41 +0000
Labels:       juju-app=pipelines-api
              pod-template-hash=5bd7b89ff8
Annotations:  apparmor.security.beta.kubernetes.io/pod: runtime/default
              juju.io/controller: 1763004b-11b9-453c-84d5-35efe0dbbc5a
              juju.io/model: 96868d80-461e-4110-862f-2803a36513d6
              juju.io/unit: pipelines-api/0
              seccomp.security.beta.kubernetes.io/pod: docker/default
Status:       Running
IP:           10.1.26.24
IPs:
  IP:           10.1.26.24
Controlled By:  ReplicaSet/pipelines-api-5bd7b89ff8
Init Containers:
  juju-pod-init:
    Container ID:  containerd://a841de81fb7cbdad37f3244e5057849f3225d8e666455c9e45e01b961a32635d
    Image:         jujusolutions/jujud-operator:2.7.4
    Image ID:      docker.io/jujusolutions/jujud-operator@sha256:aaa1920ddf1eeddf4bf1a2f5bd9cdbc452ad693393ef3dea0ec863ba92927167
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/sh
    Args:
      -c
      export JUJU_DATA_DIR=/var/lib/juju
      export JUJU_TOOLS_DIR=$JUJU_DATA_DIR/tools

      mkdir -p $JUJU_TOOLS_DIR
      cp /opt/jujud $JUJU_TOOLS_DIR/jujud
      initCmd=$($JUJU_TOOLS_DIR/jujud help commands | grep caas-unit-init)
      if test -n "$initCmd"; then
      $JUJU_TOOLS_DIR/jujud caas-unit-init --debug --wait;
      else
      exit 0
      fi

    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Thu, 19 Mar 2020 17:00:44 +0000
      Finished:     Thu, 19 Mar 2020 17:01:15 +0000
    Ready:          True
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /var/lib/juju from juju-data-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from pipelines-api-token-m2qw2 (ro)
Containers:
  pipelines-api:
    Container ID:   containerd://ae998843e52a3942cd07f0b4d6ac54dadb4fb3f6c520e8d8c41bafe55c6e3433
    Image:          registry.jujucharms.com/kubeflow-charmers/pipelines-api/oci-image@sha256:9ce417ed6e5a4c2ba2804d4a2694542b8a0cfc50e7a2cc9a0e08053cd06a41d8
    Image ID:       registry.jujucharms.com/kubeflow-charmers/pipelines-api/oci-image@sha256:9ce417ed6e5a4c2ba2804d4a2694542b8a0cfc50e7a2cc9a0e08053cd06a41d8
    Ports:          8887/TCP, 8888/TCP
    Host Ports:     0/TCP, 0/TCP
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    255
      Started:      Thu, 19 Mar 2020 17:08:30 +0000
      Finished:     Thu, 19 Mar 2020 17:08:30 +0000
    Ready:          False
    Restart Count:  6
    Environment:
      MINIO_SERVICE_SERVICE_HOST:  minio
      MINIO_SERVICE_SERVICE_PORT:  9000
      MYSQL_SERVICE_HOST:          10.152.183.106
      MYSQL_SERVICE_PORT:          3306
      POD_NAMESPACE:               controller
    Mounts:
      /config from pipelines-api-config-config (rw)
      /samples from pipelines-api-samples-config (rw)
      /usr/bin/juju-run from juju-data-dir (rw,path="tools/jujud")
      /var/lib/juju from juju-data-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from pipelines-api-token-m2qw2 (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  juju-data-dir:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  pipelines-api-config-config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      pipelines-api-config-config
    Optional:  false
  pipelines-api-samples-config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      pipelines-api-samples-config
    Optional:  false
  pipelines-api-token-m2qw2:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  pipelines-api-token-m2qw2
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason     Age                    From                                   Message
  ----     ------     ----                   ----                                   -------
  Normal   Scheduled  <unknown>              default-scheduler                      Successfully assigned controller-foundations-k8s/pipelines-api-5bd7b89ff8-6b4x6 to ip-172-31-46-57.ec2.internal
  Normal   Pulled     10m                    kubelet, ip-172-31-46-57.ec2.internal  Container image "jujusolutions/jujud-operator:2.7.4" already present on machine
  Normal   Created    10m                    kubelet, ip-172-31-46-57.ec2.internal  Created container juju-pod-init
  Normal   Started    10m                    kubelet, ip-172-31-46-57.ec2.internal  Started container juju-pod-init
  Normal   Pulling    10m                    kubelet, ip-172-31-46-57.ec2.internal  Pulling image "registry.jujucharms.com/kubeflow-charmers/pipelines-api/oci-image@sha256:9ce417ed6e5a4c2ba2804d4a2694542b8a0cfc50e7a2cc9a0e08053cd06a41d8"
  Normal   Pulled     8m55s                  kubelet, ip-172-31-46-57.ec2.internal  Successfully pulled image "registry.jujucharms.com/kubeflow-charmers/pipelines-api/oci-image@sha256:9ce417ed6e5a4c2ba2804d4a2694542b8a0cfc50e7a2cc9a0e08053cd06a41d8"
  Normal   Created    8m (x4 over 8m55s)     kubelet, ip-172-31-46-57.ec2.internal  Created container pipelines-api
  Normal   Started    7m59s (x4 over 8m55s)  kubelet, ip-172-31-46-57.ec2.internal  Started container pipelines-api
  Normal   Pulled     5m42s (x5 over 8m53s)  kubelet, ip-172-31-46-57.ec2.internal  Container image "registry.jujucharms.com/kubeflow-charmers/pipelines-api/oci-image@sha256:9ce417ed6e5a4c2ba2804d4a2694542b8a0cfc50e7a2cc9a0e08053cd06a41d8" already present on machine
  Warning  BackOff    33s (x40 over 8m52s)   kubelet, ip-172-31-46-57.ec2.internal  Back-off restarting failed container
ubuntu@aws-cpe:~$ kubectl logs -p pipelines-api-5bd7b89ff8-6b4x6 --namespace controller-foundations-k8s
I0319 17:18:40.236397       6 client_manager.go:123] Initializing client manager
[mysql] 2020/03/19 17:18:40 packets.go:427: busy buffer
[mysql] 2020/03/19 17:18:40 packets.go:408: busy buffer
E0319 17:18:40.462220       6 default_experiment_store.go:73] Failed to commit transaction to initialize default experiment table
[mysql] 2020/03/19 17:18:40 packets.go:427: busy buffer
[mysql] 2020/03/19 17:18:40 packets.go:408: busy buffer
E0319 17:18:40.468786       6 db_status_store.go:71] Failed to commit transaction to initialize database status table
[mysql] 2020/03/19 17:18:40 packets.go:427: busy buffer
[mysql] 2020/03/19 17:18:40 packets.go:408: busy buffer
E0319 17:18:40.474776       6 default_experiment_store.go:73] Failed to commit transaction to initialize default experiment table
F0319 17:18:40.483332       6 client_manager.go:291] Failed to create Minio bucket. Error: <nil>
goroutine 1 [running]:
github.com/golang/glog.stacks(0xc00039a500, 0xc00056c500, 0x61, 0x9b)
    external/com_github_golang_glog/glog.go:769 +0xd4
github.com/golang/glog.(*loggingT).output(0x2962840, 0xc000000003, 0xc0003cf130, 0x2842f30, 0x11, 0x123, 0x0)
    external/com_github_golang_glog/glog.go:720 +0x329
github.com/golang/glog.(*loggingT).printf(0x2962840, 0xc000000003, 0x1aafa44, 0x28, 0xc00071fb00, 0x1, 0x1)
    external/com_github_golang_glog/glog.go:655 +0x14b
github.com/golang/glog.Fatalf(0x1aafa44, 0x28, 0xc00071fb00, 0x1, 0x1)
    external/com_github_golang_glog/glog.go:1148 +0x67
main.createMinioBucket(0xc0000ee820, 0xc000468960, 0xa)
    backend/src/apiserver/client_manager.go:291 +0x221
main.initMinioClient(0x12a05f200, 0x15, 0x12a05f200)
    backend/src/apiserver/client_manager.go:277 +0x1e5
main.(*ClientManager).init(0xc00071fcd8)
    backend/src/apiserver/client_manager.go:140 +0x3de
main.newClientManager(0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
    backend/src/apiserver/client_manager.go:300 +0x7b
main.main()
    backend/src/apiserver/main.go:56 +0x5e
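
The same check can be scripted against the JSON form of the pod (kubectl get pod ... -o json). The following is a minimal sketch, not part of the original report: the function name crash_looping_containers is hypothetical, and the sample dictionary is hand-written to mirror the describe output above rather than captured from a real cluster. The field names (containerStatuses, state.waiting.reason, lastState.terminated.exitCode, restartCount) are those of the Kubernetes Pod status schema.

```python
from typing import Any, Dict, List


def crash_looping_containers(pod: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return one summary per container currently waiting in CrashLoopBackOff."""
    results = []
    for status in pod.get("status", {}).get("containerStatuses", []):
        waiting = status.get("state", {}).get("waiting") or {}
        if waiting.get("reason") != "CrashLoopBackOff":
            continue
        terminated = status.get("lastState", {}).get("terminated") or {}
        results.append(
            {
                "name": status.get("name"),
                "restarts": status.get("restartCount", 0),
                "last_exit_code": terminated.get("exitCode"),
            }
        )
    return results


# Hand-written sample mirroring the describe output above (an assumption,
# not real kubectl output).
sample_pod = {
    "status": {
        "containerStatuses": [
            {
                "name": "pipelines-api",
                "restartCount": 6,
                "state": {"waiting": {"reason": "CrashLoopBackOff"}},
                "lastState": {"terminated": {"reason": "Error", "exitCode": 255}},
            }
        ]
    }
}

for item in crash_looping_containers(sample_pod):
    print(item)
```

A Juju status of 'active' for the units does not imply the workload container is healthy, so a check like this looks at the pod status directly.
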
knkski commented 4 years ago

A bare juju deploy cs:bundle/kubeflow won't work, unfortunately. You'll need to either use the script from this repo, or manually run the same steps yourself:

If you're using the stable kubeflow bundle: https://github.com/juju-solutions/bundle-kubeflow/blob/e9a2775/scripts/cli.py

If you're using the edge kubeflow bundle: https://github.com/juju-solutions/bundle-kubeflow/blob/master/scripts/cli.py

genet022 commented 4 years ago

I'm manually running the same steps and getting the same issue. Here's exactly what I'm doing:

juju add-model kubeflow
juju deploy -m kubeflow kubeflow --channel stable
juju wait -m kubeflow
# pub_addr = public address of kubernetes-worker/0 (substituted for {pub_addr} below)
juju config -m kubeflow ambassador juju-external-hostname={pub_addr}
juju expose -m kubeflow ambassador
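
The steps above can be sketched as a small Python helper that fills in the {pub_addr} placeholder. This is only an illustration of the commands exactly as listed in this comment; the function name deploy_commands and the example address 203.0.113.10 are hypothetical, and the helper only builds the command lines without running them.

```python
import shlex
from typing import List


def deploy_commands(pub_addr: str, model: str = "kubeflow",
                    channel: str = "stable") -> List[List[str]]:
    """Build the juju commands listed above, with pub_addr substituted."""
    return [
        ["juju", "add-model", model],
        ["juju", "deploy", "-m", model, "kubeflow", "--channel", channel],
        ["juju", "wait", "-m", model],
        ["juju", "config", "-m", model, "ambassador",
         f"juju-external-hostname={pub_addr}"],
        ["juju", "expose", "-m", model, "ambassador"],
    ]


# 203.0.113.10 is a documentation-only example address (an assumption).
for cmd in deploy_commands("203.0.113.10"):
    print(shlex.join(cmd))
```
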

Then I'm applying an Ingress:

# https://bugs.launchpad.net/juju/+bug/1849725
# create ambassador_ingress.yaml
kind: Ingress
apiVersion: extensions/v1beta1
metadata: 
  name: ambassador
  namespace: kubeflow
spec:
  backend:
    serviceName: ingresssvc
    servicePort: 80
  tls:
  - hosts: 
    - {pub_addr}
    secretName: ambassador-tls

kubectl apply -n kubeflow -f ambassador_ingress.yaml
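
Since the manifest above contains a {pub_addr} placeholder, it has to be rendered before kubectl can apply it. A minimal sketch of that rendering step follows; the function name render_ingress and the example address are hypothetical, and the template text is copied from the manifest above.

```python
INGRESS_TEMPLATE = """\
kind: Ingress
apiVersion: extensions/v1beta1
metadata:
  name: ambassador
  namespace: kubeflow
spec:
  backend:
    serviceName: ingresssvc
    servicePort: 80
  tls:
  - hosts:
    - {pub_addr}
    secretName: ambassador-tls
"""


def render_ingress(pub_addr: str) -> str:
    """Substitute the public address into the Ingress manifest."""
    if not pub_addr or any(ch.isspace() for ch in pub_addr):
        raise ValueError("pub_addr must be a non-empty host name or address")
    return INGRESS_TEMPLATE.format(pub_addr=pub_addr)


# 203.0.113.10 is a documentation-only example address (an assumption).
print(render_ingress("203.0.113.10"))
```

The printed text can be saved as ambassador_ingress.yaml and applied with the kubectl command shown above.
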

I'm curious whether this is related to the following issue. Thoughts? https://github.com/kubeflow/pipelines/issues/3098

knkski commented 4 years ago

The only thing I see that's missing is that you'll need to add an overlay that looks like this, and pass it to the juju deploy command:

https://github.com/juju-solutions/bundle-kubeflow/blob/e9a2775/scripts/cli.py#L220-L228

You'll need to fill in the random values yourself (any random value should work).
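
A minimal sketch of generating such an overlay with random values follows. The application name pipelines-db and the option name root_password are illustrative assumptions, not taken from this thread; the linked cli.py lines are the authoritative list of applications and options. The helper names are hypothetical as well. The overlay is written as JSON only to stay within the standard library; YAML parsers generally accept JSON, but that was not verified against Juju here.

```python
import json
import secrets
from typing import Any, Dict, Iterable


def random_pass(nbytes: int = 24) -> str:
    """Return a URL-safe random string to use as a throwaway password."""
    return secrets.token_urlsafe(nbytes)


def build_overlay(apps: Iterable[str],
                  option: str = "root_password") -> Dict[str, Any]:
    """Build a bundle overlay that sets one random option per application."""
    return {
        "applications": {
            app: {"options": {option: random_pass()}} for app in apps
        }
    }


def write_overlay(path: str, overlay: Dict[str, Any]) -> None:
    """Write the overlay to disk as JSON text."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(overlay, fh, indent=2)
        fh.write("\n")


# "pipelines-db" and "root_password" are assumptions for illustration only.
overlay = build_overlay(["pipelines-db"])
print(json.dumps(overlay, indent=2))
```

After write_overlay("overlay.yaml", overlay), the file could be passed to the deploy step with juju deploy -m kubeflow kubeflow --channel stable --overlay overlay.yaml.
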

genet022 commented 4 years ago

That did it. Thanks again for the help! Closing the issue.