aws-samples / amazon-eks-machine-learning-with-terraform-and-kubeflow

Distributed training using Kubeflow on Amazon EKS
Apache License 2.0

jupyter pod does not work #10

Closed oonisim closed 3 years ago

oonisim commented 4 years ago

The image value in charts/maskrcnn/charts/jupyter/values.yaml needs to be specified; otherwise the install instructions fail:

$ helm install --debug maskrcnn ./maskrcnn/
install.go:159: [debug] Original chart version: ""
install.go:176: [debug] CHART PATH:$HOME/amazon-eks-machine-learning-with-terraform-and-kubeflow/charts/maskrcnn

client.go:108: [debug] creating 9 resource(s)
Error: Deployment.apps "jupyter" is invalid: spec.template.spec.containers[0].image: Required value
helm.go:84: [debug] Deployment.apps "jupyter" is invalid: spec.template.spec.containers[0].image: Required value
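
For reference, one way to supply the missing value without editing the file is to override it at install time. This is only a sketch: it assumes the subchart is named `jupyter` and that its values.yaml exposes a top-level `image` key (neither is confirmed here), and `install_maskrcnn` is a hypothetical helper name.

```shell
# Hypothetical helper: install the maskrcnn chart with the jupyter image set.
# Assumes the subchart is named "jupyter" and reads a top-level "image" value.
install_maskrcnn() {
  image="$1"
  if [ -z "$image" ]; then
    echo "usage: install_maskrcnn <image-uri>" >&2
    return 2
  fi
  # Values set under the subchart's name in the parent chart are passed
  # down to that subchart, so jupyter.image maps to "image" in its values.
  helm install --debug maskrcnn ./maskrcnn/ --set "jupyter.image=$image"
}
```

Run it from the charts directory with the image URI as the only argument. This only gets past the "Required value" validation error; it does not guarantee that the chosen image can actually run Jupyter.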

However, specifying the same Docker image (mask-rcnn-tensorpack) does not work either. The container fails to start with "OCI runtime create failed: container_linux.go:345: starting container process caused "exec: \"jupyter\": executable file not found in $PATH"":

$ kubectl describe -n kubeflow pod/jupyter-8c746888b-bg2sz
Name:           jupyter-8c746888b-bg2sz
Namespace:      kubeflow
Priority:       0
Node:           ip-192-168-75-16.ap-southeast-2.compute.internal/192.168.75.16
Start Time:     Fri, 24 Jul 2020 15:44:03 +1000
Labels:         app=jupyter
                pod-template-hash=8c746888b
Annotations:    kubernetes.io/psp: eks.privileged
Status:         Running
IP:             192.168.94.236
IPs:            <none>
Controlled By:  ReplicaSet/jupyter-8c746888b
Containers:
  jupyter:
    Container ID:  docker://792268d3ed1a84e133dfeeea992113db5b9f530dbd61465d6731cd84e55d941b
    Image:         103365315157.dkr.ecr.ap-southeast-2.amazonaws.com/mask-rcnn-tensorpack:tf1.14-tp4ac2e22
    Image ID:      docker-pullable://103365315157.dkr.ecr.ap-southeast-2.amazonaws.com/mask-rcnn-tensorpack@sha256:b7e9455ef88def70fc9cd2d988da5aad9f4a4a92aab37eadc0ab9acc73311fea
    Port:          8888/TCP
    Host Port:     0/TCP
    Command:
      jupyter
    Args:
      lab
      --allow-root
      --no-browser
      --ip=0.0.0.0
      --certfile=/labs-cert.pem
      --keyfile=/labs-key.key
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       ContainerCannotRun
      Message:      OCI runtime create failed: container_linux.go:345: starting container process caused "exec: \"jupyter\": executable file not found in $PATH": unknown
      Exit Code:    127
      Started:      Fri, 24 Jul 2020 15:47:04 +1000
      Finished:     Fri, 24 Jul 2020 15:47:04 +1000
    Ready:          False
    Restart Count:  5
    Environment:    <none>
    Mounts:
      /efs from efs (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-v2222 (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  efs:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  tensorpack-efs-gp-bursting
    ReadOnly:   false
  default-token-v2222:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-v2222
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason     Age                    From                                                       Message
  ----     ------     ----                   ----                                                       -------
  Normal   Scheduled  4m56s                  default-scheduler                                          Successfully assigned kubeflow/jupyter-8c746888b-bg2sz to ip-192-168-75-16.ap-southeast-2.compute.internal
  Warning  Failed     4m10s (x4 over 4m54s)  kubelet, ip-192-168-75-16.ap-southeast-2.compute.internal  Error: failed to start container "jupyter": Error response from daemon: OCI runtime create failed: container_linux.go:345: starting container process caused "exec: \"jupyter\": executable file not found in $PATH": unknown
  Warning  BackOff    3m31s (x6 over 4m51s)  kubelet, ip-192-168-75-16.ap-southeast-2.compute.internal  Back-off restarting failed container
  Normal   Pulling    3m17s (x5 over 4m55s)  kubelet, ip-192-168-75-16.ap-southeast-2.compute.internal  Pulling image "103365315157.dkr.ecr.ap-southeast-2.amazonaws.com/mask-rcnn-tensorpack:tf1.14-tp4ac2e22"
  Normal   Pulled     3m17s (x5 over 4m55s)  kubelet, ip-192-168-75-16.ap-southeast-2.compute.internal  Successfully pulled image "103365315157.dkr.ecr.ap-southeast-2.amazonaws.com/mask-rcnn-tensorpack:tf1.14-tp4ac2e22"
  Normal   Created    3m17s (x5 over 4m55s)  kubelet, ip-192-168-75-16.ap-southeast-2.compute.internal  Created container jupyter

I suppose a Jupyter notebook image is required, since the mask-rcnn-tensorpack image does not have a jupyter executable on its $PATH.
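
A quick way to confirm this before deploying might be to look for the executable inside the image directly. The sketch below assumes Docker is available locally, that the image can be pulled, and that it ships a POSIX `sh`; `image_has_executable` is a hypothetical helper name.

```shell
# Hypothetical helper: succeed only if EXE is found on $PATH inside IMAGE.
# Assumes the image contains a POSIX sh.
image_has_executable() {
  image="$1"
  exe="$2"
  # Override the entrypoint so only the lookup runs; --rm discards the
  # container afterwards. The exit status of "command -v" is returned.
  docker run --rm --entrypoint sh "$image" \
    -c 'command -v "$1"' sh "$exe" >/dev/null 2>&1
}
```

For example, `image_has_executable "$IMAGE" jupyter || echo "jupyter not found in $IMAGE"`, where `$IMAGE` is the URI set in values.yaml. A non-zero status is consistent with the "executable file not found in $PATH" error above, although it can also mean the image could not be pulled or has no `sh`.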

ShahabNaz commented 4 years ago

Hello, I have exactly the same problem with the same AWS example. Have you found a solution for it? Thanks.