everpeace / kube-openmpi

Open MPI jobs on Kubernetes
Apache License 2.0

issue with cluster deployment #25

Closed: zakiournani closed this issue 6 years ago

zakiournani commented 6 years ago

Hi, I am having an issue deploying the master pod and running the workers:

kubectl get pods

NAME            READY   STATUS                  RESTARTS   AGE
coco-master     0/2     Init:CrashLoopBackOff   40         3h
coco-worker-0   1/1     Running                 0          3h
coco-worker-1   1/1     Running                 0          3h
coco-worker-2   1/1     Running                 0          3h

The values file is the default one.

kubectl describe pod/coco-master -n default

Name:         coco-master
Namespace:    default
Node:         minikube/10.0.2.15
Start Time:   Mon, 10 Sep 2018 10:13:16 +0200
Labels:       app=kube-openmpi
              chart=kube-openmpi-0.7.0
              heritage=Tiller
              release=coco
              role=master
Annotations:
Status:       Pending
IP:           172.17.0.10
Init Containers:
  hostfile-initializer:
    Container ID:  docker://9ae5b8f4c13473e5e5a102f462e961b027be9468c79ae5f83e33b59ca4084dc6
    Image:         everpeace/kubectl:1.9.2
    Image ID:      docker-pullable://everpeace/kubectl@sha256:d9b8948a360b5d1c27d2ebe3c1b803c4889d353429e6da70f4a2b66563c53bf4
    Port:
    Host Port:
    Command:
      sh
      -c
      /kube-openmpi/utils/gen_hostfile.sh $HOSTFILE_DIR/hostfile
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Mon, 10 Sep 2018 10:44:23 +0200
      Finished:     Mon, 10 Sep 2018 10:44:23 +0200
    Ready:          False
    Restart Count:  11
    Environment:
      HOSTFILE_DIR:  /kube-openmpi/generated
    Mounts:
      /kube-openmpi/generated from kube-openmpi-hostfile-dir (rw)
      /kube-openmpi/utils from kube-openmpi-utils (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-dzhst (ro)
Containers:
  mpi-master:
    Container ID:
    Image:          everpeace/kube-openmpi:0.7.0
    Image ID:
    Port:           2022/TCP
    Host Port:      0/TCP
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Environment:
      HOSTFILE:    /kube-openmpi/generated/hostfile
      GUILLOTINE:  /kube-openmpi/guillotine
    Mounts:
      /kube-openmpi/generated/ from kube-openmpi-hostfile-dir (rw)
      /kube-openmpi/guillotine from kube-openmpi-guillotine (rw)
      /kube-openmpi/utils from kube-openmpi-utils (rw)
      /ssh-key/openmpi from kube-openmpi-ssh-key (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-dzhst (ro)
  hostfile-updater:
    Container ID:
    Image:          everpeace/kubectl:1.9.2
    Image ID:
    Port:
    Host Port:
    Command:
      sh
      -c
      while [ ! -e $GUILLOTINE/execute ]; do
        /kube-openmpi/utils/gen_hostfile.sh $HOSTFILE_DIR/hostfile 1
        if [ -e /kube-openmpi/hostfile-updater-params/update_every ]; then
          SLEEP=$(cat /kube-openmpi/hostfile-updater-params/update_every)
        fi
        sleep ${SLEEP:-15}
      done
      echo Done.
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Environment:
      HOSTFILE_DIR:  /kube-openmpi/generated
      GUILLOTINE:    /kube-openmpi/guillotine
    Mounts:
      /kube-openmpi/generated from kube-openmpi-hostfile-dir (rw)
      /kube-openmpi/guillotine from kube-openmpi-guillotine (rw)
      /kube-openmpi/hostfile-updater-params from kube-openmpi-hostfile-updater-params (rw)
      /kube-openmpi/utils from kube-openmpi-utils (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-dzhst (ro)
Conditions:
  Type           Status
  Initialized    False
  Ready          False
  PodScheduled   True
Volumes:
  kube-openmpi-guillotine:
    Type:    EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
  kube-openmpi-hostfile-dir:
    Type:    EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
  kube-openmpi-hostfile-updater-params:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      coco-assets
    Optional:  false
  kube-openmpi-utils:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      coco-assets
    Optional:  false
  kube-openmpi-ssh-key:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  coco-ssh-key
    Optional:    false
  default-token-dzhst:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-dzhst
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason                 Age                 From               Message
  Normal   Scheduled              32m                 default-scheduler  Successfully assigned coco-master to minikube
  Normal   SuccessfulMountVolume  32m                 kubelet, minikube  MountVolume.SetUp succeeded for volume "kube-openmpi-guillotine"
  Normal   SuccessfulMountVolume  32m                 kubelet, minikube  MountVolume.SetUp succeeded for volume "kube-openmpi-hostfile-dir"
  Normal   SuccessfulMountVolume  32m                 kubelet, minikube  MountVolume.SetUp succeeded for volume "kube-openmpi-utils"
  Normal   SuccessfulMountVolume  32m                 kubelet, minikube  MountVolume.SetUp succeeded for volume "kube-openmpi-ssh-key"
  Normal   SuccessfulMountVolume  32m                 kubelet, minikube  MountVolume.SetUp succeeded for volume "kube-openmpi-hostfile-updater-params"
  Normal   SuccessfulMountVolume  32m                 kubelet, minikube  MountVolume.SetUp succeeded for volume "default-token-dzhst"
  Normal   Pulled                 32m (x4 over 32m)   kubelet, minikube  Container image "everpeace/kubectl:1.9.2" already present on machine
  Normal   Created                32m (x4 over 32m)   kubelet, minikube  Created container
  Normal   Started                32m (x4 over 32m)   kubelet, minikube  Started container
  Warning  BackOff                2m (x138 over 32m)  kubelet, minikube  Back-off restarting failed container

kubectl logs coco-master -c hostfile-initializer

target=$1

trap "rm -f ${target}_new" EXIT TERM INT KILL

cluster_size=$(kubectl -n default get statefulsets coco-worker -o jsonpath='{.status.replicas}')

Did I miss something? Thank you in advance.
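
For anyone hitting the same symptom: the hostfile-initializer log stops at the `kubectl -n default get statefulsets coco-worker` call and the init container exits with code 1, so re-running that query by hand is a reasonable first check. The following is only a sketch under assumptions. The function names `worker_replicas` and `can_read_statefulsets` and the `KUBECTL` override are hypothetical, introduced here for illustration, and the service account name `default` is inferred from the mounted `default-token-dzhst` secret. Whether permissions are the actual cause was never confirmed in this thread.

```shell
#!/bin/sh
# Sketch only: re-run the query at which the hostfile-initializer log stops.
# KUBECTL is a hypothetical override so the functions can be tried against a stub.

# Print the StatefulSet's replica count, or report why the query failed.
worker_replicas() {
  ns=$1
  sts=$2
  if ! out=$("${KUBECTL:-kubectl}" -n "$ns" get statefulsets "$sts" \
        -o jsonpath='{.status.replicas}' 2>&1); then
    echo "query failed: $out" >&2
    return 1
  fi
  if [ -z "$out" ]; then
    echo "query returned no replica count" >&2
    return 1
  fi
  echo "$out"
}

# Ask the API server whether a service account may run that query.
# Assumption: the pod uses the "default" account (it mounts default-token-dzhst).
can_read_statefulsets() {
  ns=$1
  sa=$2
  "${KUBECTL:-kubectl}" auth can-i get statefulsets -n "$ns" \
    --as="system:serviceaccount:$ns:$sa"
}
```

With the names from this issue, that would be `worker_replicas default coco-worker` followed by `can_read_statefulsets default default`; `kubectl auth can-i` answers `yes` or `no`.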

zakiournani commented 6 years ago

kubectl exec -it coco-master -- mpiexec --allow-run-as-root --hostfile /kube-openmpi/generated/hostfile --display-map -n 4 -npernode 1 sh -c 'echo $(hostname):hello'

Defaulting container name to mpi-master.
Use 'kubectl describe pod/coco-master -n default' to see all of the containers in this pod.
error: unable to upgrade connection: container not found ("mpi-master")
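
The `container not found ("mpi-master")` error is consistent with the describe output, where mpi-master is still in PodInitializing: the main containers of a pod do not start until its init containers have succeeded, so there is nothing to exec into yet. A minimal sketch of guarding the exec on the pod's Ready condition might look like this. `pod_ready`, `run_when_ready`, and the `KUBECTL` override are hypothetical names used only for illustration.

```shell
#!/bin/sh
# Sketch only: run a command in a pod only once the pod reports Ready.
# KUBECTL is a hypothetical override so the functions can be tried against a stub.

# Succeed only when the pod's Ready condition is "True".
pod_ready() {
  ns=$1
  pod=$2
  status=$("${KUBECTL:-kubectl}" -n "$ns" get pod "$pod" \
    -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}') || return 1
  [ "$status" = "True" ]
}

# Exec the given command in the pod's default container if the pod is Ready;
# otherwise print a hint and fail without attempting the exec.
run_when_ready() {
  ns=$1
  pod=$2
  shift 2
  if pod_ready "$ns" "$pod"; then
    "${KUBECTL:-kubectl}" -n "$ns" exec "$pod" -- "$@"
  else
    echo "pod $pod is not Ready; check its init containers first" >&2
    return 1
  fi
}
```

For the pod in this issue, `run_when_ready default coco-master hostname` would refuse to exec until the hostfile-initializer problem is resolved.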

glmdev commented 5 years ago

I am also having this issue. Anyone?

zakiournani commented 5 years ago

Hi, I do not remember the details of the bug, but I figured out another way to run MPI on Kubernetes, which you can check out at https://github.com/zakiournani/Simple-MPI-cluster-on-Kubernetes. Cheers.