aws-samples / aws-do-eks

MIT No Attribution

MPIJob EFA example doesn't apply #19

Open kwohlfahrt opened 10 months ago

kwohlfahrt commented 10 months ago

The MPIJob EFA example here doesn't apply cleanly. It fails with the following error:

Error from server (BadRequest): error when creating "mpijob.yaml": MPIJob in version "v2beta1" cannot be handled as a MPIJob: strict decoding error: unknown field "spec.mpiReplicaSpecs.launcher.template.spec.imagePullPolicy", unknown field "spec.mpiReplicaSpecs.worker.template.spec.imagePullPolicy"

The issue is that imagePullPolicy must be specified on the container, not on the pod spec. Changing the launcher so it reads like this (and doing the same for the worker) allows it to apply:

  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          restartPolicy: OnFailure
          containers:
          #- image: <account>.dkr.ecr.us-west-2.amazonaws.com/cuda-efa-nccl-tests:ubuntu18.04
          - image: public.ecr.aws/w6p6i9i7/aws-efa-nccl-rdma:base-cudnn8-cuda11-ubuntu18.04
            imagePullPolicy: IfNotPresent
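
For the worker, the same change might look like the sketch below. This is a fragment only: the replica count and the reuse of the launcher's image are assumptions for illustration, not values taken from the example manifest, and all other worker fields are left out.

```yaml
  mpiReplicaSpecs:
    Worker:
      replicas: 2  # assumed value, not from the example manifest
      template:
        spec:
          containers:
          # assumed: same image as the launcher
          - image: public.ecr.aws/w6p6i9i7/aws-efa-nccl-rdma:base-cudnn8-cuda11-ubuntu18.04
            # set per container, not directly under template.spec
            imagePullPolicy: IfNotPresent
```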

Edit: actually, even with this fix, I'm unable to get it running. The launcher's SSH connection to the worker is reset: Connection reset by 172.17.5.245 port 22.