aws-samples / aws-efa-eks

Deploying EFA in EKS utilizing GPUDirectRDMA where supported
MIT No Attribution

Public ECR repo not found for NCCL tests #25

Open ytsssun opened 2 months ago

ytsssun commented 2 months ago

Issue

I am trying to run the EFA EKS NCCL tests, but the public ECR image (public.ecr.aws/w6p6i9i7/aws-efa-nccl-rdma) referenced here no longer exists. What is the alternative for that?

Pod log:

Events:
  Type     Reason   Age                  From     Message
  ----     ------   ----                 ----     -------
  Normal   Pulling  59m (x4 over 61m)    kubelet  Pulling image "public.ecr.aws/w6p6i9i7/aws-efa-nccl-rdma:22.03-pt-py3"
  Warning  Failed   59m (x4 over 61m)    kubelet  Failed to pull image "public.ecr.aws/w6p6i9i7/aws-efa-nccl-rdma:22.03-pt-py3": rpc error: code = NotFound desc = failed to pull and unpack image "public.ecr.aws/w6p6i9i7/aws-efa-nccl-rdma:22.03-pt-py3": failed to resolve reference "public.ecr.aws/w6p6i9i7/aws-efa-nccl-rdma:22.03-pt-py3": public.ecr.aws/w6p6i9i7/aws-efa-nccl-rdma:22.03-pt-py3: not found
  Warning  Failed   59m (x4 over 61m)    kubelet  Error: ErrImagePull
  Warning  Failed   59m (x6 over 61m)    kubelet  Error: ImagePullBackOff
  Normal   BackOff  76s (x260 over 61m)  kubelet  Back-off pulling image "public.ecr.aws/w6p6i9i7/aws-efa-nccl-rdma:22.03-pt-py3"

What I expect

I would expect to be able to run the EFA EKS NCCL tests with no problem.

uruddarraju commented 1 month ago

+1, running into the same issue

bryantbiggs commented 1 month ago

There is a new image available at public.ecr.aws/hpc-cloud/nccl-tests:latest. Its Dockerfile can be found here: https://github.com/aws-samples/awsome-distributed-training/blob/main/micro-benchmarks/nccl-tests/nccl-tests.Dockerfile
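
For anyone updating an existing manifest, here is a minimal sketch of swapping the removed image reference for the new one. It assumes the manifest names the old image as a plain string. The helper name swap_image and the sample line are hypothetical, not part of this repo. I have not verified that the commands in the existing test manifest match the layout of the new image, so the job spec may need further changes.

```python
import re

# Image repository from the failing pod events above (any tag).
OLD_REPO = "public.ecr.aws/w6p6i9i7/aws-efa-nccl-rdma"
# Replacement image suggested in this thread.
NEW_IMAGE = "public.ecr.aws/hpc-cloud/nccl-tests:latest"


def swap_image(manifest_text: str) -> str:
    """Replace every reference to the removed image, tagged or not, with the new one.

    Hypothetical helper: it only rewrites the image string and does not
    touch commands, arguments, or resource requests in the manifest.
    """
    pattern = re.compile(re.escape(OLD_REPO) + r"(:[\w.\-]+)?")
    return pattern.sub(NEW_IMAGE, manifest_text)


if __name__ == "__main__":
    # Sample line as it might appear in a pod or job spec (assumed layout).
    sample = "image: public.ecr.aws/w6p6i9i7/aws-efa-nccl-rdma:22.03-pt-py3\n"
    print(swap_image(sample), end="")
```

Pinning a specific tag instead of latest would make runs more reproducible, if the new repository publishes one.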