iankouls-aws closed this 11 months ago
I went through the build on a c5.4xlarge running Amazon Linux 2 and didn't hit any issues. Can you share your environment configuration: instance type, OS, and kernel version of the host?
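In case it helps, here is a rough sketch of how those three details could be collected on the host. The function names (`host_kernel`, `host_os`, `instance_type`) are made up for illustration, and `instance_type` assumes the host is an EC2 instance with the instance metadata service (IMDSv2) reachable and `curl` installed.

```shell
#!/bin/sh
# Hypothetical helpers for gathering the details requested above.

# Kernel release of the host, as printed by `uname -r`.
host_kernel() {
    uname -r
}

# Human-readable OS name from an os-release style file
# (defaults to /etc/os-release).
host_os() {
    file="${1:-/etc/os-release}"
    # Source in a subshell so the file's variables do not leak out.
    ( . "$file" && printf '%s\n' "$PRETTY_NAME" )
}

# EC2 instance type via IMDSv2. Assumption: running on EC2 with curl
# available; prints "unknown" if the metadata service cannot be reached.
instance_type() {
    token=$(curl -s -m 2 -X PUT "http://169.254.169.254/latest/api/token" \
        -H "X-aws-ec2-metadata-token-ttl-seconds: 60") || { echo unknown; return 0; }
    curl -s -m 2 -H "X-aws-ec2-metadata-token: $token" \
        "http://169.254.169.254/latest/meta-data/instance-type" || echo unknown
}
```

Calling `host_kernel`, `host_os`, and `instance_type` in turn and pasting the output should cover what was asked for.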
Step 20/23 : RUN git clone https://github.com/NVIDIA/nccl-tests.git /opt/nccl-tests && cd /opt/nccl-tests && git checkout ${NCCL_TESTS_VERSION} && make -j $(nproc) MPI=1 MPI_HOME=/opt/amazon/openmpi/ CUDA_HOME=/usr/local/cuda NCCL_HOME=/opt/nccl/build NVCC_GENCODE="-gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_86,code=sm_86 -gencode=arch=compute_90,code=sm_90"
---> Running in c4e951b5b44b
Cloning into '/opt/nccl-tests'...
Already on 'master'
Your branch is up to date with 'origin/master'.
make -C src build BUILDDIR=/opt/nccl-tests/build
make[1]: Entering directory '/opt/nccl-tests/src'
Compiling timer.cc > /opt/nccl-tests/build/timer.o
Compiling /opt/nccl-tests/build/verifiable/verifiable.o
Compiling all_reduce.cu > /opt/nccl-tests/build/all_reduce.o
Compiling common.cu > /opt/nccl-tests/build/common.o
Compiling all_gather.cu > /opt/nccl-tests/build/all_gather.o
Compiling broadcast.cu > /opt/nccl-tests/build/broadcast.o
Compiling reduce_scatter.cu > /opt/nccl-tests/build/reduce_scatter.o
Compiling reduce.cu > /opt/nccl-tests/build/reduce.o
Compiling alltoall.cu > /opt/nccl-tests/build/alltoall.o
Compiling scatter.cu > /opt/nccl-tests/build/scatter.o
Compiling gather.cu > /opt/nccl-tests/build/gather.o
Compiling sendrecv.cu > /opt/nccl-tests/build/sendrecv.o
Compiling hypercube.cu > /opt/nccl-tests/build/hypercube.o
Linking /opt/nccl-tests/build/all_reduce.o > /opt/nccl-tests/build/all_reduce_perf
Linking /opt/nccl-tests/build/all_gather.o > /opt/nccl-tests/build/all_gather_perf
Linking /opt/nccl-tests/build/broadcast.o > /opt/nccl-tests/build/broadcast_perf
Linking /opt/nccl-tests/build/reduce_scatter.o > /opt/nccl-tests/build/reduce_scatter_perf
Linking /opt/nccl-tests/build/reduce.o > /opt/nccl-tests/build/reduce_perf
Linking /opt/nccl-tests/build/alltoall.o > /opt/nccl-tests/build/alltoall_perf
Linking /opt/nccl-tests/build/scatter.o > /opt/nccl-tests/build/scatter_perf
Linking /opt/nccl-tests/build/gather.o > /opt/nccl-tests/build/gather_perf
Linking /opt/nccl-tests/build/sendrecv.o > /opt/nccl-tests/build/sendrecv_perf
Linking /opt/nccl-tests/build/hypercube.o > /opt/nccl-tests/build/hypercube_perf
make[1]: Leaving directory '/opt/nccl-tests/src'
Removing intermediate container c4e951b5b44b
---> 07e0bb88e5e5
Step 21/23 : ENV NCCL_PROTO simple
---> Running in b83bb2f2155b
Removing intermediate container b83bb2f2155b
---> c78062b3f357
Step 22/23 : RUN rm -rf /var/lib/apt/lists/*
---> Running in 40a1e0301b6c
Removing intermediate container 40a1e0301b6c
---> 8a66514bbbbd
Step 23/23 : ENV LD_PRELOAD /opt/nccl/build/lib/libnccl.so
---> Running in 031a5c6c8c64
Removing intermediate container 031a5c6c8c64
---> 45e48200ed4e
Successfully built 45e48200ed4e
The root cause of this issue was hidden in the build prior to the nccl-tests step:
./efa_installer.sh -y -g -d --skip-kmod --skip-limit-conf --no-verify
= Starting Amazon Elastic Fabric Adapter Installation Script =
= EFA Installer Version: 1.28.0 =
Unsupported operating system.
Refer EFA documentation (https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa.html#efa-amis) for more details on supported OSes.
The error occurred because the base image was ubuntu:18.04 instead of ubuntu:22.04. EFA installer v1.28.0 supports ubuntu:20.04 and ubuntu:22.04.
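A guard along these lines, run before the EFA installer step, might make this kind of failure stop the build instead of getting buried in the log. This is only a minimal sketch: the function names `efa_supported_ubuntu` and `os_version_id` are hypothetical, and the supported list is just the two Ubuntu versions mentioned above for installer v1.28.0, so it would need updating for other installer versions or distributions.

```shell
#!/bin/sh
# Hypothetical helper: succeeds (exit status 0) only for the Ubuntu
# versions this thread reports as supported by EFA installer v1.28.0.
efa_supported_ubuntu() {
    case "$1" in
        20.04|22.04) return 0 ;;
        *) return 1 ;;
    esac
}

# Print VERSION_ID from an os-release style file
# (defaults to /etc/os-release).
os_version_id() {
    file="${1:-/etc/os-release}"
    # Source in a subshell so the file's variables do not leak out.
    ( . "$file" && printf '%s\n' "$VERSION_ID" )
}

# Example use inside a Dockerfile RUN step, before the installer:
#   efa_supported_ubuntu "$(os_version_id)" || {
#       echo "Base image not supported by this EFA installer" >&2
#       exit 1
#   }
```

With an ubuntu:18.04 base image the check fails and the `exit 1` aborts that `RUN` step, so the problem shows up at the point where it actually occurs.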
After the latest update of the Dockerfile (commit 8a9e75dd956377ff75126ffe4de8a1144c2db02e, https://github.com/aws-samples/awsome-distributed-training/blob/main/4.validation_scripts/0.nccl-tests/0.nccl-tests.Dockerfile), the image fails to build with the following error: