aws-samples / awsome-distributed-training

Collection of best practices, reference architectures, model training examples and utilities to train large models on AWS.
MIT No Attribution

NCCL tests Docker image fails to build due to missing mpi.h #74

Closed iankouls-aws closed 11 months ago

iankouls-aws commented 11 months ago

After the latest update of the Dockerfile (commit 8a9e75dd956377ff75126ffe4de8a1144c2db02e), https://github.com/aws-samples/awsome-distributed-training/blob/main/4.validation_scripts/0.nccl-tests/0.nccl-tests.Dockerfile fails to build with the following error:

#7 731.9 Compiling  all_reduce.cu                       > /opt/nccl-tests/build/all_reduce.o
#7 732.0 In file included from all_reduce.cu:8:0:
#7 732.0 common.h:14:10: fatal error: mpi.h: No such file or directory
#7 732.0  #include "mpi.h"
#7 732.0           ^~~~~~~
#7 732.0 compilation terminated.
#7 732.0 Makefile:92: recipe for target '/opt/nccl-tests/build/all_reduce.o' failed
#7 732.0 make[1]: Leaving directory '/opt/nccl-tests/src'
#7 732.0 make[1]: *** [/opt/nccl-tests/build/all_reduce.o] Error 1
#7 732.0 make: *** [src.build] Error 2
#7 732.0 Makefile:20: recipe for target 'src.build' failed
#7 DONE 732.2s
mhuguesaws commented 11 months ago

I went through the build on a c5.4xlarge running Amazon Linux 2 and didn't have any issues. Can you share your environment configuration: instance type, OS, and kernel version of the host?

Step 20/23 : RUN git clone https://github.com/NVIDIA/nccl-tests.git /opt/nccl-tests     && cd /opt/nccl-tests     && git checkout ${NCCL_TESTS_VERSION}     && make -j $(nproc)     MPI=1     MPI_HOME=/opt/amazon/openmpi/     CUDA_HOME=/usr/local/cuda     NCCL_HOME=/opt/nccl/build  NVCC_GENCODE="-gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_86,code=sm_86 -gencode=arch=compute_90,code=sm_90"
 ---> Running in c4e951b5b44b
Cloning into '/opt/nccl-tests'...
Already on 'master'
Your branch is up to date with 'origin/master'.
make -C src build BUILDDIR=/opt/nccl-tests/build
make[1]: Entering directory '/opt/nccl-tests/src'
Compiling  timer.cc                            > /opt/nccl-tests/build/timer.o
Compiling /opt/nccl-tests/build/verifiable/verifiable.o
Compiling  all_reduce.cu                       > /opt/nccl-tests/build/all_reduce.o
Compiling  common.cu                           > /opt/nccl-tests/build/common.o
Compiling  all_gather.cu                       > /opt/nccl-tests/build/all_gather.o
Compiling  broadcast.cu                        > /opt/nccl-tests/build/broadcast.o
Compiling  reduce_scatter.cu                   > /opt/nccl-tests/build/reduce_scatter.o
Compiling  reduce.cu                           > /opt/nccl-tests/build/reduce.o
Compiling  alltoall.cu                         > /opt/nccl-tests/build/alltoall.o
Compiling  scatter.cu                          > /opt/nccl-tests/build/scatter.o
Compiling  gather.cu                           > /opt/nccl-tests/build/gather.o
Compiling  sendrecv.cu                         > /opt/nccl-tests/build/sendrecv.o
Compiling  hypercube.cu                        > /opt/nccl-tests/build/hypercube.o
Linking  /opt/nccl-tests/build/all_reduce.o  > /opt/nccl-tests/build/all_reduce_perf
Linking  /opt/nccl-tests/build/all_gather.o  > /opt/nccl-tests/build/all_gather_perf
Linking  /opt/nccl-tests/build/broadcast.o   > /opt/nccl-tests/build/broadcast_perf
Linking  /opt/nccl-tests/build/reduce_scatter.o > /opt/nccl-tests/build/reduce_scatter_perf
Linking  /opt/nccl-tests/build/reduce.o      > /opt/nccl-tests/build/reduce_perf
Linking  /opt/nccl-tests/build/alltoall.o    > /opt/nccl-tests/build/alltoall_perf
Linking  /opt/nccl-tests/build/scatter.o     > /opt/nccl-tests/build/scatter_perf
Linking  /opt/nccl-tests/build/gather.o      > /opt/nccl-tests/build/gather_perf
Linking  /opt/nccl-tests/build/sendrecv.o    > /opt/nccl-tests/build/sendrecv_perf
Linking  /opt/nccl-tests/build/hypercube.o   > /opt/nccl-tests/build/hypercube_perf
make[1]: Leaving directory '/opt/nccl-tests/src'
Removing intermediate container c4e951b5b44b
 ---> 07e0bb88e5e5
Step 21/23 : ENV NCCL_PROTO simple
 ---> Running in b83bb2f2155b
Removing intermediate container b83bb2f2155b
 ---> c78062b3f357
Step 22/23 : RUN rm -rf /var/lib/apt/lists/*
 ---> Running in 40a1e0301b6c
Removing intermediate container 40a1e0301b6c
 ---> 8a66514bbbbd
Step 23/23 : ENV LD_PRELOAD /opt/nccl/build/lib/libnccl.so
 ---> Running in 031a5c6c8c64
Removing intermediate container 031a5c6c8c64
 ---> 45e48200ed4e
Successfully built 45e48200ed4e
iankouls-aws commented 11 months ago

The root cause of this issue was hidden in the build prior to the nccl-tests step:

./efa_installer.sh -y -g -d --skip-kmod --skip-limit-conf --no-verify
= Starting Amazon Elastic Fabric Adapter Installation Script =
= EFA Installer Version: 1.28.0 =

Unsupported operating system.
Refer EFA documentation (https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa.html#efa-amis) for more details on supported OSes.

The reason for this error was the use of an ubuntu:18.04 base image instead of 22.04. EFA installer v1.28.0 supports ubuntu:20.04 and ubuntu:22.04.
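
For anyone hitting the same thing: the nccl-tests step builds with MPI_HOME=/opt/amazon/openmpi/, so if the EFA installer exits early, the headers are presumably never placed there and the failure only shows up later as the missing mpi.h. A minimal sketch of a guard that could run right after the installer step to fail fast is below. The function name `check_mpi_header` is hypothetical (it is not part of the repository or of the EFA installer), and the `include/mpi.h` location under MPI_HOME is an assumption based on where the nccl-tests build looks for the header.

```shell
#!/bin/sh
# Hypothetical helper, not part of the repo or the EFA installer.
# Succeeds only if mpi.h exists under the given MPI home directory.
# The default path is the MPI_HOME passed to make in the nccl-tests step.
check_mpi_header() {
    mpi_home="${1:-/opt/amazon/openmpi}"
    if [ -f "${mpi_home}/include/mpi.h" ]; then
        echo "found ${mpi_home}/include/mpi.h"
        return 0
    fi
    # Report on stderr so the message stands out in the docker build log.
    echo "mpi.h not found under ${mpi_home}/include; check the EFA installer output" >&2
    return 1
}
```

In a Dockerfile this could be called at the end of the RUN step that runs the installer, for example `check_mpi_header /opt/amazon/openmpi || exit 1`, so the build stops at the step that actually failed rather than several hundred seconds later in the nccl-tests compile.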