Closed (NavarrePratt closed 1 year ago)
We might want to build this on our NCCL images, so that HPC-X etc. is available for SHARP and nccl-tests is available for debugging: https://github.com/coreweave/nccl-tests. Another option is to use the NVIDIA PyTorch image from NGC. It ships with APEX etc., so you don't have to compile it from scratch, but it is bloated.
@NavarrePratt Build complete, success: https://github.com/coreweave/ml-containers/actions/runs/4058743895
Image: ghcr.io/coreweave/ml-containers/gpt-neox-mpi:npratt-update-gpt-neox-mpi-299e026
@NavarrePratt Build complete, success: https://github.com/coreweave/ml-containers/actions/runs/4075620331
Image: ghcr.io/coreweave/ml-containers/gpt-neox-mpi:npratt-update-gpt-neox-mpi-39cb8e4
The original EleutherAI gpt-neox images used an older CUDA base image along with older versions of torch and apex. This didn't cause any problems at first, but I noticed that an NCCL error was thrown whenever a worker ran on a g99* A100 NVLINK 80GB node. Once I upgraded the versions, I no longer saw the NCCL error on those nodes.
NCCL Error:
Two specific nodes that I know threw these errors: