coreweave / ml-containers

MIT License
update(gpt-neox-mpi): Update cuda and apex versions #10

Status: Closed (NavarrePratt closed this 1 year ago)

NavarrePratt commented 1 year ago

The original EleutherAI gpt-neox images used an older CUDA base image along with older versions of torch and apex. This caused no problems at first, but I noticed that an NCCL error was thrown whenever a worker ran on a g99* A100 NVLINK 80GB node. After upgrading these versions, I no longer see the NCCL error on those nodes.

NCCL Error:

NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:825, unhandled system error, NCCL version 2.7.8
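
To check which NCCL version an image actually reports before and after an upgrade like this one, something along these lines could help. It is only a sketch and not part of this PR: `parse_nccl_version` and `installed_versions` are hypothetical helper names, and the second helper assumes torch may or may not be importable in the environment where it runs.

```python
import re

# Matches the "NCCL version X.Y.Z" suffix seen in the error line above.
_NCCL_VERSION_RE = re.compile(r"NCCL version (\d+)\.(\d+)\.(\d+)")


def parse_nccl_version(log_line):
    """Return the NCCL version found in a log line as a tuple of ints, or None."""
    match = _NCCL_VERSION_RE.search(log_line)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def installed_versions():
    """Report torch, CUDA and NCCL versions, or None if torch is not importable."""
    try:
        import torch
    except ImportError:
        return None
    info = {"torch": torch.__version__, "cuda": torch.version.cuda}
    try:
        # May fail on builds without NCCL support (assumption: treat as unknown).
        info["nccl"] = torch.cuda.nccl.version()
    except Exception:
        info["nccl"] = None
    return info


# The error line quoted in this PR description.
error_line = (
    "NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:825, "
    "unhandled system error, NCCL version 2.7.8"
)

if __name__ == "__main__":
    print(parse_nccl_version(error_line))
    print(installed_versions())
```

Running this inside the old and the new image should show whether the bundled NCCL version actually changed. For an "unhandled system error" like the one above, setting the `NCCL_DEBUG=INFO` environment variable on the workers usually produces more detailed NCCL logs.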

Two specific nodes that I know threw these errors:

salanki commented 1 year ago

We might want to build this on our NCCL images, so that HPC-X etc. is available for SHARP, along with the NCCL tests for debugging: https://github.com/coreweave/nccl-tests. Another option is to use the NVIDIA PyTorch image from NGC. It includes APEX etc., so you don't have to compile it from scratch, but it is bloated.

github-actions[bot] commented 1 year ago

@NavarrePratt Build complete, success: https://github.com/coreweave/ml-containers/actions/runs/4058743895 Image: ghcr.io/coreweave/ml-containers/gpt-neox-mpi:npratt-update-gpt-neox-mpi-299e026

github-actions[bot] commented 1 year ago

@NavarrePratt Build complete, success: https://github.com/coreweave/ml-containers/actions/runs/4075620331 Image: ghcr.io/coreweave/ml-containers/gpt-neox-mpi:npratt-update-gpt-neox-mpi-39cb8e4