Closed Eta0 closed 1 year ago
I just published a new CUDA 12 image in nccl-tests; please use it as the base. Thank you!
Updated!
Heads up, there is a small typo over in the nccl-tests README:
--- a/README.md
+++ b/README.md
@@ -21,7 +21,7 @@ CoreWeave also [publishes images](https://hub.docker.com/r/coreweave/nccl-tests/
| **Image Tag** | **CUDA** | **NCCL** | **HPC-X** |
|---------------|----------|----------|-----------|
-| ghcr.io/coreweave/nccl-tests:12.0.1-devel-ubuntu20.04-nccl2.16.5-1-8fc1a44 | 12.0.1 | 2.17.1 | 2.14.0 |
+| ghcr.io/coreweave/nccl-tests:12.0.1-devel-ubuntu20.04-nccl2.17.1-1-8fc1a44 | 12.0.1 | 2.17.1 | 2.14.0 |
| ghcr.io/coreweave/nccl-tests:11.8.0-devel-ubuntu20.04-nccl2.16.2-1-4a46534 | 11.8.0 | 2.16.2 | 2.14.0 |
| ghcr.io/coreweave/nccl-tests:11.7.1-devel-ubuntu20.04-nccl2.14.3-1-4a46534 | 11.7.1 | 2.14.3 | 2.14.0 |
| coreweave/nccl-tests:2022-09-28_16-34-19.392_EDT | 11.6.2 | 2.12.0 | 2.12 |
Wanna PR that there?
@wbrown the ml containers support multiple build artifacts from the same Dockerfile, so you can have one based on the NCCL test image for distributed training and one on the base nccl from nvidia! It's super neat! The SLURM images do that.
@salanki oh, awesome!
Wanna PR that there?
Could we do a github workflow for non-NCCL builds?
Will do! I was going to make a separate PR later with a `torch-base` build folder, but since they're so similar it would just be nicer to parameterize them to cover all the differences between CUDA 11.8 / CUDA 12.0 / base / nccl all in one Dockerfile. I'll get that working.
Open to descriptive name recommendations for the build, too, to replace `torch-nccl`. I think either "`torch`" or "`torch-base`" would work.
If you are looking at parameterizing one Dockerfile and doing that with the CI, you may want to consider a build matrix for the CI.
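To make the build-matrix suggestion concrete, here is a minimal sketch of the combinations it would cover: the two CUDA versions and the base / nccl variants mentioned in this thread. The image names, tag scheme, and the `CUDA_VERSION` / `BASE_FLAVOR` build-arg names are assumptions for illustration only; they are not taken from the actual Dockerfile or CI configuration.

```python
from itertools import product

# Axes taken from the variants discussed in this thread.
CUDA_VERSIONS = ["11.8.0", "12.0.1"]
BASE_FLAVORS = ["base", "nccl"]


def build_matrix(cuda_versions=CUDA_VERSIONS, flavors=BASE_FLAVORS):
    """Return one dict per (CUDA version, base flavor) combination."""
    return [
        {
            "cuda": cuda,
            "flavor": flavor,
            # Hypothetical image name and tag scheme, for illustration only.
            "tag": f"torch-{flavor}:cuda{cuda}",
        }
        for cuda, flavor in product(cuda_versions, flavors)
    ]


def docker_build_args(entry, dockerfile="Dockerfile"):
    """Build the argv for one `docker build`; the ARG names are hypothetical."""
    return [
        "docker", "build",
        "--file", dockerfile,
        "--build-arg", f"CUDA_VERSION={entry['cuda']}",
        "--build-arg", f"BASE_FLAVOR={entry['flavor']}",
        "--tag", entry["tag"],
        ".",
    ]


if __name__ == "__main__":
    # Print the commands instead of running them, so this is safe to try.
    for entry in build_matrix():
        print(" ".join(docker_build_args(entry)))
```

A CI matrix would express the same two axes declaratively; the point of the sketch is only that a single parameterized Dockerfile yields four builds from two small lists.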
@Eta0 Let's merge this, and do the build matrix, etc, as a separate issue/thing.
This adds an image with the PyTorch v2.0.0-rc1 and torchvision 0.15.0-rc1 libraries, built from source on top of the latest CUDA 12.0.1 nccl-tests image, which it extends.
Notably, the PyTorch build includes the CMake build flags `USE_SYSTEM_NCCL=1` and `USE_SYSTEM_UCC=1` to properly make use of the packages provided by nccl-tests.
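As a rough illustration of how those two flags might be supplied, the sketch below sets them as environment variables for a source build. It assumes the build is driven through PyTorch's `setup.py`, which forwards `USE_*` environment variables to CMake. The actual Dockerfile in this PR is not shown here and may pass them differently; the helper name and the build command are hypothetical.

```python
import os


def pytorch_build_env(base_env=None):
    """Return a copy of the environment with the two flags from this PR set."""
    env = dict(os.environ if base_env is None else base_env)
    # Link against the NCCL and UCC packages already installed in the base
    # image instead of the copies bundled with the PyTorch source tree.
    env["USE_SYSTEM_NCCL"] = "1"
    env["USE_SYSTEM_UCC"] = "1"
    return env


# Assumed build command, for illustration only; it is not executed here.
BUILD_COMMAND = ["python", "setup.py", "bdist_wheel"]
```

Inside a PyTorch checkout this could be run with `subprocess.run(BUILD_COMMAND, env=pytorch_build_env(), check=True)`.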