coreweave / ml-containers

MIT License

feat(torch-nccl): Add torch-nccl image #15

Closed Eta0 closed 1 year ago

Eta0 commented 1 year ago

This adds an image with the PyTorch v2.0.0-rc1 and torchvision 0.15.0-rc1 libraries built from source. The build uses, and the resulting image extends, the latest CUDA 12.0.1 nccl-tests image.

Notably, the PyTorch build includes the CMake build flags USE_SYSTEM_NCCL=1 and USE_SYSTEM_UCC=1 to properly make use of the packages provided by nccl-tests.
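As a rough illustration of how those two flags could be supplied, here is a minimal sketch that assembles the environment for a from-source PyTorch build. It assumes the flags are passed as environment variables that PyTorch's build scripts forward to CMake; the helper name `pytorch_build_env` and the `python setup.py bdist_wheel` command are assumptions for illustration, not taken from this PR's Dockerfile.

```python
import os


def pytorch_build_env(base_env=None):
    """Return a copy of the environment with the two flags from this PR set.

    Hypothetical helper: the PR's actual Dockerfile may set these differently.
    """
    env = dict(os.environ if base_env is None else base_env)
    env.update({
        "USE_SYSTEM_NCCL": "1",  # link against the NCCL shipped in the base image
        "USE_SYSTEM_UCC": "1",   # likewise for UCC
    })
    return env


# Assumed build command; nothing is executed here, the list is only assembled.
BUILD_COMMAND = ["python", "setup.py", "bdist_wheel"]

if __name__ == "__main__":
    env = pytorch_build_env()
    print({k: v for k, v in env.items() if k.startswith("USE_SYSTEM_")})
    print(" ".join(BUILD_COMMAND))
```

The helper copies the environment rather than mutating `os.environ`, so it can be handed to `subprocess.run(BUILD_COMMAND, env=env)` without side effects on the calling process.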

salanki commented 1 year ago

I just published a new CUDA 12 image in NCCL tests, please use it as base. Thank you!

Eta0 commented 1 year ago

Updated!

Heads up, there is a small typo over in the nccl-tests README:

--- a/README.md
+++ b/README.md
@@ -21,7 +21,7 @@ CoreWeave also [publishes images](https://hub.docker.com/r/coreweave/nccl-tests/

 | **Image Tag** | **CUDA** | **NCCL** | **HPC-X** |
 |---------------|----------|----------|-----------|
-| ghcr.io/coreweave/nccl-tests:12.0.1-devel-ubuntu20.04-nccl2.16.5-1-8fc1a44 | 12.0.1   | 2.17.1   | 2.14.0    |
+| ghcr.io/coreweave/nccl-tests:12.0.1-devel-ubuntu20.04-nccl2.17.1-1-8fc1a44 | 12.0.1   | 2.17.1   | 2.14.0    |
 | ghcr.io/coreweave/nccl-tests:11.8.0-devel-ubuntu20.04-nccl2.16.2-1-4a46534 | 11.8.0   | 2.16.2   | 2.14.0    |
 | ghcr.io/coreweave/nccl-tests:11.7.1-devel-ubuntu20.04-nccl2.14.3-1-4a46534 | 11.7.1   | 2.14.3   | 2.14.0    |
 | coreweave/nccl-tests:2022-09-28_16-34-19.392_EDT            | 11.6.2   | 2.12.0   | 2.12      |
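A mismatch like this one, where the tag says `nccl2.16.5` but the NCCL column says `2.17.1`, can be caught mechanically. The following is a minimal sketch, assuming the tag always embeds the version as `-nccl<major>.<minor>.<patch>-`; the function names are hypothetical and the rows are copied from the diff above, including the pre-fix row.

```python
import re

# (image tag, value of the NCCL column) pairs copied from the README diff above.
ROWS = [
    ("ghcr.io/coreweave/nccl-tests:12.0.1-devel-ubuntu20.04-nccl2.16.5-1-8fc1a44", "2.17.1"),
    ("ghcr.io/coreweave/nccl-tests:12.0.1-devel-ubuntu20.04-nccl2.17.1-1-8fc1a44", "2.17.1"),
    ("ghcr.io/coreweave/nccl-tests:11.8.0-devel-ubuntu20.04-nccl2.16.2-1-4a46534", "2.16.2"),
    ("ghcr.io/coreweave/nccl-tests:11.7.1-devel-ubuntu20.04-nccl2.14.3-1-4a46534", "2.14.3"),
]

# Assumed tag convention: "...-nccl2.17.1-..." carries the NCCL version.
NCCL_IN_TAG = re.compile(r"-nccl(\d+\.\d+\.\d+)-")


def nccl_version_from_tag(tag):
    """Return the NCCL version embedded in a tag, or None if there is none."""
    match = NCCL_IN_TAG.search(tag)
    return match.group(1) if match else None


def mismatched_rows(rows):
    """Return rows whose tag version disagrees with the NCCL column."""
    bad = []
    for tag, column_version in rows:
        tag_version = nccl_version_from_tag(tag)
        if tag_version is not None and tag_version != column_version:
            bad.append((tag, column_version))
    return bad


if __name__ == "__main__":
    for tag, column_version in mismatched_rows(ROWS):
        print(f"{tag}: column says {column_version}")
```

Tags without an embedded version, such as the date-stamped `coreweave/nccl-tests:2022-09-28_16-34-19.392_EDT`, are skipped rather than reported.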

salanki commented 1 year ago

Wanna PR that there?

On Mar 2, 2023, at 06:23, Eta wrote:

> Updated! Heads up, there is a small typo over in the nccl-tests README: [...]

salanki commented 1 year ago

@wbrown the ml-containers repo supports multiple build artifacts from the same Dockerfile, so you can have one based on the NCCL tests image for distributed training and one based on the base NCCL image from NVIDIA. It's super neat! The SLURM images do that.

wbrown commented 1 year ago

@salanki oh, awesome!

Eta0 commented 1 year ago

> Wanna PR that there?

Pull request...ed! ✅

> Could we do a GitHub workflow for non-NCCL builds?

Will do! I was going to make a separate PR later with a torch-base build folder, but since they're so similar it would just be nicer to parameterize them to cover all the differences between CUDA 11.8 / CUDA 12.0 / base / nccl all in one Dockerfile. I'll get that working.

Open to descriptive name recommendations for the build, too, to replace torch-nccl. I think either "torch" or "torch-base" would work.

arsenetar commented 1 year ago

If you are looking at parameterizing one Dockerfile and doing that with the CI, you may want to consider a build matrix for the CI.
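To make the build-matrix suggestion concrete, here is a minimal sketch of the cross product it implies, using the CUDA versions and base flavors named earlier in the thread. In a real workflow this would normally be declared with GitHub Actions' `strategy.matrix` rather than computed in Python; the `tag_suffix` naming scheme is purely hypothetical.

```python
from itertools import product

# Dimensions mentioned in the thread: CUDA 11.8 / 12.0 and base / nccl.
CUDA_VERSIONS = ["11.8", "12.0"]
BASE_FLAVORS = ["base", "nccl"]


def build_matrix(cuda_versions, flavors):
    """Return one build configuration per (CUDA version, flavor) pair."""
    return [
        {
            "cuda": cuda,
            "flavor": flavor,
            # Hypothetical tag scheme, for illustration only.
            "tag_suffix": f"{flavor}-cuda{cuda}",
        }
        for cuda, flavor in product(cuda_versions, flavors)
    ]


if __name__ == "__main__":
    for config in build_matrix(CUDA_VERSIONS, BASE_FLAVORS):
        print(config)
```

Each dictionary corresponds to one job that would pass its values to the single parameterized Dockerfile as build arguments, so adding a CUDA version means adding one list entry instead of another build folder.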

wbrown commented 1 year ago

@Eta0 Let's merge this, and do the build matrix, etc, as a separate issue/thing.