coreweave / ml-containers


feat(torch): PyTorch 2.0.1, torchvision 0.15.2, torchaudio 2.0.2, and NCCL 2.18 for CUDA 12.0 #23

Closed by Eta0 1 year ago

Eta0 commented 1 year ago

Version Updates

This PR updates torch to version 2.0.1, torchvision to 0.15.2, torchaudio to 2.0.2, and updates torch:nccl's base images, bumping torch:nccl's NCCL version to 2.18.1-1 (on CUDA 12.0).

This also fixes a bug in installing libnccl2 in torch:base. More on that below.

libnccl2 Fix and Thoughts

The libnccl2 package encodes its CUDA version in the package version string, in the format libnccl2=2.A.B-C+cudaX.Y, unlike other CUDA packages that encode it in the package name, like libcublas-X-Y=.... Because of this, the install step for libnccl2 wasn't being narrowed down to the correct CUDA version, so it would install the build for whatever the latest available CUDA was, rather than for the CUDA version actually present in the image. This PR fixes that by filtering the list of libnccl2 version candidates with the glob pattern 2.*+cudaX.Y, using the major and minor CUDA version of the ongoing build, so that a matching version of libnccl2 is installed.
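
As a rough illustration of the filtering idea (not the actual code in this PR, which runs in the image build rather than in Python), the glob match can be sketched as follows. The function name filter_nccl_candidates and the candidate list are hypothetical and exist only for this example; only the pattern shape 2.*+cudaX.Y comes from the description above.

```python
from fnmatch import fnmatchcase


def filter_nccl_candidates(versions, cuda_major, cuda_minor):
    """Keep only libnccl2 versions built for the given CUDA major.minor.

    Hypothetical helper for illustration; it mirrors the glob pattern
    2.*+cudaX.Y described in the PR text.
    """
    # In fnmatch patterns only *, ?, and [] are special, so "+" and "."
    # are matched literally and the whole string must match.
    pattern = f"2.*+cuda{cuda_major}.{cuda_minor}"
    return [v for v in versions if fnmatchcase(v, pattern)]


# Illustrative candidate list (assumed values, not real repository output).
candidates = [
    "2.18.1-1+cuda12.1",
    "2.18.1-1+cuda12.0",
    "2.16.5-1+cuda11.8",
    "2.15.5-1+cuda11.8",
]

# For a CUDA 11.8 build, only the +cuda11.8 entries remain.
matching = filter_nccl_candidates(candidates, 11, 8)
print(matching)
```

Without such a filter, a selection that simply takes the first (newest) candidate would pick a +cuda12.1 build even in a CUDA 11.8 image, which is the mismatch described above.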

Thoughts

This technically downgrades libnccl2 in torch:base to version 2.16.5 (on CUDA 11.8) from the higher (though presumably incompatible) versions it was installing previously. However, it's unclear how much this matters: installing libnccl2 in torch:base is (mostly) superfluous to begin with, since that build is not configured to use the system-wide nccl install, and PyTorch builds its own instead.

The system libnccl2 install could likely be removed entirely. Since the version that had been used up until now was (presumably) broken anyway, there probably isn't much using it. Alternatively, it could be kept and added to the build stage, and PyTorch could be configured to use the system-wide nccl install for torch:base as well. As-is, it is a bit redundant, and potentially a waste of a couple hundred megabytes in the image.

Removing the HPC-X Build Workaround

This PR also removes the torch:nccl-specific workaround for linking against HPC-X that was originally added in #16, as the underlying problem has been fixed upstream with the merging of nccl-tests PR #12, and torch:nccl's base image has been updated to take advantage of that.

github-actions[bot] commented 1 year ago

@Eta0 Build complete, success: https://github.com/coreweave/ml-containers/actions/runs/5018629617 Image: ghcr.io/coreweave/ml-containers/torch:es-torch-2.0.1-3a325d0-base-cuda11.8.0-torch2.0.1-vision0.15.2-audio2.0.2

github-actions[bot] commented 1 year ago

@Eta0 Build complete, success: https://github.com/coreweave/ml-containers/actions/runs/5018629617 Image: ghcr.io/coreweave/ml-containers/torch:es-torch-2.0.1-3a325d0-base-cuda12.0.1-torch2.0.1-vision0.15.2-audio2.0.2

github-actions[bot] commented 1 year ago

@Eta0 Build complete, success: https://github.com/coreweave/ml-containers/actions/runs/5018629623 Image: ghcr.io/coreweave/ml-containers/torch:es-torch-2.0.1-3a325d0-nccl-cuda12.0.1-nccl2.18.1-1-torch2.0.1-vision0.15.2-audio2.0.2

github-actions[bot] commented 1 year ago

@Eta0 Build complete, success: https://github.com/coreweave/ml-containers/actions/runs/5018629623 Image: ghcr.io/coreweave/ml-containers/torch:es-torch-2.0.1-3a325d0-nccl-cuda11.8.0-nccl2.16.2-1-torch2.0.1-vision0.15.2-audio2.0.2

github-actions[bot] commented 1 year ago

@Eta0 Build complete, success: https://github.com/coreweave/ml-containers/actions/runs/5018913069 Image: ghcr.io/coreweave/ml-containers/torch:es-torch-2.0.1-4db0c9d-base-cuda12.0.1-torch2.0.1-vision0.15.2-audio2.0.2

github-actions[bot] commented 1 year ago

@Eta0 Build complete, success: https://github.com/coreweave/ml-containers/actions/runs/5018913069 Image: ghcr.io/coreweave/ml-containers/torch:es-torch-2.0.1-4db0c9d-base-cuda11.8.0-torch2.0.1-vision0.15.2-audio2.0.2

github-actions[bot] commented 1 year ago

@Eta0 Build complete, success: https://github.com/coreweave/ml-containers/actions/runs/5018913066 Image: ghcr.io/coreweave/ml-containers/torch:es-torch-2.0.1-4db0c9d-nccl-cuda11.8.0-nccl2.16.2-1-torch2.0.1-vision0.15.2-audio2.0.2

github-actions[bot] commented 1 year ago

@Eta0 Build complete, success: https://github.com/coreweave/ml-containers/actions/runs/5018913066 Image: ghcr.io/coreweave/ml-containers/torch:es-torch-2.0.1-4db0c9d-nccl-cuda12.0.1-nccl2.18.1-1-torch2.0.1-vision0.15.2-audio2.0.2