coreweave / ml-containers


Multiple types of Torch builds #16

Closed Eta0 closed 1 year ago

Eta0 commented 1 year ago

Multiple Torch Builds

This PR adds support for two distinct types of PyTorch builds:

- torch:base: a standard PyTorch (+ torchvision) build.
- torch:nccl: a PyTorch build compiled against a configurable NCCL version, with UCC support via HPC-X.

These are both grouped under the "torch" container name, with differing tags.

Tag Suffixes

The existing tag format for containers in this repository is :<short commit hash> for commits to the main branch, and :<branch name>-<short commit hash> for commits to other branches. Since these torch builds are highly parameterized (potential variables: edition [torch:base/torch:nccl], CUDA version, NCCL version, torch version, torchvision version), it is beneficial to allow tag customization to denote the different build variants, which can then also share a build cache under the common name.

To achieve this, this PR adds an optional tag-suffix parameter to the .github/workflows/build.yml template. This allows for tags of the form [<branch name>-]<short commit hash>[-<custom suffix>].
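For illustration, a caller of the template might pass the suffix like this (a minimal sketch; aside from tag-suffix, the input names and values here are assumptions, not the template's actual interface):

```yaml
# Hypothetical caller of the reusable build template.
# Only tag-suffix is taken from this PR; the other names are illustrative.
name: build-torch-base
on:
  push:
    branches: [main]
jobs:
  torch-base:
    uses: ./.github/workflows/build.yml
    with:
      image-name: torch
      tag-suffix: base-cuda11.8.0-torch2.0.0-vision0.15.0
```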

Examples

torch:base builds are suffixed with a version string, resulting in the following format:

 ghcr.io/coreweave/ml-containers/torch:<short commit hash>-base-cuda{}-torch{}-vision{}

Where {} is a placeholder for the respective library's version number.

torch:nccl builds have a similar format:

 ghcr.io/coreweave/ml-containers/torch:<short commit hash>-nccl-cuda{}-nccl{}-torch{}-vision{}

Fix for UCC integration in torch:nccl builds

torch:nccl builds now enable UCC integration, linking against HPC-X.

This was intended to be enabled when the torch-nccl builds were added, but the right combination of flags had not been set (USE_SYSTEM_UCC does not imply USE_UCC, for some reason), and furthermore a bug would have made it impossible to link against the UCX library in nccl-tests's HPC-X distribution anyway.
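In PyTorch's build configuration these are two independent switches, so both have to be set explicitly; in Dockerfile form, something like the following (placement is illustrative):

```dockerfile
# USE_SYSTEM_UCC only selects an external UCC over a bundled one;
# UCC integration itself still has to be enabled separately with USE_UCC.
ENV USE_UCC=1 \
    USE_SYSTEM_UCC=1
```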

The combined torch Dockerfile now contains a workaround to fix that bug when nccl-tests is used as a base, and properly links against UCX.
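The exact workaround isn't reproduced in this thread; purely as a hypothetical illustration, a fix of this shape could register HPC-X's UCX libraries with the dynamic linker when they live outside the default search path:

```dockerfile
# Hypothetical illustration only; the actual workaround in the combined
# Dockerfile may differ. If the base image's HPC-X distribution keeps UCX
# outside the default library search path, register it with ldconfig:
ARG HPCX_UCX_DIR=/opt/hpcx/ucx
RUN echo "${HPCX_UCX_DIR}/lib" > /etc/ld.so.conf.d/hpcx-ucx.conf \
    && ldconfig
```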

Build process optimizations

The torch builds have been optimized for size, speed, and cache-friendliness by making use of more build stages, as well as cross-stage filesystem mounts.
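As a rough sketch of the cross-stage mount technique (stage names and paths are illustrative, not the actual torch Dockerfile's layout):

```dockerfile
# syntax=docker/dockerfile:1
# Illustrative only: bind mounts expose another stage's filesystem to a
# RUN step without COPYing it into this stage's layers.
FROM alpine/git AS downloader
RUN git clone --depth 1 https://github.com/pytorch/pytorch.git /git/pytorch

FROM python:3.10 AS builder
# Build from the downloader's checkout via a bind mount (rw writes are
# discarded), with a cache mount keeping pip's cache out of the image.
RUN --mount=type=bind,from=downloader,source=/git/pytorch,target=/src/pytorch,rw \
    --mount=type=cache,target=/root/.cache/pip \
    pip wheel --no-deps -w /wheels /src/pytorch

FROM python:3.10-slim
# Install only the built wheel; the builder stage's filesystem never
# becomes part of the final image.
RUN --mount=type=bind,from=builder,source=/wheels,target=/wheels \
    pip install --no-cache-dir /wheels/*.whl
```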

Highlights:

arsenetar commented 1 year ago

This looks awesome; I like the nice PR doc.

Those image tags are getting long. Splitting does have some impact on the cache, although since the cache is synced at the end of each CI job, anything running in parallel in CI will miss cache created by other parallel runs anyway (which is unfortunate, since these often get run together). I would almost prefer keeping torch-base and torch-nccl as two separate images to maybe split that difference, but it's not a very strong opinion.

arsenetar commented 1 year ago

Also, I noticed that the workflows here only work out of the main repo, since we only run them on push. To make PRs from forks able to run, we need to add pull_request to the triggers for the top-level workflows, with the same path filters. This is mainly helpful for knowing that PRs result in working builds.
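For reference, that trigger change would look something like this (the path filters are placeholders for whatever the push trigger already uses):

```yaml
# Sketch only; reuse the path filters the push trigger already has.
on:
  push:
    paths:
      - torch/**
      - .github/workflows/torch.yml
  pull_request:
    paths:
      - torch/**
      - .github/workflows/torch.yml
```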

Eta0 commented 1 year ago

@arsenetar


> This looks awesome; I like the nice PR doc.
>
> Those image tags are getting long. Splitting does have some impact on the cache, although since the cache is synced at the end of each CI job, anything running in parallel in CI will miss cache created by other parallel runs anyway (which is unfortunate, since these often get run together). I would almost prefer keeping torch-base and torch-nccl as two separate images to maybe split that difference, but it's not a very strong opinion.

Image tags

As far as hard limits go, the maximum length for a Docker image tag (excluding the name) is 128 characters. A fully formatted torch:nccl tag with a 7-character hash and no branch prefix looks like abc1234-nccl-cuda11.8.0-nccl2.16.2-1-torch2.0.0-rc1-vision0.15.0-rc1, at 68 characters (though these are a bit longer than usual because of the -rc1 version suffixes). That leaves about half of the character quota for an optional branch name, which should be plenty of room.

As far as aesthetics go, having an extremely descriptive tag is pretty nice for visually scanning through versions quickly in the container repository, and for being able to pluck out the right container without having to cross-reference docs/code/commits or manually download and probe each one. It also ought to help people who use these in a Dockerfile keep track of any major changes if they want to update the version they are using later on.

So I think it'd be beneficial enough from a usability perspective to outweigh its visual clunkiness.

Caching and concurrency

On the first run they will miss out on cache-sharing, though on most subsequent updates the cache should help them out. For instance, if the base images for torch:nccl change, both parallel runs should be able to reuse the downloaded PyTorch repositories. And, since the builds running in parallel all use different base images, they wouldn't be able to share their cache for the builder steps anyway, so there isn't too much missed opportunity in that direction.

If the downloaded repositories change, both builds will re-download them in parallel, which will waste some bandwidth. I'm not sure if it would waste any storage: since the intermediate downloader stages will end up with identical contents, Docker might deduplicate the later one based on its hash to save storage.

The builds can have their concurrency limited to an arbitrary extent with job concurrency groups to get maximal use out of the downloader cache, but since each of these builds can take several hours, serializing them in any way may be a bit painful.
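In GitHub Actions terms, that would mean something like the following on each affected job (a sketch; the group name is arbitrary):

```yaml
# Sketch: jobs sharing a concurrency group run one at a time,
# so later jobs can reuse cache pushed by earlier ones.
concurrency:
  group: torch-build-cache
  cancel-in-progress: false
```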


> Also, I noticed that the workflows here only work out of the main repo, since we only run them on push. To make PRs from forks able to run, we need to add pull_request to the triggers for the top-level workflows, with the same path filters. This is mainly helpful for knowing that PRs result in working builds.

Pull Request Builds

This is a good suggestion; none of the containers in this repository are currently built on the pull_request trigger.

Building a PR from an arbitrary fork results in arbitrary code execution on the runner, so workflows triggered from pull_request events have limited access to secrets. This would mean that they can't log in or push to the ml-containers ghcr.io repository, and since those steps are baked into the build.yml script, the runs would fail automatically.

The build and publish steps would benefit from being disentangled from one another, with the publish step being optional. This could also allow builds in branches to be tested without necessarily publishing them to the container image repository forever. Since this is a somewhat involved, repo-wide change with its own design considerations, it would be good to make it a separate issue, implemented apart from this one.
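One common shape for that split, as a hedged sketch (step names and action versions are illustrative, not what build.yml actually uses):

```yaml
# Sketch: skip registry login and pushing on pull_request events,
# while still exercising the full build.
- name: Log in to ghcr.io
  if: github.event_name != 'pull_request'
  uses: docker/login-action@v2
  with:
    registry: ghcr.io
    username: ${{ github.actor }}
    password: ${{ secrets.GITHUB_TOKEN }}

- name: Build and push
  uses: docker/build-push-action@v4
  with:
    push: ${{ github.event_name != 'pull_request' }}
    tags: ghcr.io/coreweave/ml-containers/torch:${{ github.sha }}
```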

In the case of this PR, I test-ran the workflows by changing build.yml's runs-on: [self-hosted, Linux] to runs-on: ubuntu-20.04 in a separate branch of my forked repo, which lets most of the steps run, though inefficiently. So these workflows should be valid.

arsenetar commented 1 year ago

With regard to the image tags, I was not worried about the max limit, and it seems like it should be large enough. Splitting back into two different images for torch-nccl and torch-base was more about top-level clarity on there being two different "main" images than anything else (and slightly about the one image already existing). It also makes it slightly clearer that the torch-nccl image comes from torch-nccl.yaml, and so on. I'm not going to push too strongly for this, so if @wbrown is ok with the change to just a torch image with two distinct builds represented through the tags, then I'll accept as well.

I would not want to limit the concurrency of these builds at all; the cache misses are not the worst here. Even so, the matrix already creates a decent bit of invalidation between the different builds, even though there are now more sections that can be shared with this new Dockerfile.

PR builds from forks require approval to run, which does help mitigate some concerns; however, fork workflows do still have some differences, as you stated. Disabling push makes sense for forks, as the intent is to verify that the build succeeds more than anything else. The impacted parts can be disabled fairly straightforwardly within the "build and push" step. I do agree we can address that separately; I was mainly making a note for us.

wbrown commented 1 year ago

@arsenetar @Eta0 Are we good to merge this?

arsenetar commented 1 year ago

@wbrown It was up to you to decide the direction on the image change: torch-nccl and torch-base vs. torch (with nccl and base tags).

Eta0 commented 1 year ago

Confirmed with @wbrown; he said that torch with tag distinctions should be fine, so I'll merge it like this.