Closed — Eta0 closed this pull request 1 year ago
This looks awesome; I like the nice PR doc.

Those image tags are getting long. Splitting does have some impact on the cache: since the cache is synced at the end of each of the CI jobs, anything running in parallel in CI will miss cache created by other parallel runs (which is unfortunate, since these often get run together). I would almost prefer keeping `torch-base` and `torch-nccl` as two separate images to maybe split that difference, but that's not a very strong opinion.
I also noticed that the workflows here only work out of the main repo, since we only run them on `push`. To make PRs from forks able to run, we need to add `pull_request` to the triggers for the top-level workflows, with the same path filters. This is mainly helpful for knowing that PRs result in working builds.
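As a rough sketch of what that trigger change could look like (the path filter below is a placeholder, not the repository's actual filter):

```yaml
on:
  push:
    paths:
      - "torch/**"
  pull_request:
    # Same path filters as the push trigger, so fork PRs get build feedback.
    paths:
      - "torch/**"
```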
@arsenetar

> This looks awesome; I like the nice PR doc.
>
> Those image tags are getting long. Splitting does have some impact on the cache [...] I would almost prefer keeping `torch-base` and `torch-nccl` as two separate images to maybe split that difference, but that's not a very strong opinion.
As far as legal limits, the maximum length for a Docker image tag (excluding the name) is 128 characters. A fully formatted `torch:nccl` tag with a 7-character hash and no branch prefix looks like `abc1234-nccl-cuda11.8.0-nccl2.16.2-1-torch2.0.0-rc1-vision0.15.0-rc1`, at 68 characters (though these are a bit longer than usual because of the `-rc1` version suffixes). This would leave about half of the character quota for an optional branch name, which should be plenty of room.
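For a quick sanity check of that arithmetic, using the example tag from above:

```shell
# The example torch:nccl tag, with a 7-character hash and no branch prefix.
tag='abc1234-nccl-cuda11.8.0-nccl2.16.2-1-torch2.0.0-rc1-vision0.15.0-rc1'
# Print its length; Docker tags max out at 128 characters.
echo "${#tag}"   # 68
```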
As far as aesthetics, having an extremely descriptive tag is pretty nice for visually scanning through versions quickly in the container repository, and for being able to pluck out the right container without having to cross-reference docs/code/commits or manually download and probe images. It also ought to help people who use these in a `Dockerfile` keep track of any major changes when they want to update the version they are using later on.

So I think it'd be beneficial enough from a usability perspective to outweigh the visual clunkiness.
On the first run they will miss out on cache sharing, though on most subsequent updates the cache should help them out. For instance, if the base images for `torch:nccl` change, both parallel runs should be able to reuse the downloaded PyTorch repositories. And since the builds running in parallel all use different base images, they wouldn't be able to share their cache for the builder steps anyway, so there isn't too much missed opportunity in that direction.

If the downloaded repositories change, both builds will re-download them in parallel, which will waste some bandwidth. I'm not sure whether it would waste any storage: since the intermediate downloader stages end up with identical contents, Docker might deduplicate the later one based on its hash.

The builds' concurrency can be limited to an arbitrary extent with job concurrency groups to get maximal use out of the downloader cache, but since each of these builds can take several hours, serializing them in any way may be a bit painful.
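If we ever did want that, a GitHub Actions concurrency group would look roughly like the sketch below (the group key is illustrative, not something this repo defines):

```yaml
# Serialize runs that would compete for the same downloader cache.
concurrency:
  group: torch-build-${{ github.ref }}
  cancel-in-progress: false
```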
> I also noticed that the workflows here only work out of the main repo, since we only run them on `push`. To make PRs from forks able to run, we need to add `pull_request` to the triggers for the top-level workflows, with the same path filters. This is mainly helpful for knowing that PRs result in working builds.
This is a good suggestion; none of the containers in this repository are currently built on the `pull_request` trigger.

Building a PR from an arbitrary fork results in arbitrary code execution on the runner, so workflows triggered from `pull_request` events have limited access to secrets. This means that they can't log in or push to the ml-containers `ghcr.io` repository, and since those steps are baked into the `build.yml` script, the runs would fail automatically.
The build and publish steps would benefit from being disentangled from one another, with the publish step being optional. This could also allow builds in branches to be tested without necessarily publishing them to the container image repository forever. Since this is a somewhat involved, repo-wide change with its own design considerations, it would be good to make it a separate issue, implemented apart from this one.
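One common way to make the publish steps optional is to gate them on the event type; a sketch (step names and commands here are illustrative, not the repo's actual `build.yml`):

```yaml
steps:
  - name: Build image
    run: docker buildx build --tag "$IMAGE_TAG" .
  - name: Log in to ghcr.io
    # Skip on pull_request events, where registry secrets are unavailable.
    if: github.event_name != 'pull_request'
    run: echo "${{ secrets.GITHUB_TOKEN }}" | docker login ghcr.io -u "${{ github.actor }}" --password-stdin
  - name: Push image
    if: github.event_name != 'pull_request'
    run: docker push "$IMAGE_TAG"
```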
In the case of this PR, I test-ran the workflows by modifying `build.yml`'s `runs-on: [self-hosted, Linux]` to `runs-on: ubuntu-20.04` in a separate branch of my forked repo, which lets most of the steps run, though inefficiently. So these should be valid.
With regards to image tags, I was not worried about the max limit; it seems like it should be large enough. Splitting back into two different images for `torch-nccl` and `torch-base` was more about top-level clarity on there being two different "main" images than anything else (and slightly due to the one image already existing). It also makes it slightly clearer that the `torch-nccl` image comes from `torch-nccl.yaml`, and such. I'm not going to push too strongly for this, so if @wbrown is OK with the change to just a `torch` image with the two distinct builds represented through the tags, then I'll accept it as well.
I would not want to limit the concurrency of these builds at all; the cache misses are not the worst here, and even so, the matrix already creates a decent bit of invalidation between the different builds, even though there are now more sections that can be shared with this new Dockerfile.
PR builds from forks require approval to run, which does help mitigate some concerns; however, fork workflows do still have some differences, as you stated. Disabling the push makes sense for forks, as the intent is to verify that the build succeeds more than anything else. The impacted parts can be disabled fairly straightforwardly within the "build and push" step. I do agree we can address that separately; I was mainly making a note for us.
@arsenetar @Eta0 Are we good to merge this?
@wbrown It was up to you to decide the direction on the image change: `torch-nccl` and `torch-base` vs. `torch` (with `nccl` and `base` tags).
Confirmed with @wbrown; he said that `torch` with tag distinctions should be fine, so I'll merge it like this.
## Multiple Torch Builds

This PR adds support for two distinct types of PyTorch builds:

- `torch:base`
  - Like the `gooseai/torch-base`-type containers
  - Built in `nvidia/cuda:[...]-devel-[...]`, delivered on `nvidia/cuda:[...]-base-[...]`
- `torch:nccl`
  - Like the `ghcr.io/coreweave/nccl-tests`-based `torch-nccl` containers from this repository

These are both grouped under the "`torch`" container name, with differing tags.

## Tag Suffixes
The existing tag format for containers in this repository is `:<short commit hash>` for commits to the `main` branch, and `:<branch name>-<short commit hash>` for commits to other branches. Since these torch builds are highly parameterized (potential variables: edition [`torch:base`/`torch:nccl`], CUDA version, NCCL version, torch version, torchvision version), it is beneficial to allow for tag customization to denote different build versions, which may also benefit from the then-shared build cache.

To achieve this, this PR adds an optional `tag-suffix` parameter to the `.github/workflows/build.yml` template. This allows for tags of the form `[<branch name>-]<short commit hash>[-<custom suffix>]`.
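The resulting tag composition can be illustrated with a small shell sketch (the variable values here are hypothetical):

```shell
# Hypothetical inputs: branch is empty for commits to main,
# suffix is empty when no tag-suffix parameter is given.
branch='mybranch'
hash='abc1234'
suffix='nccl-cuda11.8.0'
# [<branch name>-]<short commit hash>[-<custom suffix>]
tag="${branch:+${branch}-}${hash}${suffix:+-${suffix}}"
echo "$tag"   # mybranch-abc1234-nccl-cuda11.8.0
```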
### Examples

`torch:base` builds are suffixed with a version string to result in the format `base-cuda{}-torch{}-vision{}`, where `{}` is a placeholder for the respective library's version number.

`torch:nccl` builds have a similar format: `nccl-cuda{}-nccl{}-torch{}-vision{}`.

## Fix for UCC integration in `torch:nccl` builds

`torch:nccl` builds now enable UCC integration, linking against HPC-X.

This was intended to be enabled when the `torch-nccl` builds were added, but the right combination of flags had not been set (`USE_SYSTEM_UCC` does not imply `USE_UCC`, for some reason), and furthermore a bug would have made it impossible to link against the UCX library in `nccl-tests`'s HPC-X distribution anyway.

The combined `torch` Dockerfile now contains a workaround to fix that bug when `nccl-tests` is used as a base, and properly links against UCX.

## Build process optimizations
The `torch` builds have been optimized for size, speed, and cache-friendliness by making use of more build stages, as well as cross-stage filesystem mounts.

Highlights:
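In general, a cross-stage filesystem mount looks something like the sketch below (the stage names, base images, and paths are hypothetical, not this repository's actual Dockerfile):

```dockerfile
# syntax=docker/dockerfile:1
FROM alpine/git AS downloader
RUN git clone --depth 1 https://github.com/pytorch/pytorch.git /git/pytorch

FROM alpine AS builder
# Bind-mount the downloader stage's checkout instead of COPYing it,
# so the sources never become part of this stage's layers.
RUN --mount=type=bind,from=downloader,source=/git,target=/git \
    ls /git/pytorch
```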