LLM Finetuner Container
This re-homes the container for coreweave/kubernetes-cloud's LLM finetuner by copying over its Dockerfile and compiler wrapper as they appeared in commit 6c10019 under the directory finetuner-workflow/finetuner in that repository, with some updates for cross-repository downloading added to the build.
Neat Things About the Build
The coreweave/kubernetes-cloud repository is absolutely massive. At the time of writing, git clone https://github.com/coreweave/kubernetes-cloud thwacks you with a 607 MiB download, primarily comprising nearly 400 MiB of image files under /docs and an almost 200 MiB .git directory. This is a bit over-the-top to download just a handful of files, so this container's build is configured to do sparse checkouts that reduce the download size 1000x, to a bit under 600 KiB, which is further reduced to just a few dozen kilobytes by deleting the .git directory at the end of the download step.
It's a nice improvement that could be integrated into this repository's sd-finetuner container build as well, which currently leaves that full 600+ MiB repository in its final image.
Weird Things About the Build
Building from a Branch
Branch names can, but probably should not, be used as commit identifiers for these builds. Docker may cache the download step keyed on the branch's name, so a branch that has since received updates can be served from the stale cached layer instead of being re-downloaded in its updated state. The hash of the branch's latest commit should be used instead.
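As a rough illustration of the sparse checkout described under Neat Things About the Build, pinned to a commit hash as recommended here, the download step might look something like the sketch below. The function names sparse_fetch and resolve_branch are hypothetical, and the exact flags are an assumption rather than a copy of what this container's Dockerfile runs. Note that git fetch needs the full commit hash; an abbreviated one is not accepted.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; not taken from this repository's Dockerfile.

# resolve_branch REPO_URL BRANCH
# Prints the full hash of the commit the branch currently points at,
# so the build can be pinned to a hash instead of a branch name.
resolve_branch() {
    git ls-remote "$1" "refs/heads/$2" | cut -f1
}

# sparse_fetch REPO_URL FULL_COMMIT_HASH SUBDIR DEST
# Downloads only SUBDIR of the repository at the given commit into DEST,
# then deletes the .git directory so it does not end up in the image.
sparse_fetch() {
    repo_url=$1
    commit=$2
    subdir=$3
    dest=$4

    git init -q "$dest" &&
    git -C "$dest" remote add origin "$repo_url" &&
    # Classic (non-cone) sparse checkout: only paths under SUBDIR are
    # written to the working tree on checkout.
    git -C "$dest" config core.sparseCheckout true &&
    mkdir -p "$dest/.git/info" &&
    printf '/%s/\n' "$subdir" > "$dest/.git/info/sparse-checkout" &&
    # Fetch just this commit with no history, and ask the server to hold
    # back file contents until checkout needs them. Servers without
    # partial clone support ignore the filter with a warning.
    git -C "$dest" fetch -q --depth 1 --filter=blob:none origin "$commit" &&
    git -C "$dest" checkout -q FETCH_HEAD &&
    rm -rf "$dest/.git"
}
```

A call along the lines of sparse_fetch https://github.com/coreweave/kubernetes-cloud <full-commit-hash> finetuner-workflow/finetuner kubernetes-cloud would then leave only that directory behind. Passing the hash into the build, rather than a branch name, means the text of the download step changes whenever the target commit changes, so a stale cached layer is not reused.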
Coupling
There is currently no default commit defined for the build and, accordingly, no rule to automatically rebuild the image on updates pushed here. The list of files copied during the build process is very specific and doesn't adapt well between versions of the source. This could be alleviated a bit by copying over the entire finetuner-workflow/finetuner directory into the final image, but I still see this becoming very annoying to manage across the many possible concurrent branches in kubernetes-cloud that could each require distinct build instructions over here, and tracking down corresponding historical changes across the two repositories seems painful.
To make that better, we could work on making the build instructions very generic, for example by including a version-controlled install.sh (or something similar) over in kubernetes-cloud and running most of the work in there. Alternatively, the LLM finetuner could have its own repository with this container published in it.
As a third option, this entire Dockerfile could be left in kubernetes-cloud, versioned with the rest of the source, and we could dynamically download and build it here in ml-containers from any given commit entirely through a workflow, without any corresponding directory here (or maybe one with only a README). This would cut down on the headache of managing the source in multiple disconnected places while still keeping the container in the central ml-containers repository.
I'd welcome some thoughts on this point.