coreweave / ml-containers

MIT License

feat(llm-finetuner): Migrate LLM finetuner image from `kubernetes-cloud` #21

Open Eta0 opened 1 year ago

Eta0 commented 1 year ago

LLM Finetuner Container

This re-homes the container for coreweave/kubernetes-cloud's LLM finetuner by copying over its Dockerfile and compiler wrapper as they appeared in commit 6c10019 under the directory finetuner-workflow/finetuner in that repository, with some updates for cross-repository downloading added to the build.

Neat Things About the Build

The coreweave/kubernetes-cloud repository is absolutely massive. At the time of writing, `git clone https://github.com/coreweave/kubernetes-cloud` thwacks you with a 607 MiB download, primarily comprising nearly 400 MiB of image files under /docs and an almost 200 MiB .git directory. That is over-the-top for just a handful of files, so this container's build is configured to do sparse checkouts that cut the download size roughly 1000x, to a bit under 600 KiB. Deleting the .git directory at the end of the download step then shrinks what is kept to just a few dozen kilobytes.

It's a nice improvement that could be integrated into this repository's sd-finetuner container build as well, which currently leaves that full 600+ MiB repository in its final image.
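For reference, the sparse download described above can be sketched roughly as follows. This is a minimal illustration, not the exact commands in this PR's Dockerfile: the function name `sparse_fetch`, its argument order, and the particular combination of `--depth 1` and `--filter=blob:none` are assumptions chosen to show the idea. It assumes a reasonably recent git and a server that accepts fetching by commit hash.

```shell
# Hypothetical helper (not taken from the PR): fetch only selected paths of a
# repository at one commit, then drop the .git directory.
# Usage: sparse_fetch REPO_URL COMMIT DEST PATH...
sparse_fetch() (
    repo_url="$1"
    commit="$2"
    dest="$3"
    shift 3
    # Start from an empty repository so nothing is downloaded up front.
    git init --quiet "$dest" &&
    git -C "$dest" remote add origin "$repo_url" &&
    # Restrict the working tree to the requested paths.
    git -C "$dest" sparse-checkout set "$@" &&
    # Fetch a single commit without history, asking the server to omit
    # file contents until they are actually needed.
    git -C "$dest" fetch --quiet --depth 1 --filter=blob:none origin "$commit" &&
    git -C "$dest" checkout --quiet FETCH_HEAD &&
    # The repository metadata is not needed once the files are on disk.
    rm -rf "$dest/.git"
)
```

A call for this container might look like `sparse_fetch https://github.com/coreweave/kubernetes-cloud "$COMMIT" /tmp/finetuner finetuner-workflow/finetuner`, where `COMMIT` is assumed to hold a full commit hash and `/tmp/finetuner` is an arbitrary destination.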

Weird Things About the Build

Building from a Branch

Branch names can be used as commit identifiers for these builds, but probably should not be: Docker may cache the download step under the branch's name, so a branch that has since received updates would not be re-downloaded in its updated state. The hash of the latest commit should be used instead.
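One way to enforce this before kicking off a build is a small guard that rejects anything that is not a full commit hash. The sketch below is only an illustration: the function name `is_full_commit_hash` is hypothetical, and it assumes 40-character lowercase SHA-1 hashes.

```shell
# Hypothetical guard (not part of this PR): succeed only if the argument looks
# like a full 40-character lowercase hexadecimal SHA-1 commit hash.
is_full_commit_hash() {
    printf '%s' "$1" | grep -Eq '^[0-9a-f]{40}$'
}
```

A branch name such as `master` or an abbreviated hash such as `6c10019` fails the check. To pin a branch to its current tip, `git ls-remote https://github.com/coreweave/kubernetes-cloud refs/heads/<branch>` prints the full hash to pass to the build instead.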

Coupling

There is currently no default commit defined for the build and, accordingly, no rule to automatically rebuild the image when updates are pushed here. The list of files copied during the build process is very specific and doesn't adapt well between versions of the source. Copying the entire finetuner-workflow/finetuner directory into the final image could alleviate this a bit, but I still see this potentially becoming very annoying to manage across many possible concurrent branches in kubernetes-cloud that could each require distinct build instructions over here, and tracking down corresponding historical changes across the two repositories seems painful.

To make that better, we could work on making the build instructions very generic, like including a version-controlled install.sh (or something) over in kubernetes-cloud and running most of the work in there. Alternatively, the LLM finetuner could have its own repository with this container published in it.

As a third option, this entire Dockerfile could be left in kubernetes-cloud, versioned with the rest of the source, and we could dynamically download it and build it here in ml-containers from any given commit entirely through a workflow, without any corresponding directory here (or maybe one with only a README). This would cut down on the headache of managing the source in multiple disconnected places while still keeping the container in the central ml-containers repository.

I'd welcome some thoughts on this point.