ServiceNow / Fast-LLM

Accelerating your LLM training to full speed
https://servicenow.github.io/Fast-LLM/

Optimize Docker Build Layers and Add Sudo Privileges for Fast-LLM Container #2

Closed. tscholak closed this 1 month ago

tscholak commented 1 month ago

I'd like to refine the Dockerfile slightly to improve build efficiency and add runtime flexibility for the Fast-LLM container. The changes are small but impactful, focusing on two main improvements:

  1. Improved Build Layering for Faster Rebuilds:
    • The build process is now split into two distinct stages:
      1. Dependency installation (based on setup.py, setup.cfg, pyproject.toml) is done first.
      2. Fast-LLM code installation is done last, by using the new --exclude= option enabled by Dockerfile syntax version 1.7-labs.
    • With this change the dependencies don't need to be reinstalled when the Fast-LLM source code changes. That can reduce rebuild times significantly since code changes land in different Docker image layers than dependencies.
  2. Added Sudo Privileges for Fast-LLM User:
    • Granted the fast_llm user password-less sudo privileges. This allows system adjustments (e.g., modifying system limits or adjusting host settings) directly from within the container.
    • I found this very useful in bare Kubernetes environments (like LambdaLabs), where I frequently needed to change system configurations (such as those controlled with ulimit) that do not persist across container restarts.
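
For reference, the layering and the sudo setup described above can be sketched as a Dockerfile. This is a reconstruction from the steps visible in the build output in this issue, not necessarily the exact file. The ARG line with its default and the use of $FAST_LLM_USER_ID in useradd are assumptions inferred from the --build-arg passed to docker build.

```dockerfile
# syntax=docker/dockerfile:1.7-labs
# The labs syntax is required for COPY --exclude further down.
FROM nvcr.io/nvidia/pytorch:24.07-py3

# System packages, including sudo for the runtime user.
RUN apt-get update \
    && apt-get install --no-install-recommends -y git-lfs sudo util-linux \
    && rm -rf /var/lib/apt/lists/* \
    && git lfs install

# Assumption: the user ID is a build argument (default chosen for illustration).
ARG FAST_LLM_USER_ID=1000
RUN useradd -m -u $FAST_LLM_USER_ID -s /bin/bash fast_llm \
    && echo 'fast_llm ALL=(ALL) NOPASSWD: ALL' >> /etc/sudoers

WORKDIR /app

# Layers that change rarely: the compiled extension and the Python dependencies.
COPY --chown=fast_llm ./fast_llm/csrc/ fast_llm/csrc/
RUN make -C ./fast_llm/csrc/
COPY --chown=fast_llm setup.py setup.cfg ./
RUN PIP_NO_INPUT=1 pip3 install --no-cache-dir ".[CORE,OPTIONAL,DEV]"

# Layers that change often: source code, copied only after the dependencies.
COPY --chown=fast_llm ./Megatron-LM Megatron-LM
COPY --chown=fast_llm ./examples examples
COPY --chown=fast_llm ./tests tests
COPY --chown=fast_llm ./tools tools
COPY --exclude=./fast_llm/csrc/ --chown=fast_llm ./fast_llm/ fast_llm/
COPY --chown=fast_llm pyproject.toml ./
RUN PIP_NO_INPUT=1 pip3 install --no-deps -e .
```

Docker reuses a cached layer only while that step's inputs and every earlier step are unchanged. In this ordering, an edit under fast_llm/ (outside csrc/) therefore invalidates the steps from the COPY of fast_llm/ onward, while the dependency installation stays cached.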

Here's a breakdown of the build time:

docker build --platform linux/amd64 -t torstenscholak663/fast-llm:latest --build-arg FAST_LLM_USER_ID=1000 .
[+] Building 48.3s (23/23) FINISHED  docker:desktop-linux
 => [internal] load build definition from Dockerfile  0.0s
 => => transferring dockerfile: 1.48kB  0.0s
 => resolve image config for docker-image://docker.io/docker/dockerfile:1.7-labs  0.7s
 => [auth] docker/dockerfile:pull token for registry-1.docker.io  0.0s
 => CACHED docker-image://docker.io/docker/dockerfile:1.7-labs@sha256:b99fecfe00268a8b556fad7d9c37ee25d716ae08a5d7320e6d51c4dd83246894  0.0s
 => [internal] load metadata for nvcr.io/nvidia/pytorch:24.07-py3  1.0s
 => [internal] load .dockerignore  0.0s
 => => transferring context: 163B  0.0s
 => [ 1/15] FROM nvcr.io/nvidia/pytorch:24.07-py3@sha256:f47441c102a810a27758b0b6274d46012ac15fd467119b2e1f0467be82bc8af3  0.0s
 => [internal] load build context  0.0s
 => => transferring context: 12.73kB  0.0s
 => CACHED [ 2/15] RUN apt-get update && apt-get install --no-install-recommends -y git-lfs sudo util-linux && rm -rf /var/lib/apt/lists/* && git lfs install  0.0s
 => CACHED [ 3/15] RUN useradd -m -u 1000 -s /bin/bash fast_llm && echo 'fast_llm ALL=(ALL) NOPASSWD: ALL' >> /etc/sudoers  0.0s
 => CACHED [ 4/15] WORKDIR /app  0.0s
 => [ 5/15] COPY --chown=fast_llm ./fast_llm/csrc/ fast_llm/csrc/  0.1s
 => [ 6/15] RUN make -C ./fast_llm/csrc/  4.9s
 => [ 7/15] COPY --chown=fast_llm setup.py setup.cfg ./  0.0s
 => [ 8/15] RUN PIP_NO_INPUT=1 pip3 install --no-cache-dir ".[CORE,OPTIONAL,DEV]"  35.7s
 => [ 9/15] COPY --chown=fast_llm ./Megatron-LM Megatron-LM  0.0s
 => [10/15] COPY --chown=fast_llm ./examples examples  0.0s
 => [11/15] COPY --chown=fast_llm ./tests tests  0.0s
 => [12/15] COPY --chown=fast_llm ./tools tools  0.0s
 => [13/15] COPY --exclude=./fast_llm/csrc/ --chown=fast_llm ./fast_llm/ fast_llm/  0.0s
 => [14/15] COPY --chown=fast_llm pyproject.toml ./  0.0s
 => [15/15] RUN PIP_NO_INPUT=1 pip3 install --no-deps -e .  4.6s
 => exporting to image  1.0s
 => => exporting layers  1.0s
 => => writing image sha256:f9b20cc3ca3c99ad8d3788cb6eacf5f48d518f48aec3bc3250f8d1d0d7cedeb3  0.0s
 => => naming to docker.io/torstenscholak663/fast-llm:latest  0.0s

jlamypoirier commented 1 month ago

  • With this change the dependencies don't need to be reinstalled when the Fast-LLM source code changes. That can reduce rebuild times significantly since code changes land in different Docker image layers than dependencies.

Not sure I'm following here, was it not the case already?

tscholak commented 1 month ago

Not sure I'm following here, was it not the case already?

People were telling me it was not. I never checked those claims; I just reworked the Dockerfile so that it was clear we wouldn't rebuild everything on every small code change. Looks like that wasn't truly necessary, so I removed those changes.