InternLM / lmdeploy

LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
https://lmdeploy.readthedocs.io/en/latest/
Apache License 2.0

[Feature] Create Cuda 12 docker images #1709

Open nickmitchko opened 2 months ago

nickmitchko commented 2 months ago

Motivation

Create container images for CUDA 12+ versions. Currently, to run a CUDA 12 container image, one must build their own image and edit the Dockerfile.

Related resources

I propose adding a new Dockerfile for specific CUDA versions, based on docker/Dockerfile:

# CUDA 12.3 example --
# Uses system python version
- FROM nvcr.io/nvidia/tritonserver:22.12-py3
+ FROM nvcr.io/nvidia/tritonserver:24.04-py3

RUN rm /etc/apt/sources.list.d/cuda*.list && apt-get update && apt-get install -y --no-install-recommends \
-    rapidjson-dev libgoogle-glog-dev gdb python3.8-venv \
+    rapidjson-dev libgoogle-glog-dev gdb python3-venv \
-    && rm -rf /var/lib/apt/lists/* && cd /opt && python3 -m venv py38
+    && rm -rf /var/lib/apt/lists/* && cd /opt && python3 -m venv py3

- ENV PATH=/opt/py38/bin:$PATH
+ ENV PATH=/opt/py3/bin:$PATH

RUN python3 -m pip install --no-cache-dir --upgrade pip &&\
    python3 -m pip install --no-cache-dir torch==2.1.0 torchvision==0.16.0 --index-url https://download.pytorch.org/whl/cu121 &&\
    python3 -m pip install --no-cache-dir cmake packaging wheel &&\
+    python3 -m pip install --no-cache-dir timm flash-attn # For Quality of Life
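
For completeness, building and running such an image might look roughly like this (tag, port, and model name below are only illustrative):

# Build the CUDA 12 image from the patched Dockerfile, then serve a model with it.
docker build -t lmdeploy:cuda12.3 -f docker/Dockerfile .
docker run --gpus all --rm -p 23333:23333 lmdeploy:cuda12.3 \
    lmdeploy serve api_server internlm/internlm2-chat-7b --server-port 23333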

Additional context

No response

lvhan028 commented 2 months ago

Hi, thanks for the proposal. We can provide the CUDA 12 Dockerfile as needed. However, I have some concerns about the inclusion of timm and flash-attn.

LMDeploy has its own implementation of flash attention, making flash-attn unnecessary for our use case. Additionally, while timm is used by some vision-language models, it is not universally adopted, so we might want to consider alternatives or customizations tailored to specific requirements.
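
If a particular deployment does need those packages, one option is to layer them on top of a published image in a small derived Dockerfile, for example (the base tag below is only a placeholder):

# Derived image adding optional extras; base tag is a placeholder for a future CUDA 12 tag.
FROM openmmlab/lmdeploy:latest
RUN python3 -m pip install --no-cache-dir timm flash-attn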

nickmitchko commented 2 months ago

Hi @lvhan028, excluding timm and flash-attn is probably fine for most use cases. It would be nice to have the following tags for the container:

With InternVL-1.5, it was a struggle getting the container into the proper state. I will share my build files here for reference in a bit.

lvhan028 commented 2 months ago

Sure. We can follow the tag rules you mentioned above.

nickmitchko commented 2 months ago

FROM nvcr.io/nvidia/tritonserver:24.04-py3

RUN rm /etc/apt/sources.list.d/cuda*.list && apt-get update && apt-get install -y --no-install-recommends \
    rapidjson-dev libgoogle-glog-dev gdb python3-venv \
    && rm -rf /var/lib/apt/lists/* && cd /opt && python3 -m venv py3

ENV PATH=/opt/py3/bin:$PATH

ARG CUDA_VERSION_MAJOR=12
ARG CUDA_VERSION_MINOR=4

ENV CUDA_VERSION_MAJOR=${CUDA_VERSION_MAJOR}
ENV CUDA_VERSION_MINOR=${CUDA_VERSION_MINOR}
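# The ARG defaults above can be overridden at build time, e.g.:
#   docker build --build-arg CUDA_VERSION_MAJOR=12 --build-arg CUDA_VERSION_MINOR=3 ...
# They are re-exported as ENV so the cmake invocation below picks them up.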

RUN python3 -m pip install --no-cache-dir --upgrade pip &&\
    python3 -m pip install --no-cache-dir torch==2.1.0 torchvision==0.16.0 --index-url https://download.pytorch.org/whl/cu121 &&\
    python3 -m pip install --no-cache-dir cmake packaging wheel &&\
    python3 -m pip install --no-cache-dir timm flash-attn

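# GROUP uses cooperative-group NCCL kernel launches, which helps when a single process manages multiple GPUs.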
ENV NCCL_LAUNCH_MODE=GROUP

# Should be in the lmdeploy root directory when building docker image
COPY . /opt/lmdeploy

WORKDIR /opt/lmdeploy

RUN cd /opt/lmdeploy &&\
    python3 -m pip install --no-cache-dir -r requirements.txt &&\
    mkdir -p build && cd build &&\
    cmake .. \
        -DCMAKE_BUILD_TYPE=RelWithDebInfo \
        -DCMAKE_EXPORT_COMPILE_COMMANDS=1 \
        -DCMAKE_INSTALL_PREFIX=/opt/tritonserver \
        -DBUILD_PY_FFI=ON \
        -DCUDA_VERSION_MAJOR=${CUDA_VERSION_MAJOR} \
        -DCUDA_VERSION_MINOR=${CUDA_VERSION_MINOR} \
        -DBUILD_MULTI_GPU=ON \
        -DBUILD_CUTLASS_MOE=OFF \
        -DBUILD_CUTLASS_MIXED_GEMM=OFF \
        -DCMAKE_CUDA_FLAGS="-lineinfo" \
        -DUSE_NVTX=ON &&\
    make -j$(nproc) && make install &&\
    cd .. &&\
    python3 -m pip install -e . &&\
    rm -rf build

ENV LD_LIBRARY_PATH=/opt/tritonserver/lib:$LD_LIBRARY_PATH
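
With this file, a build could look like the following (file name and tag are only examples; run it from the lmdeploy repository root so the COPY step sees the sources):

docker build -t lmdeploy:cuda12.4 -f docker/Dockerfile_cuda12 .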

josephrocca commented 2 months ago

> Currently, to run a CUDA 12 container image, one must build their own image and edit the Dockerfile.

Hmm, I may be misunderstanding or being naive here, but if the official image is built for CUDA 11.8 and the host driver supports >= 11.8, shouldn't the image work fine thanks to backwards compatibility? I've just tested that the latest official image (openmmlab/lmdeploy:v0.4.2) works fine for me on CUDA 12.0, 12.1, 12.2, 12.3, and 12.4 machines from Runpod, except for a 4090 on 12.3 for some reason, which I just opened an issue about.
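
A quick way to reproduce this kind of check (assuming the NVIDIA container toolkit is installed on the host):

docker run --gpus all --rm openmmlab/lmdeploy:v0.4.2 nvidia-smi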

These numbers are the result of running nvidia-smi from inside the official Docker image on various GPUs from Runpod:

✅ GPU: L40      Driver Version: 525.116.04   CUDA Version: 12.0
✅ GPU: 2x3090   Driver Version: 525.85.12    CUDA Version: 12.0
✅ GPU: 2x4090   Driver Version: 525.125.06   CUDA Version: 12.0
❌ GPU: 2x4090   Driver Version: 545.29.06    CUDA Version: 12.3
❌ GPU: 2x4090   Driver Version: 545.23.08    CUDA Version: 12.3
✅ GPU: A30      Driver Version: 545.23.08    CUDA Version: 12.3
✅ GPU: L40      Driver Version: 545.23.08    CUDA Version: 12.3
✅ GPU: A6000    Driver Version: 550.54.15    CUDA Version: 12.4
✅ GPU: 2x4090   Driver Version: 550.54.15    CUDA Version: 12.4