containers / ramalama

The goal of RamaLama is to make working with AI boring.

Create Nvidia Branch for testing and development #239

Open bmahabirbu opened 5 hours ago

bmahabirbu commented 5 hours ago

Hi Eric, it's Brian Mahabir. It was a pleasure to meet you at DevConf!

I have a bare-bones, fully working demo that utilizes Nvidia GPU support. I'd like to have an nv branch on the ramalama repo to keep a record of the development.

Although you can build llama.cpp with GPU support, it doesn't work seamlessly. For example, it takes time to offload the model to VRAM, so when using the GPU you have to wait about 15-30 seconds before the chatbox appears. Ollama handles this much better, albeit at the cost of some performance. Since Ollama is built on top of llama.cpp, the plan is to look at how it handles GPU support and integrate those changes for a better experience.
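
From what I can tell, Ollama mostly hides that latency by keeping a llama.cpp server process alive, so the model stays resident in VRAM after the first load and only the first request pays the offload cost. A rough Python sketch of that idea (assuming a GPU-enabled llama-server is already running on localhost:8080; this isn't how ramalama is wired up today):

# Sketch: talk to a long-lived llama-server instead of starting a fresh
# process per prompt, so the model stays loaded in VRAM between requests.
# Assumes llama-server was started separately, e.g.:
#   llama-server -m model.gguf -ngl 99 --port 8080
import json
import urllib.request

def complete(prompt, n_predict=128):
    # llama-server's /completion endpoint returns the generated text
    # in the "content" field of its JSON reply
    req = urllib.request.Request(
        "http://localhost:8080/completion",
        data=json.dumps({"prompt": prompt, "n_predict": n_predict}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["content"]

print(complete("Hello, how are you?"))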

ericcurtin commented 4 hours ago

Hey @bmahabirbu, great to hear from you again. @swarajpande5 was looking into Nvidia support too; maybe you two can look into it together. For branching, just fork the repo, create a branch, and open a PR that way; that's generally how we do things.

Funnily enough, I got AMD GPUs working today by building a container like this:

FROM registry.access.redhat.com/ubi9/ubi:9.4-1214.1726694543

# renovate: datasource=github-releases depName=huggingface/huggingface_hub extractVersion=^v(?<version>.*)
ARG HUGGINGFACE_HUB_VERSION=0.25.1
# renovate: datasource=github-releases depName=containers/omlmd extractVersion=^v(?<version>.*)
ARG OMLMD_VERSION=0.1.5
# renovate: datasource=github-releases depName=tqdm/tqdm extractVersion=^v(?<version>.*)
ARG TQDM_VERSION=4.66.5
ARG LLAMA_CPP_SHA=70392f1f81470607ba3afef04aa56c9f65587664
# renovate: datasource=git-refs depName=ggerganov/whisper.cpp packageName=https://github.com/ggerganov/whisper.cpp gitRef=master versioning=loose type=digest
ARG WHISPER_CPP_SHA=5caa19240d55bfd6ee316d50fbad32c6e9c39528

# vulkan-headers vulkan-loader-devel vulkan-tools glslc glslang python3-pip mesa-libOpenCL-$MESA_VER.aarch64
RUN dnf install -y https://dl.fedoraproject.org/pub/epel/epel-release-latest-9.noarch.rpm && \
    crb enable && \
    dnf install -y epel-release && \
    dnf --enablerepo=ubi-9-appstream-rpms install -y git procps-ng vim \
      dnf-plugins-core python3-dnf-plugin-versionlock cmake gcc-c++ \
      python3-pip python3-argcomplete && \
    dnf clean all && \
    rm -rf /var/cache/*dnf*

RUN /usr/bin/python3 --version
RUN pip install "huggingface_hub[cli]==${HUGGINGFACE_HUB_VERSION}"
RUN pip install "omlmd==${OMLMD_VERSION}"
RUN pip install "tqdm==${TQDM_VERSION}"

ARG ROCM_VERSION=6.2.2
ARG AMDGPU_VERSION=6.2.2

RUN <<EOF
cat <<EOD > /etc/yum.repos.d/rocm.repo
[amdgpu]
name=amdgpu
baseurl=https://repo.radeon.com/amdgpu/$AMDGPU_VERSION/rhel/9.4/main/x86_64/
enabled=1
priority=50
gpgcheck=1
gpgkey=https://repo.radeon.com/rocm/rocm.gpg.key

[ROCm]
name=ROCm
baseurl=https://repo.radeon.com/rocm/rhel9/$ROCM_VERSION/main
enabled=1
priority=50
gpgcheck=1
gpgkey=https://repo.radeon.com/rocm/rocm.gpg.key
EOD
EOF

RUN dnf config-manager --add-repo https://mirror.stream.centos.org/9-stream/AppStream/$(uname -m)/os/
RUN curl -o /etc/pki/rpm-gpg/RPM-GPG-KEY-CentOS-Official http://mirror.centos.org/centos/RPM-GPG-KEY-CentOS-Official
RUN rpm --import /etc/pki/rpm-gpg/RPM-GPG-KEY-CentOS-Official

RUN dnf install -y rocm && \
    dnf clean all && \
    rm -rf /var/cache/*dnf*

RUN git clone https://github.com/ggerganov/llama.cpp && \
    cd llama.cpp && \
    git reset --hard ${LLAMA_CPP_SHA} && \
    cmake -B build -DCMAKE_INSTALL_PREFIX:PATH=/usr -DGGML_CCACHE=0 \
      -DGGML_HIPBLAS=1 && \
    cmake --build build --config Release -j $(nproc) && \
    cmake --install build && \
    cd / && \
    rm -rf llama.cpp

RUN git clone https://github.com/ggerganov/whisper.cpp.git && \
    cd whisper.cpp && \
    git reset --hard ${WHISPER_CPP_SHA} && \
    make -j $(nproc) && \
    mv main /usr/bin/whisper-main && \
    mv server /usr/bin/whisper-server && \
    cd / && \
    rm -rf whisper.cpp

And then figuring out which GPU has the largest VRAM via:

rocm-smi --showmeminfo vram --json
{"card0": {"VRAM Total Memory (B)": "8573157376", "VRAM Total Used Memory (B)": "27136000"}, "card1": {"VRAM Total Memory (B)": "536870912", "VRAM Total Used Memory (B)": "371126272"}}
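
Something like this could pick that card automatically (just a sketch, assuming the JSON layout above and that rocm-smi is on the PATH):

# Sketch: pick the card with the most total VRAM from rocm-smi's JSON output.
# Assumes the "cardN" keys and byte-count strings shown above.
import json
import subprocess

def largest_vram_card():
    out = subprocess.run(
        ["rocm-smi", "--showmeminfo", "vram", "--json"],
        capture_output=True, text=True, check=True,
    ).stdout
    cards = json.loads(out)
    best = max(cards, key=lambda c: int(cards[c]["VRAM Total Memory (B)"]))
    return int(best.removeprefix("card"))  # e.g. "card0" -> 0

print(largest_vram_card())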

Selecting that GPU via:

export HIP_VISIBLE_DEVICES=0

and adding --ngl 99 to the llama.cpp command line. But all the bits still have to be put together properly.
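
Gluing those two pieces together could look roughly like this (again only a sketch, not something ramalama does yet; the model path is a placeholder):

# Sketch: pin the chosen card and run llama.cpp with all layers offloaded.
# HIP_VISIBLE_DEVICES restricts ROCm to that card; -ngl 99 offloads every layer.
import os
import subprocess

def run_on_gpu(card_index, model_path):
    env = dict(os.environ, HIP_VISIBLE_DEVICES=str(card_index))
    subprocess.run(
        ["llama-cli", "-m", model_path, "-ngl", "99", "-p", "Hello"],
        env=env, check=True,
    )

run_on_gpu(0, "/path/to/model.gguf")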

Then this 3B model works with GPU offload via ROCm to the AMD GPU:

/usr/bin/ramalama run llama3.2

bmahabirbu commented 3 hours ago

That's fantastic! Which branch should I open the PR against?

Here's the container I used:

FROM nvidia/cuda:12.2.0-devel-ubi9

# renovate: datasource=github-releases depName=huggingface/huggingface_hub extractVersion=^v(?<version>.*)
ARG HUGGINGFACE_HUB_VERSION=0.25.0
# renovate: datasource=github-releases depName=containers/omlmd extractVersion=^v(?<version>.*)
ARG OMLMD_VERSION=0.1.4
# renovate: datasource=git-refs depName=ggerganov/llama.cpp packageName=https://github.com/ggerganov/llama.cpp gitRef=master versioning=loose type=digest
ARG LLAMA_CPP_SHA=32b2ec88bc44b086f3807c739daf28a1613abde1
# renovate: datasource=git-refs depName=ggerganov/whisper.cpp packageName=https://github.com/ggerganov/whisper.cpp gitRef=master versioning=loose type=digest
ARG WHISPER_CPP_SHA=5caa19240d55bfd6ee316d50fbad32c6e9c39528

# vulkan-headers vulkan-loader-devel vulkan-tools glslc glslang python3-pip mesa-libOpenCL-$MESA_VER.aarch64
RUN dnf install -y https://dl.fedoraproject.org/pub/epel/epel-release-latest-9.noarch.rpm && \
    crb enable && \
    dnf install -y epel-release && \
    dnf --enablerepo=ubi-9-appstream-rpms install -y git procps-ng vim \
      dnf-plugins-core python3-dnf-plugin-versionlock cmake gcc-c++ \
      python3-pip && \
    dnf clean all && \
    rm -rf /var/cache/*dnf*

RUN /usr/bin/python3 --version
RUN pip install "huggingface_hub[cli]==${HUGGINGFACE_HUB_VERSION}"
RUN pip install "omlmd==${OMLMD_VERSION}"

# The real libcuda.so.1 comes from the host driver at runtime, so point the
# build at the CUDA stub library (ENV rather than RUN so the setting persists
# for the build steps below).
RUN ln -s /usr/local/cuda/lib64/stubs/libcuda.so /usr/local/cuda/lib64/stubs/libcuda.so.1
ENV LD_LIBRARY_PATH=/usr/local/cuda/lib64/stubs/

ENV GGML_CCACHE=0

# Build wouldn't complete with cmake even with the NVIDIA Container Toolkit installed, so build with make instead

RUN git clone https://github.com/ggerganov/llama.cpp && \
    cd llama.cpp && \
    git reset --hard ${LLAMA_CPP_SHA} && \
    make -j $(nproc) GGML_CUDA=1 CUDA_ARCH=ALL && \
    mv llama-cli /usr/bin/llama-cli && \
    mv llama-server /usr/bin/llama-server && \
    cd / && \
    rm -rf llama.cpp

RUN git clone https://github.com/ggerganov/whisper.cpp.git && \
    cd whisper.cpp && \
    git reset --hard ${WHISPER_CPP_SHA} && \
    make -j $(nproc) && \
    mv main /usr/bin/whisper-main && \
    mv server /usr/bin/whisper-server && \
    cd / && \
    rm -rf whisper.cpp

I didn't use RHEL, but I bet the change is minimal, something like this:

FROM registry.redhat.io/rhel9/rhel:9.4

# Install NVIDIA CUDA components
# (NVIDIA's CUDA repo for RHEL 9 would need to be enabled first, since these
# packages aren't in the default RHEL repos)
RUN yum -y install \
    cuda-12-2-devel \
    && yum clean all

# Continue

The only issue I ran into was building the image.

The container wouldn't build llama.cpp using cmake, even with nvidia-container-toolkit installed. It had trouble finding libcuda.so.1 even when I explicitly added a symlink to it. I ended up using plain make and things worked out. I think it had something to do with the CMake configuration for llama.cpp's CUDA build. It's great that the AMD build can use cmake properly.

Otherwise everything looks very similar!

Great method to figure out the VRAM and adjust the --ngl parameter accordingly.

How should I reach out to @swarajpande5 to get things rolling?

Thanks Eric!

ericcurtin commented 2 hours ago

@bmahabirbu open it against the main branch, somewhere like:

container-images/ramalama/latest-cuda/Containerfile

I think you did the right thing using the nvidia/cuda image, by the way. It is a UBI9 image at the end of the day, and Nvidia do a good job of maintaining those. Although I think we should try the latest version:

nvidia/cuda:12.6.1-devel-ubi9

rather than:

nvidia/cuda:12.2.0-devel-ubi9