MoonKraken / rusty_llama

A simple ChatGPT clone in Rust on both the frontend and backend. Uses open source language models and TailwindCSS.

Docker container running CUDA support #8

Open cwysong85 opened 1 year ago

cwysong85 commented 1 year ago

I've created a Dockerfile for CUDA support. There are also a few changes that need to be made in Cargo.toml.

Here are the changes for the Cargo.toml:

# Change the features to cublas
llm = { git = "https://github.com/rustformers/llm.git", branch = "main", optional = true, features = ["cublas"] }

# Change site-addr so the server listens on all interfaces
site-addr = "0.0.0.0:3000"
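
For context, those are the only two lines that change; in a cargo-leptos project the site-addr key lives in the [package.metadata.leptos] section, so the edited parts of Cargo.toml look roughly like this (other keys unchanged, exact surrounding entries in this repo may differ):

[dependencies]
# switch the llm crate to its cuBLAS feature
llm = { git = "https://github.com/rustformers/llm.git", branch = "main", optional = true, features = ["cublas"] }

[package.metadata.leptos]
# listen on all interfaces so the server is reachable from outside the container
site-addr = "0.0.0.0:3000"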

Here's the Dockerfile:

FROM nvidia/cuda:12.2.0-devel-ubuntu20.04

SHELL ["/bin/bash", "-ec"]

ARG DEBIAN_FRONTEND=noninteractive

WORKDIR /usr/src/app
RUN mkdir /usr/local/models

# base build tools
RUN apt-get update && apt-get install -y git build-essential curl libssl-dev pkg-config vim

# install Rust via rustup
RUN curl --proto '=https' --tlsv1.3 https://sh.rustup.rs -sSf | bash -s -- -y
ENV PATH="/root/.cargo/bin:$PATH"

# install NodeJS 20.x
RUN curl -fsSL https://deb.nodesource.com/setup_20.x | bash
RUN apt-get install -y nodejs

# put the CUDA toolchain on PATH so the cuBLAS build can find nvcc
ENV PATH="$PATH:/usr/local/cuda/bin"
ENV CUDACXX=/usr/local/cuda/bin/nvcc

COPY . .

RUN rustup toolchain install nightly
RUN rustup target add wasm32-unknown-unknown
RUN cargo install trunk cargo-leptos

# install JS dependencies and build the Tailwind stylesheet
RUN npm install

RUN npx tailwindcss -i ./input.css -o ./style/output.css

EXPOSE 3000/tcp

CMD ["cargo", "leptos", "watch"]

I'm not all that familiar with Rust, so I'm just using the cargo leptos watch command to run this. I figure you could run cargo leptos build and then start the server that way instead, but maybe you can take this and run with it.
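
If someone wants to go the release route, here's a rough sketch of what the end of the Dockerfile could look like, assuming cargo-leptos keeps its usual output layout (judging from the debug path in the nvidia-smi output below, target/server/debug/leptos_start, the release binary would presumably land under target/server/release/; the env var values just mirror the settings above):

# build frontend + server in release mode
RUN cargo leptos build --release

# runtime settings read by the leptos server (values assume the defaults used in this setup)
ENV LEPTOS_SITE_ADDR="0.0.0.0:3000"
ENV LEPTOS_SITE_ROOT="target/site"

# run the compiled binary instead of the watch loop
CMD ["./target/server/release/leptos_start"]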

Docker build:

docker build -t rusty_llama .

Docker run:

docker run -it --rm -v /usr/local/models/GGML_Models:/usr/local/models -e MODEL_PATH="/usr/local/models/nous-hermes-llama-2-7b.ggmlv3.q8_0.bin" -p 3000:3000 --runtime=nvidia --gpus all rusty_llama bash

This drops you into bash so you can run it from inside the container with cargo leptos watch. I do see the CUDA devices being found and the model loading into GPU VRAM; however, the GPU isn't being used for queries, and I still see the CPU doing the work.
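
Once that works, dropping the -it and the trailing bash should let the image's CMD run it non-interactively, e.g.:

docker run --rm -v /usr/local/models/GGML_Models:/usr/local/models -e MODEL_PATH="/usr/local/models/nous-hermes-llama-2-7b.ggmlv3.q8_0.bin" -p 3000:3000 --runtime=nvidia --gpus all rusty_llama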

Output from cargo leptos watch:

Loaded hyperparameters
ggml ctx size = 0.07 MB

ggml_init_cublas: found 3 CUDA devices:
  Device 0: Tesla T4
  Device 1: Tesla T4
  Device 2: Tesla T4
Loaded tensor 8/291
Loaded tensor 16/291
Loaded tensor 24/291
Loaded tensor 32/291
Loaded tensor 40/291
Loaded tensor 48/291
Loaded tensor 56/291
Loaded tensor 64/291
Loaded tensor 72/291
Loaded tensor 80/291
Loaded tensor 88/291
Loaded tensor 96/291
Loaded tensor 104/291
Loaded tensor 112/291
Loaded tensor 120/291
Loaded tensor 128/291
Loaded tensor 136/291
Loaded tensor 144/291
Loaded tensor 152/291
Loaded tensor 160/291
Loaded tensor 168/291
Loaded tensor 176/291
Loaded tensor 184/291
Loaded tensor 192/291
Loaded tensor 200/291
Loaded tensor 208/291
Loaded tensor 216/291
Loaded tensor 224/291
Loaded tensor 232/291
Loaded tensor 240/291
Loaded tensor 248/291
Loaded tensor 256/291
Loaded tensor 264/291
Loaded tensor 272/291
Loaded tensor 280/291
Loaded tensor 288/291
Loading of model complete
Model size = 6829.07 MB / num tensors = 291
[2023-08-08T14:33:00Z INFO  actix_server::builder] starting 48 workers
[2023-08-08T14:33:00Z INFO  actix_server::server] Actix runtime found; starting in Actix runtime

nvidia-smi output after loading the model:

Tue Aug  8 14:38:22 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla T4                       Off | 00000000:3B:00.0 Off |                  Off |
| N/A   46C    P0              27W /  70W |    117MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  Tesla T4                       Off | 00000000:87:00.0 Off |                  Off |
| N/A   46C    P0              27W /  70W |    117MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  Tesla T4                       Off | 00000000:AF:00.0 Off |                  Off |
| N/A   46C    P0              26W /  70W |    117MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A   3100066      C   target/server/debug/leptos_start            112MiB |
|    1   N/A  N/A   3100066      C   target/server/debug/leptos_start            112MiB |
|    2   N/A  N/A   3100066      C   target/server/debug/leptos_start            112MiB |
+---------------------------------------------------------------------------------------+

I'm curious whether there's an issue with session.infer()?
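
One guess (I haven't verified this against the llm revision pinned here): building with the cublas feature only links the CUDA kernels, and the model still has to be told to offload layers when it's loaded. Recent versions of rustformers/llm seem to expose that on ModelParameters, roughly like the sketch below, so the problem may be in the load call rather than in session.infer():

// Sketch only: field names assume a recent rustformers/llm main revision
// and may differ in the commit this repo actually pins.
let params = llm::ModelParameters {
    use_gpu: true,    // actually enable the cuBLAS backend at runtime
    gpu_layers: None, // None = offload every layer; Some(n) = offload the first n
    ..Default::default()
};
// pass `params` to llm::load / llm::load_dynamic wherever the model is created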

stappersg commented 1 year ago

@cwysong85 consider splitting this issue into merge requests and smaller issues.