MoonKraken / rusty_llama

A simple ChatGPT clone in Rust on both the frontend and backend. Uses open source language models and TailwindCSS.

Docker container running CUDA support #8

Open cwysong85 opened 1 year ago

cwysong85 commented 1 year ago

I've created a Dockerfile for CUDA support. There are also a few changes that need to be made in Cargo.toml.

Here are the changes for the Cargo.toml:

# Change the features to cublas
llm = { git = "https://github.com/rustformers/llm.git", branch = "main", optional = true, features = ["cublas"] }

# Change site-addr so the server listens on all interfaces
site-addr = "0.0.0.0:3000"
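
For context, those are the only two lines that change; in a cargo-leptos project the site-addr key lives in the [package.metadata.leptos] section, so the edited parts of Cargo.toml look roughly like this (other keys unchanged, exact surrounding entries in this repo may differ):

[dependencies]
# switch the llm crate to its cuBLAS feature
llm = { git = "https://github.com/rustformers/llm.git", branch = "main", optional = true, features = ["cublas"] }

[package.metadata.leptos]
# listen on all interfaces so the server is reachable from outside the container
site-addr = "0.0.0.0:3000"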

Here's the Dockerfile:

FROM nvidia/cuda:12.2.0-devel-ubuntu20.04

SHELL ["/bin/bash", "-ec"]

ARG DEBIAN_FRONTEND=noninteractive

WORKDIR /usr/src/app
RUN mkdir /usr/local/models

# base build tools
RUN apt-get update && apt-get install -y git build-essential curl libssl-dev pkg-config vim

# install Rust via rustup
RUN curl --proto '=https' --tlsv1.3 https://sh.rustup.rs -sSf | bash -s -- -y
ENV PATH="/root/.cargo/bin:$PATH"

# install NodeJS 20.x
RUN curl -fsSL https://deb.nodesource.com/setup_20.x | bash
RUN apt-get install -y nodejs

# put the CUDA toolchain on PATH so the cuBLAS build can find nvcc
ENV PATH="$PATH:/usr/local/cuda/bin"
ENV CUDACXX=/usr/local/cuda/bin/nvcc

COPY . .

RUN rustup toolchain install nightly
RUN rustup target add wasm32-unknown-unknown
RUN cargo install trunk cargo-leptos

# install JS dependencies and build the Tailwind stylesheet
RUN npm install

RUN npx tailwindcss -i ./input.css -o ./style/output.css

EXPOSE 3000/tcp

CMD ["cargo", "leptos", "watch"]

I'm not all that familiar with Rust, so I'm just using the cargo leptos watch command to run this. I figure you could run cargo leptos build and then start the server that way instead, but maybe you can take this and run with it.
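
If someone wants to go the release route, here's a rough sketch of what the end of the Dockerfile could look like, assuming cargo-leptos keeps its usual output layout (judging from the debug path in the nvidia-smi output below, target/server/debug/leptos_start, the release binary would presumably land under target/server/release/; the env var values just mirror the settings above):

# build frontend + server in release mode
RUN cargo leptos build --release

# runtime settings read by the leptos server (values assume the defaults used in this setup)
ENV LEPTOS_SITE_ADDR="0.0.0.0:3000"
ENV LEPTOS_SITE_ROOT="target/site"

# run the compiled binary instead of the watch loop
CMD ["./target/server/release/leptos_start"]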

Docker build:

docker build -t rusty_llama .

Docker run:

docker run -it --rm -v /usr/local/models/GGML_Models:/usr/local/models -e MODEL_PATH="/usr/local/models/nous-hermes-llama-2-7b.ggmlv3.q8_0.bin" -p 3000:3000 --runtime=nvidia --gpus all rusty_llama bash

This drops you into bash so you can run it from inside the container with cargo leptos watch. I do see the CUDA devices being found and the model loading into GPU VRAM; however, the GPU isn't being used for queries, and I still see the CPU doing the work.
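
Once that works, dropping the -it and the trailing bash should let the image's CMD run it non-interactively, e.g.:

docker run --rm -v /usr/local/models/GGML_Models:/usr/local/models -e MODEL_PATH="/usr/local/models/nous-hermes-llama-2-7b.ggmlv3.q8_0.bin" -p 3000:3000 --runtime=nvidia --gpus all rusty_llama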

Output from cargo leptos watch:

Loaded hyperparameters
ggml ctx size = 0.07 MB

ggml_init_cublas: found 3 CUDA devices:
  Device 0: Tesla T4
  Device 1: Tesla T4
  Device 2: Tesla T4
Loaded tensor 8/291
Loaded tensor 16/291
Loaded tensor 24/291
Loaded tensor 32/291
Loaded tensor 40/291
Loaded tensor 48/291
Loaded tensor 56/291
Loaded tensor 64/291
Loaded tensor 72/291
Loaded tensor 80/291
Loaded tensor 88/291
Loaded tensor 96/291
Loaded tensor 104/291
Loaded tensor 112/291
Loaded tensor 120/291
Loaded tensor 128/291
Loaded tensor 136/291
Loaded tensor 144/291
Loaded tensor 152/291
Loaded tensor 160/291
Loaded tensor 168/291
Loaded tensor 176/291
Loaded tensor 184/291
Loaded tensor 192/291
Loaded tensor 200/291
Loaded tensor 208/291
Loaded tensor 216/291
Loaded tensor 224/291
Loaded tensor 232/291
Loaded tensor 240/291
Loaded tensor 248/291
Loaded tensor 256/291
Loaded tensor 264/291
Loaded tensor 272/291
Loaded tensor 280/291
Loaded tensor 288/291
Loading of model complete
Model size = 6829.07 MB / num tensors = 291
[2023-08-08T14:33:00Z INFO  actix_server::builder] starting 48 workers
[2023-08-08T14:33:00Z INFO  actix_server::server] Actix runtime found; starting in Actix runtime

nvidia-smi output after loading the model:

Tue Aug  8 14:38:22 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla T4                       Off | 00000000:3B:00.0 Off |                  Off |
| N/A   46C    P0              27W /  70W |    117MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  Tesla T4                       Off | 00000000:87:00.0 Off |                  Off |
| N/A   46C    P0              27W /  70W |    117MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  Tesla T4                       Off | 00000000:AF:00.0 Off |                  Off |
| N/A   46C    P0              26W /  70W |    117MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A   3100066      C   target/server/debug/leptos_start            112MiB |
|    1   N/A  N/A   3100066      C   target/server/debug/leptos_start            112MiB |
|    2   N/A  N/A   3100066      C   target/server/debug/leptos_start            112MiB |
+---------------------------------------------------------------------------------------+

I'm curious whether there's an issue with session.infer()?
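
One guess (I haven't verified this against the llm revision pinned here): building with the cublas feature only links the CUDA kernels, and the model still has to be told to offload layers when it's loaded. Recent versions of rustformers/llm seem to expose that on ModelParameters, roughly like the sketch below, so the problem may be in the load call rather than in session.infer():

// Sketch only: field names assume a recent rustformers/llm main revision
// and may differ in the commit this repo actually pins.
let params = llm::ModelParameters {
    use_gpu: true,    // actually enable the cuBLAS backend at runtime
    gpu_layers: None, // None = offload every layer; Some(n) = offload the first n
    ..Default::default()
};
// pass `params` to llm::load / llm::load_dynamic wherever the model is created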

stappersg commented 1 year ago

@cwysong85 consider splitting this issue into merge requests and smaller issues.