I've created a Dockerfile for CUDA support. A few changes are also needed in the Cargo.toml.
Here are the changes for the Cargo.toml:
# Change the features to cublas
llm = { git = "https://github.com/rustformers/llm.git", branch = "main", optional = true, features = ["cublas"] }
# Change site-addr to listen on all interfaces
site-addr = "0.0.0.0:3000"
Here's the Dockerfile:
FROM nvidia/cuda:12.2.0-devel-ubuntu20.04
SHELL ["/bin/bash", "-ec"]
ARG DEBIAN_FRONTEND=noninteractive
WORKDIR /usr/src/app
RUN mkdir /usr/local/models
RUN apt-get update && apt-get install -y git build-essential curl libssl-dev pkg-config vim
RUN curl --proto '=https' --tlsv1.3 https://sh.rustup.rs -sSf | bash -s -- -y
ENV PATH="/root/.cargo/bin:$PATH"
# install NodeJS
RUN curl -fsSL https://deb.nodesource.com/setup_20.x | bash
RUN apt-get install -y nodejs
# CUDA GPU enabling cuBLAS
ENV PATH="$PATH:/usr/local/cuda/bin"
ENV CUDACXX=/usr/local/cuda/bin/nvcc
COPY . .
RUN rustup toolchain install nightly
RUN rustup target add wasm32-unknown-unknown
RUN cargo install trunk cargo-leptos
RUN source ~/.bashrc && npm install
RUN npx tailwindcss -i ./input.css -o ./style/output.css
EXPOSE 3000/tcp
CMD ["cargo", "leptos", "watch"]
I'm not all that familiar with Rust, so I'm just using the cargo leptos watch command to run this. I figure you could run cargo leptos build and then start the server that way, but maybe I can just let you take this and run with it.
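If you'd rather bake a release build into the image instead of running the dev watcher, the end of the Dockerfile could look something like this (a sketch; the exact cargo-leptos flags may differ by version):

```dockerfile
# Build the site ahead of time instead of compiling on container start
RUN cargo leptos build --release
# Serve the prebuilt site; `serve` skips the file watcher used by `watch`
CMD ["cargo", "leptos", "serve", "--release"]
```

That keeps container startup fast and avoids recompiling on every run.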
Docker build:
docker build -t rusty_llama .
Docker run:
docker run -it --rm -v /usr/local/models/GGML_Models:/usr/local/models -e MODEL_PATH="/usr/local/models/nous-hermes-llama-2-7b.ggmlv3.q8_0.bin" -p 3000:3000 --runtime=nvidia --gpus all rusty_llama bash
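For reference, the container only passes the model location through the MODEL_PATH environment variable; inside the app it would presumably be read with something like this (a sketch; the fallback path is made up for illustration and the actual handling in rusty_llama may differ):

```rust
use std::env;
use std::path::PathBuf;

/// Resolve the model file from the MODEL_PATH environment variable,
/// falling back to a hypothetical default under /usr/local/models.
fn model_path() -> PathBuf {
    env::var("MODEL_PATH")
        .map(PathBuf::from)
        .unwrap_or_else(|_| PathBuf::from("/usr/local/models/model.bin"))
}
```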
This will drop you into bash so you can run it from inside the container with cargo leptos watch. I do see the tensor cores load and the model fill my GPU VRAM; however, the GPU isn't being used for queries, and I still see the CPU doing the work.
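One thing worth checking: building with the cublas feature only links the CUDA kernels; whether layers are actually offloaded is decided when the model is loaded. If the model is loaded with default ModelParameters, inference stays on the CPU, which would match the symptom of VRAM filling up but queries running on the CPU. A hedged sketch of what I mean (field and function names are from my understanding of the llm crate's main branch; please verify against the revision you're pinned to):

```rust
// Assumed llm-crate API: ModelParameters has `use_gpu` / `gpu_layers`
// fields when built with the `cublas` feature.
let params = llm::ModelParameters {
    use_gpu: true,         // enable cuBLAS offload for inference
    gpu_layers: Some(32),  // number of layers to push to VRAM
    ..Default::default()
};
let model = llm::load_dynamic(
    Some(llm::ModelArchitecture::Llama),
    std::path::Path::new(&model_path),
    llm::TokenizerSource::Embedded,
    params,
    llm::load_progress_callback_stdout,
)?;
```

If the load path already sets use_gpu, the problem is less likely to be in session.infer() itself, since a session runs on whatever backend the model was loaded with.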
Output from cargo leptos watch:

nvidia-smi output after loading the model:
I'm curious whether there's an issue with session.infer()?