getumbrel / llama-gpt

A self-hosted, offline, ChatGPT-like chatbot. Powered by Llama 2. 100% private, with no data leaving your device. New: Code Llama support!
https://apps.umbrel.com/app/llama-gpt
MIT License

GPU Usage #115

Closed zchryr closed 8 months ago

zchryr commented 9 months ago

Hello,

First off, I'd just like to say this project is absolutely fantastic.

I'm having a bit of trouble getting the GPU to be used. I have a 2080 Super, and I can see it with nvidia-smi inside the container once it's up and running. However, I never see any processes utilizing the GPU; the CPU just spikes to 100% after I ask the AI a question.
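
For anyone debugging the same symptoms: a quick way to tell whether the server binary was actually compiled with GPU support is to look for the BLAS flag in llama.cpp's startup banner, and to snapshot nvidia-smi from inside the container while a prompt is running. A sketch, assuming the service name from the compose file below:

# "BLAS = 1" in the system_info line means the binary has cuBLAS;
# "BLAS = 0" means it will run on the CPU regardless of any GPU settings.
docker compose -f docker-compose-cuda-gguf.yml logs llama-gpt-api-cuda-gguf | grep -i blas

# Snapshot GPU memory and utilization inside the container mid-inference.
docker compose -f docker-compose-cuda-gguf.yml exec llama-gpt-api-cuda-gguf nvidia-smi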

Here is my docker-compose-cuda-gguf.yml

version: '3.6'

services:
  llama-gpt-api-cuda-gguf:
    image: ghcr.io/abetlen/llama-cpp-python:latest 
    # build:
    #   context: ./cuda
    #   dockerfile: gguf.Dockerfile
    restart: on-failure
    volumes:
      - './models:/models'
      - './cuda:/cuda'
    ports:
      - 3001:8000
    environment:
      MODEL: '/models/${MODEL_NAME:-code-llama-2-13b-chat.gguf}'
      MODEL_DOWNLOAD_URL: '${MODEL_DOWNLOAD_URL:-https://huggingface.co/TheBloke/CodeLlama-13B-Instruct-GGUF/resolve/main/codellama-13b-instruct.Q4_K_M.gguf}'
      N_GQA: '${N_GQA:-1}'
      USE_MLOCK: 1
    cap_add:
      - IPC_LOCK
      - SYS_RESOURCE
    command: '/bin/sh /cuda/run.sh'
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

  llama-gpt-ui:
    # TODO: Use this image instead of building from source after the next release
    image: 'ghcr.io/getumbrel/llama-gpt-ui:latest'
    # build:
    #   context: ./ui
    #   dockerfile: Dockerfile
    ports:
      - 3000:3000
    restart: on-failure
    environment:
      - 'OPENAI_API_KEY=sk-XXXXXXXXXXXXXXXXXXXX'
      - 'OPENAI_API_HOST=http://llama-gpt-api-cuda-gguf:8000'
      - 'DEFAULT_MODEL=/models/${MODEL_NAME:-code-llama-2-13b-chat.gguf}'
      - 'NEXT_PUBLIC_DEFAULT_SYSTEM_PROMPT=${DEFAULT_SYSTEM_PROMPT:-"You are a helpful and friendly AI assistant. Respond very concisely."}'
      - 'WAIT_HOSTS=llama-gpt-api-cuda-gguf:8000'
      - 'WAIT_TIMEOUT=${WAIT_TIMEOUT:-3600}'
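
One thing worth double-checking in the file above: the build: section pointing at cuda/gguf.Dockerfile is commented out in favor of the prebuilt ghcr.io/abetlen/llama-cpp-python:latest image. If that tag was built without cuBLAS, inference would silently fall back to the CPU even though nvidia-smi works inside the container, which would match the symptoms described. A sketch of restoring the local CUDA build (only the changed lines shown):

services:
  llama-gpt-api-cuda-gguf:
    # Build the CUDA-enabled image locally from the repo's gguf.Dockerfile
    # instead of pulling the generic prebuilt image, which may lack cuBLAS.
    build:
      context: ./cuda
      dockerfile: gguf.Dockerfile
    # ...rest of the service definition unchanged...
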
zchryr commented 9 months ago

I've also tried setting n_gpu_layers to both 50 and 100, with no noticeable difference.
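
For reference, since /cuda/run.sh launches llama_cpp.server, which reads its settings from environment variables (the way MODEL is already passed in the compose file), the layer count can presumably be set the same way. The N_GPU_LAYERS name below is an assumption based on that convention, so it's worth confirming against run.sh:

    environment:
      # llama_cpp.server should map N_GPU_LAYERS to its n_gpu_layers setting.
      # A 13B model has ~40 layers, so any value at or above that offloads
      # everything; if 50 vs. 100 changes nothing and the CPU is still
      # pegged, the build itself most likely lacks cuBLAS.
      N_GPU_LAYERS: '${N_GPU_LAYERS:-100}'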

ProgrammingLife commented 4 months ago

Have you solved this issue?