SamurAIGPT / EmbedAI

An app to interact privately with your documents using the power of GPT, 100% privately, no data leaks
https://www.thesamur.ai/?utm_source=github&utm_medium=link&utm_campaign=github_privategpt

How to use GPU instead of CPU #7

Open mnofrizal opened 1 year ago

mnofrizal commented 1 year ago

Can we use the GPU to get responses faster than with the CPU?

Anil-matcha commented 1 year ago

GPT4All doesn't support GPU acceleration. Will add support for models like Llama, which can do this.

bradsec commented 1 year ago

I was able to get the GPU working with this Llama model, ggml-vic13b-q5_1.bin, using a manual workaround.

# Download the ggml-vic13b-q5_1.bin model and place it in privateGPT/server/models/
# Edit privateGPT.py: comment out the GPT4All model and add the LlamaCpp model as shown below
# Adjust n_gpu_layers to suit your Nvidia GPU (the max for this model is 40 layers). Offloading all 40 uses about 9GB of VRAM.

def load_model():
    filename = 'ggml-vic13b-q5_1.bin'  # Name of the downloaded model file
    models_folder = 'models'  # Folder inside the Flask app root (privateGPT/server/models)
    file_path = f'{models_folder}/{filename}'
    if os.path.exists(file_path):
        global llm
        callbacks = [StreamingStdOutCallbackHandler()]
        # model_path and model_n_ctx come from the .env settings shown below (MODEL_PATH, MODEL_N_CTX)
        #llm = GPT4All(model=model_path, n_ctx=model_n_ctx, backend='gptj', callbacks=callbacks, verbose=False)
        llm = LlamaCpp(model_path=model_path, n_ctx=model_n_ctx, n_gpu_layers=40, callbacks=callbacks, verbose=False)
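
For context, model_path and model_n_ctx are not defined inside load_model(); below is a minimal sketch of how they could be read from the server/.env shown next using python-dotenv. This is an assumption about how privateGPT.py loads its settings, not a verbatim excerpt:

# Sketch (assumption): read MODEL_PATH and MODEL_N_CTX from privateGPT/server/.env
import os
from dotenv import load_dotenv

load_dotenv()
model_path = os.environ.get('MODEL_PATH')               # e.g. models/ggml-vic13b-q5_1.bin
model_n_ctx = int(os.environ.get('MODEL_N_CTX', 1000))  # context window size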

# Edit privateGPT/server/.env

# Update .env as follows
PERSIST_DIRECTORY=db
MODEL_TYPE=LlamaCpp
MODEL_PATH=models/ggml-vic13b-q5_1.bin
EMBEDDINGS_MODEL_NAME=all-MiniLM-L6-v2
MODEL_N_CTX=1000
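
The .env above also sets MODEL_TYPE=LlamaCpp. In this workaround the model line in load_model() is swapped by hand, but if you want the backend to follow the .env instead, a switch could look like the following sketch. The branch on MODEL_TYPE is my assumption rather than confirmed EmbedAI code; model_path and model_n_ctx are the values loaded from the .env as above:

# Hypothetical sketch: pick the backend from MODEL_TYPE instead of editing code by hand
import os
from langchain.llms import GPT4All, LlamaCpp
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler

model_type = os.environ.get('MODEL_TYPE', 'LlamaCpp')
callbacks = [StreamingStdOutCallbackHandler()]

if model_type == 'LlamaCpp':
    llm = LlamaCpp(model_path=model_path, n_ctx=model_n_ctx, n_gpu_layers=40,
                   callbacks=callbacks, verbose=False)
elif model_type == 'GPT4All':
    llm = GPT4All(model=model_path, n_ctx=model_n_ctx, backend='gptj',
                  callbacks=callbacks, verbose=False)
else:
    raise ValueError(f'Unsupported MODEL_TYPE: {model_type}')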

# If using a conda environment
conda install -c "nvidia/label/cuda-12.1.1" cuda-toolkit

# Remove and reinstall llama-cpp-python with the following environment variables set
# Linux uses "export" to set environment variables; on Windows use "set" instead

pip uninstall llama-cpp-python
export CMAKE_ARGS="-DLLAMA_CUBLAS=on"
export FORCE_CMAKE=1
pip install llama-cpp-python --no-cache-dir
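
Before restarting the server, you can sanity-check the rebuilt wheel directly with llama-cpp-python. This is a throwaway script (the model path, n_ctx and prompt are just placeholders matching the setup above); if the cuBLAS build worked, the load output printed to stderr includes the "offloading ... layers to GPU" lines:

# Throwaway sanity check: confirm llama-cpp-python was rebuilt with cuBLAS support
from llama_cpp import Llama

llm = Llama(model_path='models/ggml-vic13b-q5_1.bin', n_ctx=1000, n_gpu_layers=40)
result = llm('Q: What is the capital of France? A:', max_tokens=16)
print(result['choices'][0]['text'])
# A CUDA build prints "offloading 40 layers to GPU" and a VRAM total during model load;
# a build without cuBLAS typically shows BLAS = 0 in the feature-flags line instead.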

Run python privateGPT.py from the privateGPT/server/ directory. You should see the following lines in the output as the model loads:

llama_model_load_internal: [cublas] offloading 40 layers to GPU
llama_model_load_internal: [cublas] total VRAM used: 9076 MB
jackyoung022 commented 1 year ago

Hi, thanks for your info. But when I was following your steps on Windows, I got this error: Could not load Llama model from path: D:/code/privateGPT/server/models/ggml-vic13b-q5_1.bin. Received error (type=value_error). Any idea about this? Thanks.

MyraBaba commented 1 year ago

@bradsec

Hi,

I followed the instructions, but it looks like it is still using the CPU:

(venPrivateGPT) (base) alp2080@alp2080:~/data/dProjects/privateGPT/server$ python privateGPT.py
/data/dProjects/privateGPT/server/privateGPT.py:1: DeprecationWarning: 'flask.Markup' is deprecated and will be removed in Flask 2.4. Import 'markupsafe.Markup' instead.
  from flask import Flask,jsonify, render_template, flash, redirect, url_for, Markup, request
llama.cpp: loading model from models/ggml-vic13b-q5_1.bin
llama_model_load_internal: format = ggjt v2 (pre #1508)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 1000
llama_model_load_internal: n_embd = 5120
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 40
llama_model_load_internal: n_layer = 40
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 9 (mostly Q5_1)
llama_model_load_internal: n_ff = 13824
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size = 0.09 MB
llama_model_load_internal: mem required = 11359.05 MB (+ 1608.00 MB per state)
llama_new_context_with_model: kv self size = 781.25 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
LLM0 LlamaCpp Params: {'model_path': 'models/ggml-vic13b-q5_1.bin', 'suffix': None, 'max_tokens': 256, 'temperature': 0.8, 'top_p': 0.95, 'logprobs': None, 'echo': False, 'stop_sequences': [], 'repeat_penalty': 1.1, 'top_k': 40}

  • Serving Flask app 'privateGPT'
  • Debug mode: off
WARNING: This is a development server. Do not use it in a production deployment. Use a production WSGI server instead.
  • Running on all addresses (0.0.0.0)
  • Running on http://127.0.0.1:5000
  • Running on http://192.168.5.110:5000
Press CTRL+C to quit
Loading documents from source_documents
Musty1 commented 1 year ago

I tried this as well and it looks like it's still using the CPU... interesting. If anyone can suggest why it's not working with the GPU, please let me know.