PromtEngineer / localGPT

Chat with your documents on your local device using GPT models. No data leaves your device, and it is 100% private.
Apache License 2.0

Windows / CUDA multiple issues #287

Open andypotato opened 1 year ago

andypotato commented 1 year ago

I am running into multiple errors when trying to get localGPT to run on my Windows 11 / CUDA machine (3060 / 12 GB). Here is what I did so far:

Using this installation, I could run ingest.py and it built the index without any issues. So far, so good!

I then changed model_id and model_basename as follows:

    model_id="TheBloke/Llama-2-13B-chat-GPTQ"
    model_basename = "gptq_model-4bit-128g.safetensors"

(I use this model in Oobabooga without any issues; it easily fits on a 12 GB card.)

Finally, I try to run run_localGPT.py, and this is where the trouble starts:

(localgpt) D:\LocalGPT\localGPT>python run_localGPT.py

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
binary_path: C:\Users\as\AppData\Local\Programs\Python\Python310\lib\site-packages\bitsandbytes\cuda_setup\libbitsandbytes_cuda116.dll
CUDA SETUP: Loading binary C:\Users\as\AppData\Local\Programs\Python\Python310\lib\site-packages\bitsandbytes\cuda_setup\libbitsandbytes_cuda116.dll...
2023-07-26 13:31:26,742 - INFO - run_localGPT.py:176 - Running on: cuda
2023-07-26 13:31:26,742 - INFO - run_localGPT.py:177 - Display Source Documents set to: False
2023-07-26 13:31:27,040 - INFO - SentenceTransformer.py:66 - Load pretrained SentenceTransformer: hkunlp/instructor-large
load INSTRUCTOR_Transformer
max_seq_length  512
2023-07-26 13:31:28,900 - INFO - __init__.py:88 - Running Chroma using direct local API.
2023-07-26 13:31:28,912 - WARNING - __init__.py:43 - Using embedded DuckDB with persistence: data will be stored in: D:\LocalGPT\localGPT/DB
2023-07-26 13:31:28,922 - INFO - ctypes.py:22 - Successfully imported ClickHouse Connect C data optimizations
2023-07-26 13:31:28,934 - INFO - json_impl.py:45 - Using orjson library for writing JSON byte strings
2023-07-26 13:31:28,968 - INFO - duckdb.py:460 - loaded in 72 embeddings
2023-07-26 13:31:28,970 - INFO - duckdb.py:472 - loaded in 1 collections
2023-07-26 13:31:28,971 - INFO - duckdb.py:89 - collection with name langchain already exists, returning existing collection
2023-07-26 13:31:28,971 - INFO - run_localGPT.py:43 - Loading Model: TheBloke/Llama-2-13B-chat-GPTQ, on: cuda
2023-07-26 13:31:28,971 - INFO - run_localGPT.py:44 - This action can take a few minutes!
2023-07-26 13:31:28,972 - INFO - run_localGPT.py:64 - Using AutoGPTQForCausalLM for quantized models
2023-07-26 13:31:29,300 - INFO - run_localGPT.py:71 - Tokenizer loaded
2023-07-26 13:31:30,689 - INFO - _base.py:727 - lm_head not been quantized, will be ignored when make_quant.

So far no (obvious) issues, however:

2023-07-26 13:31:30,690 - WARNING - qlinear_old.py:16 - CUDA extension not installed.

CUDA extension not installed?

2023-07-26 13:31:31,590 - WARNING - modeling.py:1093 - The safetensors archive passed at C:\Users\as/.cache\huggingface\hub\models--TheBloke--Llama-2-13B-chat-GPTQ\snapshots\01bfd1c28783056bf8817b6d487f0efbbabe1804\gptq_model-4bit-128g.safetensors does not contain metadata. Make sure to save your model with the `save_pretrained` method. Defaulting to 'pt' metadata.

This doesn't sound good?

2023-07-26 13:31:37,516 - WARNING - fused_llama_mlp.py:306 - skip module injection for FusedLlamaMLPForQuantizedModel not support integrate without triton yet.
The model 'LlamaGPTQForCausalLM' is not supported for text-generation. Supported models are ['BartForCausalLM', 'BertLMHeadModel', 'BertGenerationDecoder', 'BigBirdForCausalLM', 'BigBirdPegasusForCausalLM', 'BioGptForCausalLM', 'BlenderbotForCausalLM', 'BlenderbotSmallForCausalLM', 'BloomForCausalLM', 'CamembertForCausalLM', 'CodeGenForCausalLM', 'CpmAntForCausalLM', 'CTRLLMHeadModel', 'Data2VecTextForCausalLM', 'ElectraForCausalLM', 'ErnieForCausalLM', 'GitForCausalLM', 'GPT2LMHeadModel', 'GPT2LMHeadModel', 'GPTBigCodeForCausalLM', 'GPTNeoForCausalLM', 'GPTNeoXForCausalLM', 'GPTNeoXJapaneseForCausalLM', 'GPTJForCausalLM', 'LlamaForCausalLM', 'MarianForCausalLM', 'MBartForCausalLM', 'MegaForCausalLM', 'MegatronBertForCausalLM', 'MvpForCausalLM', 'OpenAIGPTLMHeadModel', 'OPTForCausalLM', 'PegasusForCausalLM', 'PLBartForCausalLM', 'ProphetNetForCausalLM', 'QDQBertLMHeadModel', 'ReformerModelWithLMHead', 'RemBertForCausalLM', 'RobertaForCausalLM', 'RobertaPreLayerNormForCausalLM', 'RoCBertForCausalLM', 'RoFormerForCausalLM', 'Speech2Text2ForCausalLM', 'TransfoXLLMHeadModel', 'TrOCRForCausalLM', 'XGLMForCausalLM', 'XLMWithLMHeadModel', 'XLMProphetNetForCausalLM', 'XLMRobertaForCausalLM', 'XLMRobertaXLForCausalLM', 'XLNetLMHeadModel', 'XmodForCausalLM'].

Uh oh?

But then finally the prompt appears:

2023-07-26 13:31:37,809 - INFO - run_localGPT.py:123 - Local LLM Loaded

Enter a query:

After entering a prompt, my GPU usage goes to 100% but no output is ever produced. I waited for about 5 minutes, but it stayed stuck at high GPU usage and I had to abort the script.

Any ideas what's going on here? I'll be happy to help debug, but I currently don't know where to start.

PromtEngineer commented 1 year ago

@andypotato In order to debug, I would recommend starting with a 7B model first and seeing if you get any response. Also, can you try the GGML format?
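
For example, switching to the 7B GPTQ release would only mean changing model_id in run_localGPT.py; the basename below is the one TheBloke ships for this model family (a sketch, worth double-checking against the model card):

    model_id = "TheBloke/Llama-2-7b-Chat-GPTQ"
    model_basename = "gptq_model-4bit-128g.safetensors"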

andypotato commented 1 year ago

@PromtEngineer I did two more tests as you suggested:

1) I tried the 7B Llama GPTQ model and received the same debug output as above.

I tried the prompt "What is the role of the senate" and saw the same 100% GPU usage for about 2 minutes, but this time I eventually got a good response. When I repeated the test, it produced this output:

Enter a query: what is the role of the senate?

> Question:
what is the role of the senate?

> Answer:
simp⋅"+...]AGE donner(| ///AGE quelquesAGE donner∆AGE donner"+...]"+...]"+...]AGE donnerAGE donnerAGE donner"+...]AGE donnerAGE donnerAGE donnerAGE donnersimp⋅simp⋅"+...]"+...]"+...]"+...]"+...]"+...]"+...]"+...]"+...]"+...]"+...]"+...]AGE donner"+...]"+...]AGE donner"+...]AGE donnerAGE donnerAGE donnerOb whilst labour←AGE donnerAGE donnerAGE donnerAGE donnerAGE donnerAGE donner~~ premiers\< PhogenericlistaAGE donnerAGE donner"+...]AGE donnerAGE donnerAGE donnerAGE donnerAGE donnerAGE donnerAGE donner"+...]"+...]AGE donner"+...]AGE donnerAGE donnerAGE donnerAGE donnerAGE donnerFileName ();anon ="AGE donnerAGE donnerAGE donnerAGE donnerAGE donnerAGE donnerAGE donnerAGE donnerAGE donnerAGE donnerAGE donnerAGE donnerAGE donnerAGE donnerAGE donnerAGE donnerAGE donnerAGE donnerAGE donnerAGE donnerAGE donnerAGE donnerAGE donnerAGE donnerAGE donnerAGE donnerAGE donnerAGE donnerAGE donnerAGE donnerAGE donnerAGE donnerAGE donnerAGE donnerAGE donnerAGE donnerAGE donnerAGE donnerAGE donnerAGE donnerAGE donnerAGE donnerAGE donnerAGE donner"+...]"+...]AGE donnerAGE donner"+...]"+...]"+...]"+...]"+...]"+ [lots more gibberish]

2) I tried Llama-2-7B-Chat-GGML (llama-2-7b-chat.ggmlv3.q4_0.bin); this time I used device_type = cpu to start run_localGPT.py.
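
For reference, the corresponding settings in run_localGPT.py for this GGML test would look roughly like this (a sketch based on the file name above; a .ggml basename should make run_localGPT.py use its llama.cpp-based loader rather than AutoGPTQ):

    model_id = "TheBloke/Llama-2-7B-Chat-GGML"
    model_basename = "llama-2-7b-chat.ggmlv3.q4_0.bin"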

The same prompt, "What is the role of the senate", took about 2 minutes to generate a response. This time the CPU usage stayed at a roughly constant 90%.

Enter a query: what is the role of the senate

llama_print_timings:        load time =   921.44 ms
llama_print_timings:      sample time =    42.16 ms /   204 runs   (    0.21 ms per token,  4838.14 tokens per second)
llama_print_timings: prompt eval time = 99406.86 ms /  1176 tokens (   84.53 ms per token,    11.83 tokens per second)
llama_print_timings:        eval time = 37642.01 ms /   203 runs   (  185.43 ms per token,     5.39 tokens per second)
llama_print_timings:       total time = 137653.02 ms

Now I understand that CPU inference is slow, but is that the performance I should expect when using a GPU? The 3060 isn't the fastest GPU, but it will usually generate around 15-20 tokens/s with a 4-bit quantized 13B-parameter model.

andypotato commented 1 year ago

The installation instructions do not work properly for Windows / CUDA systems. The following process will work:

Ensure you have the Nvidia CUDA runtime version 11.8 installed

nvcc --version

Should report a CUDA version of 11.8

Create the virtual environment using conda

conda create -n localGPT -y
conda activate localGPT
conda install python=3.10 -c conda-forge -y

Verify your Python installation

python --version

Should output Python 3.10.x

Install the CUDA toolkit

conda install cudatoolkit=11.7 -c conda-forge -y
set CUDA_HOME=%CONDA_PREFIX%

Install localGPT

git clone https://github.com/PromtEngineer/localGPT.git
cd localGPT
pip install -r requirements.txt --extra-index-url https://download.pytorch.org/whl/cu117
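
After the requirements install, it is worth confirming that the CUDA-enabled PyTorch build from the cu117 index was actually picked up before going further; a minimal check, not part of the original instructions:

    import torch

    print(torch.__version__)              # should end in +cu117
    print(torch.version.cuda)             # should report 11.7
    print(torch.cuda.is_available())      # should print True
    print(torch.cuda.get_device_name(0))  # e.g. the RTX 3060 from the original report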

Replace Bitsandbytes

pip uninstall bitsandbytes-windows -y
pip install https://github.com/jllllll/bitsandbytes-windows-webui/releases/download/wheels/bitsandbytes-0.40.1.post1-py3-none-win_amd64.whl
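
As the log at the top of this issue shows, bitsandbytes prints its CUDA setup (including which DLL it loaded) on import, so a bare import is enough to confirm the replacement wheel is the one being used; a quick check:

    import bitsandbytes  # importing should print the CUDA SETUP banner and the path of the loaded libbitsandbytes DLL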

Replace AutoGPTQ

pip install https://github.com/PanQiWei/AutoGPTQ/releases/download/v0.3.0/auto_gptq-0.3.0+cu118-cp310-cp310-win_amd64.whl
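
To confirm the wheel installed cleanly, the loader class that run_localGPT.py uses for quantized models should now import without errors; whether the GPTQ CUDA kernels are actually available only shows up later, when loading the model no longer prints the "CUDA extension not installed." warning seen above. A minimal import check:

    from auto_gptq import AutoGPTQForCausalLM  # the class run_localGPT.py logs as "Using AutoGPTQForCausalLM for quantized models"
    print("AutoGPTQ import OK")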

Configure localGPT to use GPTQ model

Open run_localGPT.py and configure the model_id and model_basename

    model_id = "TheBloke/Llama-2-7b-Chat-GPTQ"
    model_basename = "gptq_model-4bit-128g.safetensors"

Now you should be able to run ingest.py and also run the localGPT inference.

lafintiger commented 1 year ago

The above install worked for me flawlessly. The key was having the right Python (3.10) and the right CUDA (11.7). On my base system I have CUDA 12.2 and Python 3.11; the conda environment set up as stated above made it work.

Thanks!!

frenchiveruti commented 1 year ago

@PromtEngineer Just copy-paste this into the Readme.md, because these steps are the only ones that work.

By the way @andypotato, ingest.py still runs on the CPU; is there a way around it?

alienatedsec commented 1 year ago

Firstly, thank you for this Windows guide to CUDA acceleration, @andypotato - you saved me hours, and it took seconds to ingest the sample file. You can see the GPU memory and GPU processor usage increase below; it does take a while before the GPU gets involved (I use an external drive as there is no space on my local SSDs). In comparison, it took minutes to do the same operation using the CPU.

Ensure you have the Nvidia CUDA runtime version 11.8 installed

I would like to highlight that it also works with CUDA version 12.2 - you can see it below.

[screenshot: ApplicationFrameHost_Gt3dOLnUc4]