andypotato opened this issue 1 year ago
I am running into multiple errors when trying to get localGPT to run on my Windows 11 / CUDA machine (3060 / 12 GB). Here is what I did so far:

Using this installation, I could run ingest.py and it built the index without any issue. So far so good!

I then changed model_id and model_basename as follows:
(I use this model in Oobabooga without any issues, it will easily fit on a 12 GB card)

Finally I try to run run_localGPT.py, and this is where the trouble starts. So far no (obvious) issues, however:

CUDA extension not installed.

This doesn't sound good?

Uh oh?

But then finally the prompt appears. After entering a prompt, my GPU usage goes to 100%, but no output is ever produced. I waited for about 5 minutes, but it was just stuck at the high GPU usage and I had to abort the script.

Any ideas what's going on here? I'll be happy to help debug, but I currently don't know where to start.
@andypotato In order to debug, I would recommend starting with a 7B model first and seeing if you get any response. Also, can you try the ggml format?
@PromtEngineer I did two more tests as you suggested:
1) I tried the 7B Llama GPTQ model and received the same debug output as above.
I tried the prompt "What is the role of the senate" and saw the same 100% GPU usage for about 2 minutes, but this time eventually got a good response. When I repeated the test, it produced this output:
Enter a query: what is the role of the senate?
> Question:
what is the role of the senate?
> Answer:
simp⋅"+...]AGE donner(| ///AGE quelquesAGE donner∆AGE donner"+...]"+...]"+...]AGE donnerAGE donnerAGE donner"+...]AGE donnerAGE donnerAGE donnerAGE donnersimp⋅simp⋅"+...]"+...]"+...] [lots more gibberish]
2) I tried Llama-2-7B-Chat-GGML (llama-2-7b-chat.ggmlv3.q4_0.bin), this time starting run_localGPT.py with device_type = cpu.
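For anyone reproducing this: assuming the standard localGPT CLI, where --device_type selects the inference device, the invocation would be:

python run_localGPT.py --device_type cpu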
The same prompt ("What is the role of the senate") took about 2 minutes to generate a response. This time the CPU usage was constant at around 90%.
Enter a query: what is the role of the senate
llama_print_timings: load time = 921.44 ms
llama_print_timings: sample time = 42.16 ms / 204 runs ( 0.21 ms per token, 4838.14 tokens per second)
llama_print_timings: prompt eval time = 99406.86 ms / 1176 tokens ( 84.53 ms per token, 11.83 tokens per second)
llama_print_timings: eval time = 37642.01 ms / 203 runs ( 185.43 ms per token, 5.39 tokens per second)
llama_print_timings: total time = 137653.02 ms
Now I understand that CPU inference is slow, but is that the performance I should expect when using a GPU? The 3060 isn't the fastest GPU, but it will usually generate around 15-20 tokens/s with a 4-bit quantized 13B parameter model.
The installation instructions do not work properly for Windows / CUDA systems. The following process will work:
Ensure you have the Nvidia CUDA runtime version 11.8 installed
nvcc --version
Should report a CUDA version of 11.8
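If in doubt, the final line of the nvcc --version output should look roughly like the following (the exact build number will vary):

Cuda compilation tools, release 11.8, V11.8.89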
Create the virtual environment using conda
conda create -n localGPT -y
conda activate localGPT
conda install python=3.10 -c conda-forge -y
Verify your Python installation
python --version
Should output Python 3.10.x
Install the CUDA toolkit
conda install cudatoolkit=11.7 -c conda-forge -y
set CUDA_HOME=%CONDA_PREFIX%
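To confirm the variable points at the conda environment (note that set only lasts for the current shell session, so re-run it in every new terminal):

echo %CUDA_HOME%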
Install localGPT
git clone https://github.com/PromtEngineer/localGPT.git
cd localGPT
pip install -r requirements.txt --extra-index-url https://download.pytorch.org/whl/cu117
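Before going further, it is worth verifying that the cu117 PyTorch build actually sees the GPU; torch.cuda.is_available() and torch.version.cuda are standard PyTorch attributes:

python -c "import torch; print(torch.cuda.is_available(), torch.version.cuda)"

This should print True 11.7. If it prints False, the CPU-only wheel was installed.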
Replace Bitsandbytes
pip uninstall bitsandbytes-windows -y
pip install https://github.com/jllllll/bitsandbytes-windows-webui/releases/download/wheels/bitsandbytes-0.40.1.post1-py3-none-win_amd64.whl
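A quick sanity check that the replacement wheel loads (a broken install will usually raise an error or print a CUDA setup warning on import):

python -c "import bitsandbytes"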
Replace AutoGPTQ
pip install https://github.com/PanQiWei/AutoGPTQ/releases/download/v0.3.0/auto_gptq-0.3.0+cu118-cp310-cp310-win_amd64.whl
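Same idea for AutoGPTQ; if the cu118 wheel does not match your torch build, the import itself typically fails:

python -c "from auto_gptq import AutoGPTQForCausalLM; print('auto_gptq OK')"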
Configure localGPT to use GPTQ model
Open run_localGPT.py and configure the model_id and model_basename:
model_id = "TheBloke/Llama-2-7b-Chat-GPTQ"
model_basename = "gptq_model-4bit-128g.safetensors"
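For context, here is a minimal sketch of what the loading code has to do with those two values under auto-gptq 0.3.0 (the actual wrapper inside run_localGPT.py may differ). One gotcha: from_quantized() expects the basename without the .safetensors extension, since it appends the extension itself when use_safetensors=True:

from auto_gptq import AutoGPTQForCausalLM
from transformers import AutoTokenizer

model_id = "TheBloke/Llama-2-7b-Chat-GPTQ"
# extension stripped; auto-gptq adds ".safetensors" itself
model_basename = "gptq_model-4bit-128g"

tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
model = AutoGPTQForCausalLM.from_quantized(
    model_id,
    model_basename=model_basename,
    use_safetensors=True,
    device="cuda:0",
)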
Now you should be able to run ingest.py and also run the localGPT inference.
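Assuming the standard --device_type flag on both scripts, that means:

python ingest.py --device_type cuda
python run_localGPT.py --device_type cuda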
The above install worked for me flawlessly. The key was having the right Python (3.10) and the right CUDA (11.7). My base system has CUDA 12.2 and Python 3.11; the conda environment as described above made it work.
Thanks!!
@PromtEngineer Just copy-paste this into the Readme.md, because these are the only steps that work.
By the way @andypotato, ingest.py still runs on the CPU. Is there a way around it?
Firstly, thank you @andypotato for this Windows guide to CUDA acceleration - you saved me hours, and it took only seconds to ingest the sample file. You can see the GPU memory and GPU processor usage increase below; it does take a while before the GPU gets involved (I use an external drive, as there is no space on my local SSDs). In comparison, the same operation took minutes on the CPU.
> Ensure you have the Nvidia CUDA runtime version 11.8 installed

I would like to highlight that it also works with CUDA version 12.2, as you can see below.