PromtEngineer / localGPT

Chat with your documents on your local device using GPT models. No data leaves your device, and it is 100% private.
Apache License 2.0

BLAS = 0 Always #520

Open erswelljustin opened 9 months ago

erswelljustin commented 9 months ago

Hi @PromtEngineer

I have followed the README instructions and also watched your latest YouTube video, but even if I set --device_type to cuda manually when running run_localGPT.py or run_localGPT_API.py, the BLAS value is always shown as BLAS = 0.

I am running Ubuntu 22.04 and an NVIDIA RTX 4080. This is my lspci output for reference:

        VGA compatible controller: NVIDIA Corporation Device 2704 (rev a1) (prog-if 00 [VGA controller])
    Subsystem: Micro-Star International Co., Ltd. [MSI] Device 5112
    Flags: bus master, fast devsel, latency 0, IRQ 164
    Memory at 80000000 (32-bit, non-prefetchable) [size=16M]
    Memory at 4000000000 (64-bit, prefetchable) [size=16G]
    Memory at 4400000000 (64-bit, prefetchable) [size=32M]
    I/O ports at 4000 [size=128]
    Expansion ROM at 81000000 [virtual] [disabled] [size=512K]
    Capabilities: <access denied>
    Kernel driver in use: nvidia
    Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia

I am using the following model settings in constants.py:

MODEL_ID = "TheBloke/Llama-2-7b-Chat-GGUF"
MODEL_BASENAME = "llama-2-7b-chat.Q4_K_M.gguf"

Can you advise? Currently it runs off the CPU, and ideally I'd like it to run on the very capable GPU.

Thanks!

thebetauser commented 9 months ago

GGUF (formerly GGML) is only for CPU. If you are using CUDA, you need the GPTQ models.

rcantada commented 9 months ago

In my experience on Ubuntu 22.04, BLAS = 0 happened when my build of llama-cpp-python failed to find my CUDA toolkit installation (including cublas.h) in an Anaconda environment. I used the --verbose flag to see the build logs:

CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python==0.1.83 --no-cache-dir --verbose
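To confirm the rebuild actually picked up cuBLAS, a minimal sketch (my addition, not from this thread) is to load the GGUF file mentioned in the issue with verbose logging and some layers offloaded; the model path and layer count are assumptions, so adjust them to your setup. On a cuBLAS build the startup log should report BLAS = 1.

    # verify_cublas.py -- check that llama-cpp-python was built with GPU (cuBLAS) support
    from llama_cpp import Llama

    llm = Llama(
        model_path="./models/llama-2-7b-chat.Q4_K_M.gguf",  # hypothetical local path to the GGUF file
        n_gpu_layers=32,  # offload layers to the GPU; 0 keeps everything on the CPU
        verbose=True,     # the startup log should contain "BLAS = 1" on a cuBLAS build
    )
    out = llm("Q: What is 2 + 2? A:", max_tokens=8)
    print(out["choices"][0]["text"])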

PromtEngineer commented 9 months ago

@erswelljustin As mentioned above, GGUF is a great option if you are running localGPT on Apple silicon or CPU. If you have access to an NVIDIA GPU, I would recommend using GPTQ models. Also check that you have PyTorch installed and have access to CUDA. In the same virtual env, open Python and run this code:

    import torch
    print(torch.cuda.is_available())
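A slightly fuller check (my addition, not part of the original comment) also shows whether the installed PyTorch build includes CUDA at all, which distinguishes a driver problem from a CPU-only wheel:

    import torch

    print(torch.__version__)          # installed PyTorch version
    print(torch.version.cuda)         # None means a CPU-only PyTorch build
    print(torch.cuda.is_available())  # True means the GPU is usable
    if torch.cuda.is_available():
        print(torch.cuda.get_device_name(0))  # e.g. the RTX 4080 from this issue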

N1h1lv5 commented 9 months ago

This allowed me to use the CPU and GPU simultaneously with GGUF, on Windows:

Set the environment variables properly:

    $Env:CMAKE_ARGS="-DLLAMA_CUBLAS=on"
    $Env:FORCE_CMAKE="1"

Check that they are set:

    echo $Env:CMAKE_ARGS

Uninstall the previous version of llama-cpp-python:

    pip uninstall llama-cpp-python

Install the proper version:

    pip install llama-cpp-python==0.1.83 --no-cache-dir

@erswelljustin I would say, check your llama-cpp-python version.
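To see which version is actually installed in the active environment, a quick check (my addition) is either pip show llama-cpp-python or, from Python:

    # print the installed llama-cpp-python version in the current environment
    import llama_cpp
    print(llama_cpp.__version__)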

erswelljustin commented 9 months ago

Thanks all for your help, I will report back.

erswelljustin commented 9 months ago

@PromtEngineer I am trying to use one of the models suggested in constants.py for GPTQ, as per your reply. I have also checked torch.cuda.is_available(), which returns True; however, I am getting an error that says:

FileNotFoundError: Could not find model in TheBloke/WizardLM-7B-uncensored-GPTQ

It is true that this isn't in the models folder, but I felt sure the tutorial said the model would be downloaded. I have uncommented lines 158 & 159 and commented out lines 98 & 99 of constants.py, and I am running python3 run_localGPT.py --device_type cuda --show_sources --use_history

erswelljustin commented 9 months ago

I have updated MODEL_BASENAME to "model.safetensors" and it is working now. Thanks for your help!
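For anyone hitting the same FileNotFoundError, here is a minimal sketch of the constants.py entries that resolved it, based on the model named in this thread (the GGUF entries should stay commented out, and the exact line numbers in your copy of constants.py may differ):

    # constants.py -- select a GPTQ model for CUDA (comment out the GGUF entries)
    MODEL_ID = "TheBloke/WizardLM-7B-uncensored-GPTQ"
    MODEL_BASENAME = "model.safetensors"  # the .safetensors file inside the GPTQ repo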

sanjeevzt commented 8 months ago

For Windows, BLAS = 0 if we keep the double quotation marks in; it works with the GPU and shows BLAS = 1 if we set the variables without double quotation marks. The following worked for me:

    setx CMAKE_ARGS -DLLAMA_CUBLAS=on
    setx FORCE_CMAKE 1
    pip install llama-cpp-python==0.1.83 --no-cache-dir

OldFansBG commented 3 weeks ago

@N1h1lv5's steps above helped with my issue.