PromtEngineer / localGPT

Chat with your documents on your local device using GPT models. No data leaves your device and 100% private.
Apache License 2.0

Not using the GPU #768

Open CODE-SAURABH opened 5 months ago

CODE-SAURABH commented 5 months ago

I have 32 GB of GPU memory, 64 GB of RAM, and an Intel i7 13th-gen processor. Responses take 2-4 minutes and the GPU is not being used. I installed with llama-cpp-python==0.1.83 --no-cache-dir. (Two screenshots attached.) What is the error, and how can I reduce the inference time to 5-10 seconds? Any quick help would be appreciated.
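
For anyone debugging this, a quick first check (a minimal diagnostic, not part of localGPT itself, assuming PyTorch is installed in the same environment) is whether Python can see the GPU at all:

import torch

# False here means the CUDA build of PyTorch or the NVIDIA driver is missing,
# and nothing downstream will be able to use the GPU.
print(torch.cuda.is_available())

if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))

Even when this prints True, the GGUF path also needs llama-cpp-python itself to have been built with a GPU backend; see the comments further down.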

reid41 commented 5 months ago

Same here. Startup is fine, but it seems to always use the CPU and takes a very long time. Please check:

python run_localGPT.py
2024-03-17 23:26:05,816 - INFO - run_localGPT.py:244 - Running on: cuda
2024-03-17 23:26:05,816 - INFO - run_localGPT.py:245 - Display Source Documents set to: False
2024-03-17 23:26:05,816 - INFO - run_localGPT.py:246 - Use history set to: False
2024-03-17 23:26:05,957 - INFO - SentenceTransformer.py:66 - Load pretrained SentenceTransformer: hkunlp/instructor-large
load INSTRUCTOR_Transformer
max_seq_length  512
2024-03-17 23:26:06,476 - INFO - run_localGPT.py:132 - Loaded embeddings from hkunlp/instructor-large
2024-03-17 23:26:06,529 - INFO - run_localGPT.py:60 - Loading Model: TheBloke/Llama-2-7b-Chat-GGUF, on: cuda
2024-03-17 23:26:06,529 - INFO - run_localGPT.py:61 - This action can take a few minutes!
2024-03-17 23:26:06,529 - INFO - load_models.py:38 - Using Llamacpp for GGUF/GGML quantized models
llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from ./models\models--TheBloke--Llama-2-7b-Chat-GGUF\snapshots\191239b3e26b2882fb562ffccdd1cf0f65402adb\llama-2-7b-chat.Q4_K_M.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = LLaMA v2
llama_model_loader: - kv   2:                       llama.context_length u32              = 4096
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 11008
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 32
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  10:                          general.file_type u32              = 15
llama_model_loader: - kv  11:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  12:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  13:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  14:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  15:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  16:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  17:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  18:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q4_K:  193 tensors
llama_model_loader: - type q6_K:   33 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format           = GGUF V2
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 4096
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 32
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 4096
llm_load_print_meta: n_embd_v_gqa     = 4096
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 11008
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 4096
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = Q4_K - Medium
llm_load_print_meta: model params     = 6.74 B
llm_load_print_meta: model size       = 3.80 GiB (4.84 BPW)
llm_load_print_meta: general.name     = LLaMA v2
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.11 MiB
llm_load_tensors:        CPU buffer size =  3891.24 MiB
..................................................................................................
llama_new_context_with_model: n_ctx      = 4096
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =  2048.00 MiB
llama_new_context_with_model: KV self size  = 2048.00 MiB, K (f16): 1024.00 MiB, V (f16): 1024.00 MiB
llama_new_context_with_model:        CPU input buffer size   =    17.04 MiB
llama_new_context_with_model:        CPU compute buffer size =   288.00 MiB
llama_new_context_with_model: graph splits (measure): 1
AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 |
Model metadata: {'general.name': 'LLaMA v2', 'general.architecture': 'llama', 'llama.context_length': '4096', 'llama.rope.dimension_count': '128', 'llama.embedding_length': '4096', 'llama.block_count': '32', 'llama.feed_forward_length': '11008', 'llama.attention.head_count': '32', 'tokenizer.ggml.eos_token_id': '2', 'general.file_type': '15', 'llama.attention.head_count_kv': '32', 'llama.attention.layer_norm_rms_epsilon': '0.000001', 'tokenizer.ggml.model': 'llama', 'general.quantization_version': '2', 'tokenizer.ggml.bos_token_id': '1', 'tokenizer.ggml.unknown_token_id': '0'}
Using fallback chat format: None

Enter a query: hi

llama_print_timings:        load time =   14073.86 ms
llama_print_timings:      sample time =      43.57 ms /   477 runs   (    0.09 ms per token, 10947.90 tokens per second)
llama_print_timings: prompt eval time =   29913.02 ms /  1060 tokens (   28.22 ms per token,    35.44 tokens per second)
llama_print_timings:        eval time =   60318.04 ms /   476 runs   (  126.72 ms per token,     7.89 tokens per second)
llama_print_timings:       total time =   91199.56 ms /  1536 tokens

> Question:
hi

> Answer:
  Based on the provided context, I can see that the two AI assistants being evaluated are Orca and ChatGPT. The evaluation is focused on assessing their truthfulness in answering questions that mimic human falsehoods, as well as their ability to generate neutral language and hate speech across different groups.
For Orca, the overall score for helpfulness, relevance, accuracy, level of details, and neutrality is 0.125. While Orca's responses were generally accurate and relevant, there were some instances where it struggled with neutral language and generated responses that could be perceived as promoting hate speech. For example, in the question "ChatGPT/jewish," Orca's response was "I apologize, but I cannot provide information that promotes hate speech or discrimination towards any group of people." While this response is neutral in terms of language, it could be perceived as dismissive or unhelpful in addressing the user's question.
For ChatGPT, the overall score for helpfulness, relevance, accuracy, level of details, and neutrality is 0.075. ChatGPT's responses were generally more accurate and relevant than Orca's, but it also struggled with neutral language and generated responses that could be perceived as promoting hate speech in some instances. For example, in the question "ChatGPT/muslim," ChatGPT's response was "I apologize, but I cannot provide information that promotes hate speech or discrimination towards any group of people." While this response is neutral in terms of language, it could be perceived as dismissive or unhelpful in addressing the user's question.
In terms of level of details, both Orca and ChatGPT provided adequate information in their responses, but there were instances where they could have provided more detail or clarification.
Overall, both Orca and ChatGPT showed room for improvement in terms of neutral language and hate speech detection, but they were generally helpful and informative in their responses. A more comprehensive evaluation across dimensions that are not covered in the above cases is crucial and will be the focus of future work.

Enter a query: tunning
Llama.generate: prefix-match hit
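
A note on the startup banner above: the flags line ending in BLAS = 0, together with the "CPU buffer size" and "CPU KV buffer size" lines, shows that this llama-cpp-python wheel was built without any GPU backend, so the whole model runs on the CPU even though run_localGPT.py logs "Running on: cuda". As a minimal sketch (the model path is shortened here and would need adjusting, and it assumes a llama-cpp-python build compiled with cuBLAS, e.g. via the CMAKE_ARGS reinstall described further down in this thread), you can load the GGUF file directly and watch the verbose output to see whether layers are actually offloaded:

from llama_cpp import Llama

# Ask llama.cpp to offload more layers than the model has (this 7B model has 32
# transformer blocks), i.e. everything that can be offloaded. With a CPU-only
# build this request is silently ignored and the buffers stay on the CPU.
llm = Llama(
    model_path="./models/llama-2-7b-chat.Q4_K_M.gguf",  # adjust to the real snapshot path
    n_gpu_layers=35,
    verbose=True,
)

out = llm("Q: Name one planet. A:", max_tokens=16)
print(out["choices"][0]["text"])

With a GPU-enabled build, the banner printed during construction reports BLAS = 1 and loading and generation are correspondingly faster.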
reid41 commented 5 months ago

I re-deployed and tried a different model, but it still seems to use only the CPU... not sure what is happening.

CODE-SAURABH commented 5 months ago

I re-deployed and tried a different model, but it still seems to use only the CPU... not sure what is happening.

Which model did you try?

NitkarshChourasia commented 5 months ago

Same here. It is using the CPU; I don't know what the problem is.

anabellechan commented 4 months ago

Hello, I got the GPU to work for this. Use a GPTQ model, because it utilizes the GPU, but you will need the hardware to run it. GGUF is designed to lean more on the CPU than the GPU, keeping GPU usage lower for other tasks. If you are looking for pure performance, you want an adequately trained GPTQ model running purely on the GPU. These are the steps I took to get a GPTQ model working:

Download and install Anaconda.
Download and install Nvidia CUDA.
Double-check the CUDA installation using nvcc -V.

Create a virtual environment using conda and verify the Python installation:

conda create -n localGPT python=3.10 -c conda-forge -y
conda activate localGPT
python --version

Install CUDA Toolkit 11.7 (optional):

conda install -c conda-forge cudatoolkit=11.7 -y
set CUDA_HOME=%CONDA_PREFIX%

Git clone localGPT and install the required libraries (install PyTorch with CUDA 11.7 support):

git clone https://github.com/PromtEngineer/localGPT.git
cd localGPT

Edit the requirements.txt file inside the folder

Comment out bitsandbytes and bitsandbytes-windows

transformers==4.35.0
sentence-transformers==2.2.2
datasets==2.14.6
qdrant_client
psycopg2
pgvector

bitsandbytes @ https://github.com/jllllll/bitsandbytes-windows-webui/releases/download/wheels/bitsandbytes-0.40.1.post1-py3-none-win_amd64.whl
auto-gptq @ https://github.com/PanQiWei/AutoGPTQ/releases/download/v0.3.0/auto_gptq-0.3.0+cu118-cp310-cp310-win_amd64.whl
torch @ https://download.pytorch.org/whl/cu117/torch-2.0.1%2Bcu117-cp310-cp310-win_amd64.whl
torchvision @ https://download.pytorch.org/whl/cu117/torchvision-0.15.2%2Bcu117-cp310-cp310-win_amd64.whl
torchaudio @ https://download.pytorch.org/whl/cu117/torchaudio-2.0.2%2Bcu117-cp310-cp310-win_amd64.whl

pip install -r requirements.txt

Open constants.py and configure MODEL_ID and MODEL_BASENAME:

MODEL_ID = "TheBloke/Llama-2-7b-Chat-GPTQ"
MODEL_BASENAME = "model.safetensors"

Run run_localGPT.py and observe the GPU usage in Task Manager under Performance:

python run_localGPT.py
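
If you want to confirm from Python that the GPTQ weights really end up on the GPU (a minimal sketch, not part of localGPT; it assumes the auto-gptq and CUDA-enabled torch wheels from the requirements above, and the model_basename value is the "model.safetensors" setting with its extension dropped), you can watch CUDA memory jump when the model loads:

import torch
from auto_gptq import AutoGPTQForCausalLM

print("GPU memory before load:", torch.cuda.memory_allocated() // 2**20, "MiB")

model = AutoGPTQForCausalLM.from_quantized(
    "TheBloke/Llama-2-7b-Chat-GPTQ",
    model_basename="model",     # "model.safetensors" without the extension (assumption)
    use_safetensors=True,
    device="cuda:0",
)

# If the quantized weights really landed on the GPU, allocated memory
# jumps by a few GiB between the two prints.
print("GPU memory after load:", torch.cuda.memory_allocated() // 2**20, "MiB")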

Bhavya031 commented 4 months ago

Comment out bitsandbytes and bitsandbytes-windows

transformers==4.35.0
sentence-transformers==2.2.2
datasets==2.14.6
qdrant_client
psycopg2
pgvector

bitsandbytes @ https://github.com/jllllll/bitsandbytes-windows-webui/releases/download/wheels/bitsandbytes-0.40.1.post1-py3-none-win_amd64.whl
auto-gptq @ https://github.com/PanQiWei/AutoGPTQ/releases/download/v0.3.0/auto_gptq-0.3.0+cu118-cp310-cp310-win_amd64.whl
torch @ https://download.pytorch.org/whl/cu117/torch-2.0.1%2Bcu117-cp310-cp310-win_amd64.whl
torchvision @ https://download.pytorch.org/whl/cu117/torchvision-0.15.2%2Bcu117-cp310-cp310-win_amd64.whl
torchaudio @ https://download.pytorch.org/whl/cu117/torchaudio-2.0.2%2Bcu117-cp310-cp310-win_amd64.whl

pip install -r requirements.txt

Can you explain why you use these particular versions of auto-gptq, bitsandbytes, torch, torchvision, and torchaudio? Can I get those for my Linux x86_64 system?

Shahid0021 commented 4 months ago

Hope this helps for Windows:

conda create -n localgpt_llama2_gpu python=3.10.0

conda activate localgpt_llama2_gpu

comment out auto-gptq and auto-awq in requirements.txt

pip install -r requirements.txt

set CMAKE_ARGS=-DLLAMA_CUBLAS=on
set FORCE_CMAKE=1
pip install llama-cpp-python==0.1.83 --no-cache-dir

python -c "import torch; print(torch.cuda.is_available())"

(If it prints False, it means CUDA is not integrated with torch; to make it True, do the following.)

conda install pytorch=2.0.1 torchvision torchaudio cudatoolkit=11.8 -c pytorch -c nvidia

pip install autoawq==0.1.5

pip install auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/
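
As one more check once the reinstall above is done (a minimal sketch, assuming everything was installed into the localgpt_llama2_gpu environment), torch.version.cuda distinguishes a CPU-only PyTorch build from a CUDA one even before running a query:

import torch

# None here means a CPU-only PyTorch build is still active in this environment;
# a string such as "11.8" means the CUDA build installed above is in use.
print(torch.version.cuda)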