Open jiapei100 opened 1 year ago
@jiapei100 there is something else going on that is occupying GPU memory. The 7B GGML version should use around 6 GB, plus roughly 3 GB for the instructor embedding model. I am not sure what exactly is causing this.
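To see what else is holding VRAM on each card, nvidia-smi works, or a quick check from Python (a rough sketch, assuming PyTorch with CUDA support is installed):

```python
import torch

# Print free / total VRAM for every visible CUDA device.
for i in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(i)  # values are in bytes
    name = torch.cuda.get_device_name(i)
    print(f"GPU {i} ({name}): {free / 1e9:.1f} GB free of {total / 1e9:.1f} GB")
```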
@PromtEngineer
I successfully ran TheBloke's Llama-2-13B-chat-GGML with oobabooga text-generation-webui.
But I just want to have some fun with your localGPT. Unfortunately, it fails with the ERROR messages above.
@PromtEngineer
It ran for a while, but soon hit ggml_new_tensor_impl: not enough space in the context's memory pool, even with n_gpu_layers changed to 32.
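For reference, this is roughly how I understand the GGML model gets loaded through langchain's LlamaCpp wrapper (a sketch with assumed values; the real kwargs live in run_localGPT.py / constants.py):

```python
from langchain.llms import LlamaCpp

# Assumed values -- the real ones are set in run_localGPT.py / constants.py.
llm = LlamaCpp(
    model_path="llama-2-7b-chat.ggmlv3.q4_0.bin",
    n_ctx=2048,       # context window: prompt + retrieved chunks must fit here
    n_gpu_layers=32,  # number of transformer layers offloaded to VRAM
    n_batch=512,      # prompt-processing batch size; larger needs more VRAM
    max_tokens=2048,  # upper bound on generated tokens
)
```

Here is the full run: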
➜ localGPT git:(main) ✗ python run_localGPT.py --device_type cuda
2023-08-07 21:11:15.846266: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9346] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-08-07 21:11:15.846308: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-08-07 21:11:15.846314: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
WARNING:tensorflow:From ~/.local/lib/python3.10/site-packages/tensorflow/python/ops/distributions/distribution.py:259: ReparameterizationType.__init__ (from tensorflow.python.ops.distributions.distribution) is deprecated and will be removed after 2019-01-01.
Instructions for updating:
The TensorFlow Distributions library has moved to TensorFlow Probability (https://github.com/tensorflow/probability). You should update all references to use `tfp.distributions` instead of `tf.distributions`.
WARNING:tensorflow:From ~/.local/lib/python3.10/site-packages/tensorflow/python/ops/distributions/bernoulli.py:165: RegisterKL.__init__ (from tensorflow.python.ops.distributions.kullback_leibler) is deprecated and will be removed after 2019-01-01.
Instructions for updating:
The TensorFlow Distributions library has moved to TensorFlow Probability (https://github.com/tensorflow/probability). You should update all references to use `tfp.distributions` instead of `tf.distributions`.
2023-08-07 21:11:19,455 - INFO - run_localGPT.py:188 - Running on: cuda
2023-08-07 21:11:19,455 - INFO - run_localGPT.py:189 - Display Source Documents set to: False
2023-08-07 21:11:19,568 - INFO - SentenceTransformer.py:66 - Load pretrained SentenceTransformer: hkunlp/instructor-large
load INSTRUCTOR_Transformer
max_seq_length 512
2023-08-07 21:11:21,748 - INFO - run_localGPT.py:53 - Loading Model: TheBloke/Llama-2-7B-Chat-GGML, on: cuda
2023-08-07 21:11:21,749 - INFO - run_localGPT.py:54 - This action can take a few minutes!
2023-08-07 21:11:21,749 - INFO - run_localGPT.py:58 - Using Llamacpp for GGML quantized models
ggml_init_cublas: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6
Device 1: NVIDIA GeForce RTX 2080 Ti, compute capability 7.5
llama.cpp: loading model from ~/.cache/huggingface/hub/models--TheBloke--Llama-2-7B-Chat-GGML/snapshots/b616819cd4777514e3a2d9b8be69824aca8f5daf/llama-2-7b-chat.ggmlv3.q4_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 2048
llama_model_load_internal: n_embd = 4096
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_head_kv = 32
llama_model_load_internal: n_layer = 32
llama_model_load_internal: n_rot = 128
llama_model_load_internal: n_gqa = 1
llama_model_load_internal: rnorm_eps = 1.0e-06
llama_model_load_internal: n_ff = 11008
llama_model_load_internal: freq_base = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 0.08 MB
llama_model_load_internal: using CUDA for GPU acceleration
ggml_cuda_set_main_device: using device 0 (NVIDIA GeForce RTX 3090) as main device
llama_model_load_internal: mem required = 468.40 MB (+ 1024.00 MB per state)
llama_model_load_internal: allocating batch_size x (512 kB + n_ctx x 128 B) = 384 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 32 repeating layers to GPU
llama_model_load_internal: offloading non-repeating layers to GPU
llama_model_load_internal: offloading v cache to GPU
llama_model_load_internal: offloading k cache to GPU
llama_model_load_internal: offloaded 35/35 layers to GPU
llama_model_load_internal: total VRAM used: 4954 MB
llama_new_context_with_model: kv self size = 1024.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
Enter a query: How long are you able to finish loading?
llama_print_timings: load time = 2607.55 ms
llama_print_timings: sample time = 26.23 ms / 42 runs ( 0.62 ms per token, 1601.04 tokens per second)
llama_print_timings: prompt eval time = 2607.49 ms / 489 tokens ( 5.33 ms per token, 187.54 tokens per second)
llama_print_timings: eval time = 911.52 ms / 41 runs ( 22.23 ms per token, 44.98 tokens per second)
llama_print_timings: total time = 3606.91 ms
> Question:
How long are you able to finish loading?
> Answer:
I'm just an AI assistant, I don't have a limit on how long I can continue to answer questions. However, if you need help with something else, feel free to ask!
Enter a query: Why do you answer me twice?
Llama.generate: prefix-match hit
llama_print_timings: load time = 2607.55 ms
llama_print_timings: sample time = 58.11 ms / 93 runs ( 0.62 ms per token, 1600.41 tokens per second)
llama_print_timings: prompt eval time = 3198.56 ms / 590 tokens ( 5.42 ms per token, 184.46 tokens per second)
llama_print_timings: eval time = 2098.86 ms / 92 runs ( 22.81 ms per token, 43.83 tokens per second)
llama_print_timings: total time = 5494.41 ms
> Question:
Why do you answer me twice?
> Answer:
I apologize for repeating my answer. It was an error on my part. To answer your original question, I don't know the answer to why the President can serve more than two terms as President according to the 22nd and 23rd Amendments to the United States Constitution. The amendments do not provide a clear explanation for this limitation, and it is not explicitly stated in the text of the amendments themselves.
Enter a query: Have you ever heard of jetbot?
Llama.generate: prefix-match hit
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 18617376, available 10485760)
[2] 374615 segmentation fault (core dumped) python run_localGPT.py --device_type cuda
@jiapei100 interesting, there might be a memory leak somewhere. Can you test a GPTQ or HF model instead of GGML?
@PromtEngineer It is interesting. This time, I asked only one question and got the following crash:
......
llama_new_context_with_model: kv self size = 1024.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
Enter a query: Hi, how are you?
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 18682928, available 10485760)
[2] 379801 segmentation fault (core dumped) python run_localGPT.py --device_type cuda
However, I can clearly see that GPU memory is still far from exhausted ...
Could it have something to do with the 2 GPUs? From the GPU memory usage, it looks like it's trying to use both GPUs instead of just one.
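If it helps to test the single-GPU case, one option is to hide the 2080 Ti from the process before any CUDA library initializes, e.g. at the very top of run_localGPT.py (a sketch; the variable can equally be exported in the shell before launching):

```python
import os

# Expose only the RTX 3090 (device 0) to CUDA. This must run before torch and
# llama-cpp-python are imported, otherwise both GPUs are already registered.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
```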
@PromtEngineer
Debugging shows it crashes on exactly this line:
# Get the answer from the chain
res = qa(query)
where qa is of type:
<class 'langchain.chains.retrieval_qa.base.RetrievalQA'>
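For context, my understanding is that the chain is assembled roughly like this (a sketch, not the exact run_localGPT.py code; db, llm, and the k value are placeholders). Shrinking the retriever's k would keep the stuffed prompt, and hence the llama.cpp context, smaller:

```python
from langchain.chains import RetrievalQA

# db is the Chroma vector store and llm the LlamaCpp model loaded earlier.
retriever = db.as_retriever(search_kwargs={"k": 2})  # default k is 4 chunks
qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",  # all retrieved chunks are stuffed into a single prompt
    retriever=retriever,
    return_source_documents=True,
)

res = qa("Hi, how are you?")  # the call that crashes when the prompt outgrows the pool
answer, docs = res["result"], res["source_documents"]
```

And my langchain version: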
➜ ~ pip show langchain
Name: langchain
Version: 0.0.256
Summary: Building applications with LLMs through composability
Home-page: https://www.github.com/hwchase17/langchain
Author:
Author-email:
License: MIT
Location: ~/.local/lib/python3.10/site-packages
Requires: aiohttp, async-timeout, dataclasses-json, langsmith, numexpr, numpy, openapi-schema-pydantic, pydantic, PyYAML, requests, SQLAlchemy, tenacity
Required-by: h2ogpt, llama-index
It's already the NEWEST version ...
Same issue here. I fixed it by rebuilding the vector DB index and deleting some files in SOURCE_DOCUMENTS.
- before
tree SOURCE_DOCUMENTS/
SOURCE_DOCUMENTS/
├── test2.txt
└── test.txt
0 directories, 2 files
llama_model_load_internal: allocating batch_size x (512 kB + n_ctx x 128 B) = 384 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 32 repeating layers to GPU
llama_model_load_internal: offloading non-repeating layers to GPU
llama_model_load_internal: offloading v cache to GPU
llama_model_load_internal: offloading k cache to GPU
llama_model_load_internal: offloaded 35/35 layers to GPU
llama_model_load_internal: total VRAM used: 4101 MB
llama_new_context_with_model: kv self size = 1024.00 MB
Enter a query: Can you explain briefly to me what is the Python programming language?
2023-08-08 06:27:31,471 - ERROR - chroma.py:129 - Chroma collection langchain contains fewer than 4 elements.
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 11013344, available 10485760)
Segmentation fault (core dumped)
- then delete ./DB/*
rm ./DB/* -rf
rm SOURCE_DOCUMENTS/test.txt
tree SOURCE_DOCUMENTS/
SOURCE_DOCUMENTS/
└── test2.txt
python3 ingest.py
... ...
python3 run_localGPT.py
llama_model_load_internal: allocating batch_size x (512 kB + n_ctx x 128 B) = 384 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 32 repeating layers to GPU
llama_model_load_internal: offloading non-repeating layers to GPU
llama_model_load_internal: offloading v cache to GPU
llama_model_load_internal: offloading k cache to GPU
llama_model_load_internal: offloaded 35/35 layers to GPU
llama_model_load_internal: total VRAM used: 4101 MB
llama_new_context_with_model: kv self size = 1024.00 MB
...
Enter a query: Can you explain briefly to me what is the Python programming language?
2023-08-08 06:35:37,554 - ERROR - chroma.py:129 - Chroma collection langchain contains fewer than 4 elements.
2023-08-08 06:35:37,555 - ERROR - chroma.py:129 - Chroma collection langchain contains fewer than 3 elements.
2023-08-08 06:35:37,556 - ERROR - chroma.py:129 - Chroma collection langchain contains fewer than 2 elements.
Llama.generate: prefix-match hit
llama_print_timings: load time = 1202.17 ms
llama_print_timings: sample time = 32.99 ms / 77 runs ( 0.43 ms per token, 2333.90 tokens per second)
llama_print_timings: prompt eval time = 544.66 ms / 40 tokens ( 13.62 ms per token, 73.44 tokens per second)
llama_print_timings: eval time = 2801.63 ms / 76 runs ( 36.86 ms per token, 27.13 tokens per second)
llama_print_timings: total time = 3540.48 ms
Question: Can you explain briefly to me what is the Python programming language?
Answer: Sure! Python is a high-level programming language that is known for its simplicity and readability, making it easy for beginners and experienced programmers alike to learn and use. It can be used on a variety of platforms and has a vast number of libraries, frameworks, and tools available for various tasks. Do you want me to explain the topic in more detail?
- model used
```python
model_id = "TheBloke/Llama-2-7B-Chat-GGML"
model_basename = "llama-2-7b-chat.ggmlv3.q2_K.bin"
```
@jiapei100 I think it might be trying to use your 2080. Can you try the GPTQ models or HF models and see if you run into the same issue? Or try deleting the DB as @farlandliu suggested above.
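Switching away from GGML should only require changing the two model constants, something like the following (hypothetical values; the basename has to match the actual .safetensors file in the Hugging Face repo, and full HF models use model_basename = None):

```python
# constants.py -- example alternatives to the GGML checkpoint (hypothetical values).

# GPTQ (quantized); the basename must match the file that actually sits in the repo:
model_id = "TheBloke/Llama-2-7B-Chat-GPTQ"
model_basename = "gptq_model-4bit-128g.safetensors"

# Or a full-precision HF model, which needs no basename at all:
# model_id = "meta-llama/Llama-2-7b-chat-hf"
# model_basename = None
```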
I seem to face a similar issue. Models run fine on oobabooga text-generation-webui; they load into the GPU and utilize it. When I load them with LocalGPT, the 13B ones are extremely unresponsive. It's as if they are only using the CPU. It's the same for GPTQ and GGML models for me.
(localGPT3) D:\programming\localGPT>python run_localGPT.py
2023-08-08 23:37:15,990 - INFO - run_localGPT.py:181 - Running on: cuda
2023-08-08 23:37:15,990 - INFO - run_localGPT.py:182 - Display Source Documents set to: False
2023-08-08 23:37:16,696 - INFO - SentenceTransformer.py:66 - Load pretrained SentenceTransformer: hkunlp/instructor-large
load INSTRUCTOR_Transformer
max_seq_length 512
2023-08-08 23:37:23,046 - INFO - __init__.py:88 - Running Chroma using direct local API.
2023-08-08 23:37:23,075 - WARNING - __init__.py:43 - Using embedded DuckDB with persistence: data will be stored in: D:\programming\localGPT/DB
2023-08-08 23:37:23,100 - INFO - ctypes.py:22 - Successfully imported ClickHouse Connect C data optimizations
2023-08-08 23:37:23,128 - INFO - json_impl.py:45 - Using ujson library for writing JSON byte strings
2023-08-08 23:37:23,181 - INFO - duckdb.py:460 - loaded in 72 embeddings
2023-08-08 23:37:23,184 - INFO - duckdb.py:472 - loaded in 1 collections
2023-08-08 23:37:23,189 - INFO - duckdb.py:89 - collection with name langchain already exists, returning existing collection
2023-08-08 23:37:23,189 - INFO - run_localGPT.py:45 - Loading Model: TheBloke/WizardLM-13B-Uncensored-GGML, on: cuda
2023-08-08 23:37:23,190 - INFO - run_localGPT.py:46 - This action can take a few minutes!
2023-08-08 23:37:23,190 - INFO - run_localGPT.py:50 - Using Llamacpp for GGML quantized models
llama.cpp: loading model from C:\Users\----\.cache\huggingface\hub\models--TheBloke--WizardLM-13B-Uncensored-GGML\snapshots\b0860b42a513a322423f00b71fec82066182bcfb\wizardLM-13B-Uncensored.ggmlv3.q8_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32001
llama_model_load_internal: n_ctx = 2048
llama_model_load_internal: n_embd = 5120
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 40
llama_model_load_internal: n_layer = 40
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 7 (mostly Q8_0)
llama_model_load_internal: n_ff = 13824
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size = 0.09 MB
llama_model_load_internal: mem required = 15237.96 MB (+ 1608.00 MB per state)
llama_new_context_with_model: kv self size = 1600.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
Enter a query: test me
llama_print_timings: load time = 211020.42 ms
llama_print_timings: sample time = 1.20 ms / 7 runs ( 0.17 ms per token, 5843.07 tokens per second)
llama_print_timings: prompt eval time = 211020.28 ms / 1026 tokens ( 205.67 ms per token, 4.86 tokens per second)
llama_print_timings: eval time = 3198.85 ms / 6 runs ( 533.14 ms per token, 1.88 tokens per second)
llama_print_timings: total time = 214232.49 ms
Question:
test me
Answer:
I don't know.
While the model runs and does give me an answer, it takes a very long time and I do not see any of the n_layers being offloaded to the GPU.
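Worth noting: the banner above shows BLAS = 0 and no "offloading ... layers to GPU" lines, which as far as I know means llama-cpp-python was built without cuBLAS, so everything stays on the CPU. A rough sketch of what a GPU-enabled load should look like (the rebuild command is the one from the llama-cpp-python docs; on Windows, set CMAKE_ARGS and FORCE_CMAKE with `set` or `$env:` first, and the model path below is just an example):

```python
# Rebuild llama-cpp-python with cuBLAS first (shell command, shown here as a comment):
#   CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install --force-reinstall --no-cache-dir llama-cpp-python

from llama_cpp import Llama

# After a successful rebuild, loading a model should print "BLAS = 1" in the
# llama.cpp banner together with "offloading ... layers to GPU" lines.
llm = Llama(
    model_path="wizardLM-13B-Uncensored.ggmlv3.q8_0.bin",  # example path, adjust
    n_ctx=2048,
    n_gpu_layers=40,  # 13B Llama models have 40 transformer layers
)
```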
> Same issue here. I fixed it by rebuilding the vector DB index and deleting some files in SOURCE_DOCUMENTS. [...]
Not working for me ...
> Can you try the GPTQ models or HF models and see if you run into the same issue? Or try deleting the DB as @farlandliu suggested above.
As someone who encountered the same issue, for me, using a GPTQ model instead of GGML worked.
I believe I used to run llama-2-7b-chat.ggmlv3.q4_0.bin successfully locally. My 3090 comes with 24 GB of GPU memory, which should be enough for running this model. So, how much memory does llama-2-7b-chat.ggmlv3.q4_0.bin require at minimum when using localGPT?
Cheers