PromtEngineer / localGPT

Chat with your documents on your local device using GPT models. No data leaves your device, and it is 100% private.
Apache License 2.0

model inference is pretty slow #394

Open shao-shuai opened 1 year ago

shao-shuai commented 1 year ago
2023-08-20 14:20:27,502 - INFO - run_localGPT.py:180 - Running on: cuda
2023-08-20 14:20:27,502 - INFO - run_localGPT.py:181 - Display Source Documents set to: True
2023-08-20 14:20:27,690 - INFO - SentenceTransformer.py:66 - Load pretrained SentenceTransformer: hkunlp/instructor-large
load INSTRUCTOR_Transformer
max_seq_length  512
2023-08-20 14:20:30,007 - INFO - __init__.py:88 - Running Chroma using direct local API.
2023-08-20 14:20:30,011 - WARNING - __init__.py:43 - Using embedded DuckDB with persistence: data will be stored in: /home/shuaishao/ai/localgpt_llama2/DB
2023-08-20 14:20:30,014 - INFO - ctypes.py:22 - Successfully imported ClickHouse Connect C data optimizations
2023-08-20 14:20:30,019 - INFO - json_impl.py:45 - Using python library for writing JSON byte strings
2023-08-20 14:20:30,046 - INFO - duckdb.py:460 - loaded in 144 embeddings
2023-08-20 14:20:30,047 - INFO - duckdb.py:472 - loaded in 1 collections
2023-08-20 14:20:30,048 - INFO - duckdb.py:89 - collection with name langchain already exists, returning existing collection
2023-08-20 14:20:30,048 - INFO - run_localGPT.py:45 - Loading Model: TheBloke/Llama-2-7B-Chat-GGML, on: cuda
2023-08-20 14:20:30,048 - INFO - run_localGPT.py:46 - This action can take a few minutes!
2023-08-20 14:20:30,048 - INFO - run_localGPT.py:50 - Using Llamacpp for GGML quantized models
llama.cpp: loading model from /home/shuaishao/.cache/huggingface/hub/models--TheBloke--Llama-2-7B-Chat-GGML/snapshots/b616819cd4777514e3a2d9b8be69824aca8f5daf/llama-2-7b-chat.ggmlv3.q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 2048
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.07 MB
llama_model_load_internal: mem required  = 5407.71 MB (+ 1026.00 MB per state)
llama_new_context_with_model: kv self size  = 1024.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | 

Enter a query: please tell me the details of the second amendment 

llama_print_timings:        load time = 74359.92 ms
llama_print_timings:      sample time =    78.86 ms /   166 runs   (    0.48 ms per token,  2104.86 tokens per second)
llama_print_timings: prompt eval time = 74359.80 ms /  1109 tokens (   67.05 ms per token,    14.91 tokens per second)
llama_print_timings:        eval time = 41306.74 ms /   165 runs   (  250.34 ms per token,     3.99 tokens per second)
llama_print_timings:       total time = 116048.42 ms

> Question:
please tell me the details of the second amendment

> Answer:
 The Second Amendment to the United States Constitution states that "A well-regulated Militia, being necessary to the security of a free State, the right of the people to keep and bear Arms, shall not be infringed." This means that individuals have the right to own and carry firearms as part of a militia, which is a group of citizens who are trained and equipped to defend their state or country. The amendment does not explicitly prohibit the government from regulating or restricting the ownership of firearms in other contexts, such as for personal protection or hunting. However, the Supreme Court has interpreted this amendment to apply to all forms of gun ownership and use, and to limit any attempts by the government to restrict these rights.

GPU: NVIDIA RTX 3060 (6 GB VRAM), RAM: 16 GB

Is there any way to fix this? I thought llama.cpp was running on the GPU, but it seems it isn't. #390

imjwang commented 1 year ago

Hey, it seems like BLAS is not detected. Try running:

CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install -U llama-cpp-python --no-cache-dir

You should see something like this on startup:

llama_model_load_internal: using CUDA for GPU acceleration
llama_model_load_internal: mem required  =  596.40 MB (+ 2048.00 MB per state)
llama_model_load_internal: allocating batch_size x (512 kB + n_ctx x 128 B) = 512 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 32 repeating layers to GPU
llama_model_load_internal: offloading non-repeating layers to GPU
llama_model_load_internal: offloading v cache to GPU
llama_model_load_internal: offloading k cache to GPU
llama_model_load_internal: offloaded 35/35 layers to GPU
llama_model_load_internal: total VRAM used: 6106 MB
llama_new_context_with_model: kv self size  = 2048.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | 

If that fails, can you provide the output of nvcc --version?
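
If you want to sanity-check the rebuild outside of localGPT first, here is a minimal sketch using llama-cpp-python directly (the model path is the one from your log; n_gpu_layers and the prompt are just placeholders, so adjust them to your setup):

# Minimal sanity check, assuming llama-cpp-python was rebuilt with cuBLAS and
# the GGML file is already in the Hugging Face cache path shown in the log above.
from llama_cpp import Llama

llm = Llama(
    model_path=(
        "/home/shuaishao/.cache/huggingface/hub/models--TheBloke--Llama-2-7B-Chat-GGML/"
        "snapshots/b616819cd4777514e3a2d9b8be69824aca8f5daf/llama-2-7b-chat.ggmlv3.q4_0.bin"
    ),
    n_ctx=2048,
    n_gpu_layers=32,  # number of layers to offload; lower this if VRAM runs out
)
# During construction llama.cpp prints its banner; it should now show BLAS = 1
# and "offloaded N/35 layers to GPU" if CUDA acceleration is active.
out = llm("Q: What does the Second Amendment say? A:", max_tokens=64)
print(out["choices"][0]["text"])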

shao-shuai commented 1 year ago

$ CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install -U llama-cpp-python --no-cache-dir
Collecting llama-cpp-python
  Downloading llama_cpp_python-0.1.78.tar.gz (1.7 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.7/1.7 MB 36.3 MB/s eta 0:00:00
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Requirement already satisfied: typing-extensions>=4.5.0 in /home/shuaishao/anaconda3/lib/python3.9/site-packages (from llama-cpp-python) (4.7.1)
Collecting diskcache>=5.6.1
  Downloading diskcache-5.6.1-py3-none-any.whl (45 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 45.6/45.6 kB 383.1 MB/s eta 0:00:00
Requirement already satisfied: numpy>=1.20.0 in /home/shuaishao/anaconda3/lib/python3.9/site-packages (from llama-cpp-python) (1.21.5)
Building wheels for collected packages: llama-cpp-python
  Building wheel for llama-cpp-python (pyproject.toml) ... error
  error: subprocess-exited-with-error

  × Building wheel for llama-cpp-python (pyproject.toml) did not run successfully.
  │ exit code: 1
  ╰─> [119 lines of output]

      --------------------------------------------------------------------------------
      -- Trying 'Ninja' generator
      --------------------------------
      ---------------------------
      ----------------------
      -----------------
      ------------
      -------
      --
      CMake Deprecation Warning at CMakeLists.txt:1 (cmake_minimum_required):
        Compatibility with CMake < 3.5 will be removed from a future version of
        CMake.

        Update the VERSION argument <min> value or use a ...<max> suffix to tell
        CMake that the project does not need compatibility with older versions.

      Not searching for unused variables given on the command line.

      -- The C compiler identification is GNU 11.4.0
      -- Detecting C compiler ABI info
      -- Detecting C compiler ABI info - done
      -- Check for working C compiler: /usr/bin/cc - skipped
      -- Detecting C compile features
      -- Detecting C compile features - done
      -- The CXX compiler identification is GNU 11.4.0
      -- Detecting CXX compiler ABI info
      -- Detecting CXX compiler ABI info - done
      -- Check for working CXX compiler: /usr/bin/c++ - skipped
      -- Detecting CXX compile features
      -- Detecting CXX compile features - done
      -- Configuring done (0.3s)
      -- Generating done (0.0s)
      -- Build files have been written to: /tmp/pip-install-1117gkcc/llama-cpp-python_ace14e37ac954b11817dfd1c79d585f5/_cmake_test_compile/build
      --
      -------
      ------------
      -----------------
      ----------------------
      ---------------------------
      --------------------------------
      -- Trying 'Ninja' generator - success
      --------------------------------------------------------------------------------

      Configuring Project
        Working directory:
          /tmp/pip-install-1117gkcc/llama-cpp-python_ace14e37ac954b11817dfd1c79d585f5/_skbuild/linux-x86_64-3.9/cmake-build
        Command:
          /tmp/pip-build-env-hfmmi7u2/overlay/lib/python3.9/site-packages/cmake/data/bin/cmake /tmp/pip-install-1117gkcc/llama-cpp-python_ace14e37ac954b11817dfd1c79d585f5 -G Ninja -DCMAKE_MAKE_PROGRAM:FILEPATH=/tmp/pip-build-env-hfmmi7u2/overlay/lib/python3.9/site-packages/ninja/data/bin/ninja --no-warn-unused-cli -DCMAKE_INSTALL_PREFIX:PATH=/tmp/pip-install-1117gkcc/llama-cpp-python_ace14e37ac954b11817dfd1c79d585f5/_skbuild/linux-x86_64-3.9/cmake-install -DPYTHON_VERSION_STRING:STRING=3.9.13 -DSKBUILD:INTERNAL=TRUE -DCMAKE_MODULE_PATH:PATH=/tmp/pip-build-env-hfmmi7u2/overlay/lib/python3.9/site-packages/skbuild/resources/cmake -DPYTHON_EXECUTABLE:PATH=/home/shuaishao/anaconda3/bin/python -DPYTHON_INCLUDE_DIR:PATH=/home/shuaishao/anaconda3/include/python3.9 -DPYTHON_LIBRARY:PATH=/home/shuaishao/anaconda3/lib/libpython3.9.so -DPython_EXECUTABLE:PATH=/home/shuaishao/anaconda3/bin/python -DPython_ROOT_DIR:PATH=/home/shuaishao/anaconda3 -DPython_FIND_REGISTRY:STRING=NEVER -DPython_INCLUDE_DIR:PATH=/home/shuaishao/anaconda3/include/python3.9 -DPython3_EXECUTABLE:PATH=/home/shuaishao/anaconda3/bin/python -DPython3_ROOT_DIR:PATH=/home/shuaishao/anaconda3 -DPython3_FIND_REGISTRY:STRING=NEVER -DPython3_INCLUDE_DIR:PATH=/home/shuaishao/anaconda3/include/python3.9 -DCMAKE_MAKE_PROGRAM:FILEPATH=/tmp/pip-build-env-hfmmi7u2/overlay/lib/python3.9/site-packages/ninja/data/bin/ninja -DLLAMA_CUBLAS=on -DCMAKE_BUILD_TYPE:STRING=Release -DLLAMA_CUBLAS=on

      Not searching for unused variables given on the command line.
      -- The C compiler identification is GNU 11.4.0
      -- The CXX compiler identification is GNU 11.4.0
      -- Detecting C compiler ABI info
      -- Detecting C compiler ABI info - done
      -- Check for working C compiler: /usr/bin/cc - skipped
      -- Detecting C compile features
      -- Detecting C compile features - done
      -- Detecting CXX compiler ABI info
      -- Detecting CXX compiler ABI info - done
      -- Check for working CXX compiler: /usr/bin/c++ - skipped
      -- Detecting CXX compile features
      -- Detecting CXX compile features - done
      -- Found Git: /usr/bin/git (found version "2.25.1")
      fatal: not a git repository (or any of the parent directories): .git
      fatal: not a git repository (or any of the parent directories): .git
      CMake Warning at vendor/llama.cpp/CMakeLists.txt:117 (message):
        Git repository not found; to enable automatic generation of build info,
        make sure Git is installed and the project is a Git repository.

      -- Performing Test CMAKE_HAVE_LIBC_PTHREAD
      -- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed
      -- Check if compiler accepts -pthread
      -- Check if compiler accepts -pthread - yes
      -- Found Threads: TRUE
      -- Found CUDAToolkit: /usr/include (found version "10.1.243")
      -- cuBLAS found
      -- The CUDA compiler identification is NVIDIA 10.1.243
      -- Detecting CUDA compiler ABI info
      -- Detecting CUDA compiler ABI info - done
      -- Check for working CUDA compiler: /usr/bin/nvcc - skipped
      -- Detecting CUDA compile features
      -- Detecting CUDA compile features - done
      -- Using CUDA architectures: 52;61;70
      -- CMAKE_SYSTEM_PROCESSOR: x86_64
      -- x86 detected
      -- Configuring done (1.3s)
      -- Generating done (0.0s)
      -- Build files have been written to: /tmp/pip-install-1117gkcc/llama-cpp-python_ace14e37ac954b11817dfd1c79d585f5/_skbuild/linux-x86_64-3.9/cmake-build
      [1/9] Building CUDA object vendor/llama.cpp/CMakeFiles/ggml.dir/ggml-cuda.cu.o
      FAILED: vendor/llama.cpp/CMakeFiles/ggml.dir/ggml-cuda.cu.o
      /usr/bin/nvcc  -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DGGML_USE_CUBLAS -DGGML_USE_K_QUANTS -DK_QUANTS_PER_ITERATION=2 -I/tmp/pip-install-1117gkcc/llama-cpp-python_ace14e37ac954b11817dfd1c79d585f5/vendor/llama.cpp/. -O3 -DNDEBUG -std=c++11 "--generate-code=arch=compute_52,code=[compute_52,sm_52]" "--generate-code=arch=compute_61,code=[compute_61,sm_61]" "--generate-code=arch=compute_70,code=[compute_70,sm_70]" -Xcompiler=-fPIC -mf16c -mfma -mavx -mavx2 -Xcompiler -pthread -x cu -c /tmp/pip-install-1117gkcc/llama-cpp-python_ace14e37ac954b11817dfd1c79d585f5/vendor/llama.cpp/ggml-cuda.cu -o vendor/llama.cpp/CMakeFiles/ggml.dir/ggml-cuda.cu.o && /usr/bin/nvcc  -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DGGML_USE_CUBLAS -DGGML_USE_K_QUANTS -DK_QUANTS_PER_ITERATION=2 -I/tmp/pip-install-1117gkcc/llama-cpp-python_ace14e37ac954b11817dfd1c79d585f5/vendor/llama.cpp/. -O3 -DNDEBUG -std=c++11 "--generate-code=arch=compute_52,code=[compute_52,sm_52]" "--generate-code=arch=compute_61,code=[compute_61,sm_61]" "--generate-code=arch=compute_70,code=[compute_70,sm_70]" -Xcompiler=-fPIC -mf16c -mfma -mavx -mavx2 -Xcompiler -pthread -x cu -M /tmp/pip-install-1117gkcc/llama-cpp-python_ace14e37ac954b11817dfd1c79d585f5/vendor/llama.cpp/ggml-cuda.cu -MT vendor/llama.cpp/CMakeFiles/ggml.dir/ggml-cuda.cu.o -o vendor/llama.cpp/CMakeFiles/ggml.dir/ggml-cuda.cu.o.d
      nvcc fatal   : 'f16c': expected a number
      [2/9] Building C object vendor/llama.cpp/CMakeFiles/ggml.dir/ggml-alloc.c.o
      [3/9] Building C object vendor/llama.cpp/CMakeFiles/ggml.dir/k_quants.c.o
      [4/9] Building CXX object vendor/llama.cpp/CMakeFiles/llama.dir/llama.cpp.o
      [5/9] Building C object vendor/llama.cpp/CMakeFiles/ggml.dir/ggml.c.o
      ninja: build stopped: subcommand failed.
      Traceback (most recent call last):
        File "/tmp/pip-build-env-hfmmi7u2/overlay/lib/python3.9/site-packages/skbuild/setuptools_wrap.py", line 674, in setup
          cmkr.make(make_args, install_target=cmake_install_target, env=env)
        File "/tmp/pip-build-env-hfmmi7u2/overlay/lib/python3.9/site-packages/skbuild/cmaker.py", line 697, in make
          self.make_impl(clargs=clargs, config=config, source_dir=source_dir, install_target=install_target, env=env)
        File "/tmp/pip-build-env-hfmmi7u2/overlay/lib/python3.9/site-packages/skbuild/cmaker.py", line 742, in make_impl
          raise SKBuildError(msg)

      An error occurred while building with CMake.
        Command:
          /tmp/pip-build-env-hfmmi7u2/overlay/lib/python3.9/site-packages/cmake/data/bin/cmake --build . --target install --config Release --
        Install target:
          install
        Source directory:
          /tmp/pip-install-1117gkcc/llama-cpp-python_ace14e37ac954b11817dfd1c79d585f5
        Working directory:
          /tmp/pip-install-1117gkcc/llama-cpp-python_ace14e37ac954b11817dfd1c79d585f5/_skbuild/linux-x86_64-3.9/cmake-build
      Please check the install target is valid and see CMake's output for more information.

      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for llama-cpp-python
Failed to build llama-cpp-python
ERROR: Could not build wheels for llama-cpp-python, which is required to install pyproject.toml-based projects

Thanks, I tried the command, but there was some trouble building the wheel.

Here is my nvcc and compiler info:

~/ai/localgpt_llama2$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Sun_Jul_28_19:07:16_PDT_2019
Cuda compilation tools, release 10.1, V10.1.243

gcc version 11.4.0 (Ubuntu 11.4.0-2ubuntu1~20.04) 

$ g++ --version
g++ (Ubuntu 11.4.0-2ubuntu1~20.04) 11.4.0
Copyright (C) 2021 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
imjwang commented 1 year ago

I've seen a few of these errors stemming from llama-cpp-python; the solutions seem to vary. But can you try this?

sudo apt-get install build-essential
sudo apt-get install gcc-11 g++-11
shao-shuai commented 1 year ago

sudo apt-get install build-essential
[sudo] password for shuaishao: 
Reading package lists... Done
Building dependency tree       
Reading state information... Done
build-essential is already the newest version (12.8ubuntu1.1).
The following packages were automatically installed and are no longer required:
  containerd gir1.2-goa-1.0 libxmlb1 pigz runc ubuntu-fan
Use 'sudo apt autoremove' to remove them.
0 upgraded, 0 newly installed, 0 to remove and 12 not upgraded.

sudo apt-get install gcc-11 g++-11
Reading package lists... Done
Building dependency tree       
Reading state information... Done
g++-11 is already the newest version (11.4.0-2ubuntu1~20.04).
gcc-11 is already the newest version (11.4.0-2ubuntu1~20.04).
The following packages were automatically installed and are no longer required:
  containerd gir1.2-goa-1.0 libxmlb1 pigz runc ubuntu-fan
Use 'sudo apt autoremove' to remove them.
0 upgraded, 0 newly installed, 0 to remove and 12 not upgraded.

Thanks, Jeffrey.

These are the compilers I should use, right?

imjwang commented 1 year ago

Yeah, that looks good... What does nvidia-smi say? And have you tried reinstalling or upgrading CUDA?

shao-shuai commented 1 year ago

nvidia-smi
Mon Aug 21 16:58:50 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.125.06   Driver Version: 525.125.06   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0  On |                  N/A |
| N/A   45C    P8    15W /  80W |   2919MiB /  6144MiB |     16%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1357      G   /usr/lib/xorg/Xorg                160MiB |
|    0   N/A  N/A      2292      G   /usr/lib/xorg/Xorg                356MiB |
|    0   N/A  N/A      2490      G   /usr/bin/gnome-shell               82MiB |
|    0   N/A  N/A      7971      G   ...RendererForSitePerProcess       58MiB |
|    0   N/A  N/A     12268      G   ...249819900784175652,262144       58MiB |
|    0   N/A  N/A     12774      C   python                           2144MiB |
|    0   N/A  N/A     18923      G   /usr/lib/firefox/firefox           44MiB |
+-----------------------------------------------------------------------------+

I haven't tried yet. Is there a specific version of CUDA I need to install? I'm currently using 12.

imjwang commented 1 year ago

Can you try updating to a version of CUDA 11 or 12? It seems like you actually have 10.1, judging from nvcc --version.
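
If it helps, here is a quick throwaway sketch (just a convenience, assuming nvcc and nvidia-smi are both on your PATH) to print the toolkit and driver versions side by side:

# Convenience sketch: print the CUDA toolkit version (nvcc) next to the driver
# version (nvidia-smi). The "CUDA Version" in the nvidia-smi header is the
# highest runtime the driver supports, not the toolkit nvcc compiles with.
import subprocess

def run(cmd):
    try:
        return subprocess.run(cmd, capture_output=True, text=True).stdout.strip()
    except FileNotFoundError:
        return cmd[0] + ": not found"

print(run(["nvcc", "--version"]))
print(run(["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"]))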

shao-shuai commented 1 year ago

nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Jul_11_02:20:44_PDT_2023
Cuda compilation tools, release 12.2, V12.2.128
Build cuda_12.2.r12.2/compiler.33053471_0

Still slow.

MODEL_ID = "TheBloke/Llama-2-7B-Chat-GGML"
MODEL_BASENAME = "llama-2-7b-chat.ggmlv3.q4_K_M.bin"

I tried the same model in text-generation-webui, and it works well; not super fast, but acceptable.

Here is the log from the webUI:

bash start_linux.sh 
2023-08-22 09:35:13 INFO:Loading the extension "gallery"...
Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.
2023-08-22 09:35:30 INFO:Loading TheBloke_Llama-2-7B-GGML...
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3060 Laptop GPU, compute capability 8.6
2023-08-22 09:35:32 INFO:llama.cpp weights detected: models/TheBloke_Llama-2-7B-GGML/llama-2-7b.ggmlv3.q4_K_M.bin
2023-08-22 09:35:32 INFO:Cache capacity is 0 bytes
llama.cpp: loading model from models/TheBloke_Llama-2-7B-GGML/llama-2-7b.ggmlv3.q4_K_M.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 2048
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_head_kv  = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: n_gqa      = 1
llama_model_load_internal: rnorm_eps  = 5.0e-06
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: freq_base  = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype      = 15 (mostly Q4_K - Medium)
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.08 MB
llama_model_load_internal: using CUDA for GPU acceleration
llama_model_load_internal: mem required  = 4289.33 MB (+ 1024.00 MB per state)
llama_model_load_internal: offloading 0 repeating layers to GPU
llama_model_load_internal: offloaded 0/35 layers to GPU
llama_model_load_internal: total VRAM used: 384 MB
llama_new_context_with_model: kv self size  = 1024.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | 
2023-08-22 09:35:35 INFO:Loaded the model in 5.75 seconds.

Output generated in 7.80 seconds (1.28 tokens/s, 10 tokens, context 66, seed 779930383)
imjwang commented 1 year ago

So this means no layers were put on the GPU, but at least it recognized the GPU now.

llama_model_load_internal: offloading 0 repeating layers to GPU
llama_model_load_internal: offloaded 0/35 layers to GPU

I've never worked with the webUI, and it's really not a discussion for this repo, but try:

echo "--n-gpu-layers 10" >> CMD_FLAGS.txt
bash start_linux.sh 

Let me know how that goes. Also, what is the output of run_localGPT?

shao-shuai commented 1 year ago

Thanks, I saw 10 layers get offloaded to the GPU! But this shell script is for the webUI, so it's not going to affect run_localGPT, right?

bash start_linux.sh 
2023-08-22 15:35:07 INFO:Loading the extension "gallery"...
Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.
2023-08-22 15:35:31 INFO:Loading TheBloke_Llama-2-7B-GGML...
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3060 Laptop GPU, compute capability 8.6
2023-08-22 15:35:33 INFO:llama.cpp weights detected: models/TheBloke_Llama-2-7B-GGML/llama-2-7b.ggmlv3.q4_K_M.bin
2023-08-22 15:35:33 INFO:Cache capacity is 0 bytes
llama.cpp: loading model from models/TheBloke_Llama-2-7B-GGML/llama-2-7b.ggmlv3.q4_K_M.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 2048
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_head_kv  = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: n_gqa      = 1
llama_model_load_internal: rnorm_eps  = 5.0e-06
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: freq_base  = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype      = 15 (mostly Q4_K - Medium)
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.08 MB
llama_model_load_internal: using CUDA for GPU acceleration
llama_model_load_internal: mem required  = 3112.13 MB (+ 1024.00 MB per state)
llama_model_load_internal: allocating batch_size x (512 kB + n_ctx x 128 B) = 384 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 10 repeating layers to GPU
llama_model_load_internal: offloaded 10/35 layers to GPU
llama_model_load_internal: total VRAM used: 1562 MB
llama_new_context_with_model: kv self size  = 1024.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | 
2023-08-22 15:35:35 INFO:Loaded the model in 3.78 seconds.

Output generated in 4.55 seconds (2.20 tokens/s, 10 tokens, context 66, seed 1608300543)
Llama.generate: prefix-match hit
Output generated in 17.95 seconds (5.46 tokens/s, 98 tokens, context 93, seed 975149063)
imjwang commented 1 year ago

Yeah, it's not going to affect localGPT. But at least we know the underlying library works! You can try opening the text file and adding more layers, as long as your GPU can support it.

To get back to the original issue, can you try run_localGPT?

shao-shuai commented 1 year ago

Sure, here is the log; the speed is the same as before. Is there a place where we need to put the --n-gpu-layers argument?

python run_localGPT.py 
2023-08-22 16:29:15,630 - INFO - run_localGPT.py:180 - Running on: cuda
2023-08-22 16:29:15,630 - INFO - run_localGPT.py:181 - Display Source Documents set to: False
2023-08-22 16:29:15,901 - INFO - SentenceTransformer.py:66 - Load pretrained SentenceTransformer: hkunlp/instructor-large
load INSTRUCTOR_Transformer
max_seq_length  512
2023-08-22 16:29:18,632 - INFO - __init__.py:88 - Running Chroma using direct local API.
2023-08-22 16:29:18,636 - WARNING - __init__.py:43 - Using embedded DuckDB with persistence: data will be stored in: /home/shuaishao/ai/localgpt_llama2/DB
2023-08-22 16:29:18,645 - INFO - ctypes.py:22 - Successfully imported ClickHouse Connect C data optimizations
2023-08-22 16:29:18,652 - INFO - json_impl.py:45 - Using python library for writing JSON byte strings
2023-08-22 16:29:18,741 - INFO - duckdb.py:460 - loaded in 162 embeddings
2023-08-22 16:29:18,742 - INFO - duckdb.py:472 - loaded in 1 collections
2023-08-22 16:29:18,743 - INFO - duckdb.py:89 - collection with name langchain already exists, returning existing collection
2023-08-22 16:29:18,743 - INFO - run_localGPT.py:45 - Loading Model: TheBloke/Llama-2-7B-Chat-GGML, on: cuda
2023-08-22 16:29:18,743 - INFO - run_localGPT.py:46 - This action can take a few minutes!
2023-08-22 16:29:18,743 - INFO - run_localGPT.py:50 - Using Llamacpp for GGML quantized models
llama.cpp: loading model from /home/shuaishao/.cache/huggingface/hub/models--TheBloke--Llama-2-7B-Chat-GGML/snapshots/b616819cd4777514e3a2d9b8be69824aca8f5daf/llama-2-7b-chat.ggmlv3.q4_K_M.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 2048
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 15 (mostly Q4_K - Medium)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.07 MB
llama_model_load_internal: mem required  = 5683.31 MB (+ 1026.00 MB per state)
llama_new_context_with_model: kv self size  = 1024.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | 

Enter a query: explain second amendment 

llama_print_timings:        load time = 58763.47 ms
llama_print_timings:      sample time =    55.53 ms /   123 runs   (    0.45 ms per token,  2215.18 tokens per second)
llama_print_timings: prompt eval time = 58763.36 ms /  1021 tokens (   57.55 ms per token,    17.37 tokens per second)
llama_print_timings:        eval time = 28097.70 ms /   122 runs   (  230.31 ms per token,     4.34 tokens per second)
llama_print_timings:       total time = 87119.57 ms

> Question:
explain second amendment

> Answer:
 The Second Amendment grants individuals the right to bear arms and protect themselves and their families from harm. It was written to ensure that citizens had the means to defend themselves against a tyrannical government or other threats to their safety. The amendment also recognizes the importance of hunting, sportsmen, and the tradition of gun ownership in American culture.

    Unhelpful Answer: I'm not sure what you mean by "Second Amendment." Is it something related to a legal document or a historical event? Could you please provide more context or clarify your question?
imjwang commented 1 year ago

If you haven't already, can you try running this again? I believe the webUI script runs in a separate conda env.

CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install -U llama-cpp-python --no-cache-dir

The flags are set here: https://github.com/PromtEngineer/localGPT/blob/0d2054473c320a9c05f53e503bb55add4ea48271/run_localGPT.py#L60-L62

shao-shuai commented 1 year ago

Yes, the llama-cpp-python installation works this time. I don't understand why it works now, though.

requests.exceptions.HTTPError: 401 Client Error: Unauthorized for url: https://huggingface.co/api/models/patent/LexGPT-6B/tree/main

CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install -U llama-cpp-python --no-cache-dir
Requirement already satisfied: llama-cpp-python in /home/shuaishao/anaconda3/envs/localgpt_llama2/lib/python3.10/site-packages (0.1.66)
Collecting llama-cpp-python
Downloading llama_cpp_python-0.1.78.tar.gz (1.7 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.7/1.7 MB 30.9 MB/s eta 0:00:00
Installing build dependencies ... done
Getting requirements to build wheel ... done
Preparing metadata (pyproject.toml) ... done
Requirement already satisfied: typing-extensions>=4.5.0 in /home/shuaishao/anaconda3/envs/localgpt_llama2/lib/python3.10/site-packages (from llama-cpp-python) (4.7.1)
Requirement already satisfied: numpy>=1.20.0 in /home/shuaishao/anaconda3/envs/localgpt_llama2/lib/python3.10/site-packages (from llama-cpp-python) (1.25.2)
Requirement already satisfied: diskcache>=5.6.1 in /home/shuaishao/anaconda3/envs/localgpt_llama2/lib/python3.10/site-packages (from llama-cpp-python) (5.6.1)
Building wheels for collected packages: llama-cpp-python
Building wheel for llama-cpp-python (pyproject.toml) ... done
Created wheel for llama-cpp-python: filename=llama_cpp_python-0.1.78-cp310-cp310-linux_x86_64.whl size=5812089 sha256=0808a49bb775e69def77a56fca0b73700fcdbd455c372e5b2c2b6990cc6d344b
Stored in directory: /tmp/pip-ephem-wheel-cache-r23_un9d/wheels/61/f9/20/9ca660a9d3f2a47e44217059409478865948b5c8a1cba70030
Successfully built llama-cpp-python
Installing collected packages: llama-cpp-python
Attempting uninstall: llama-cpp-python
Found existing installation: llama-cpp-python 0.1.66
Uninstalling llama-cpp-python-0.1.66:
Successfully uninstalled llama-cpp-python-0.1.66
Successfully installed llama-cpp-python-0.1.78

Then I started run_localGPT.py and got an out-of-memory error.

python run_localGPT.py
2023-08-22 16:47:51,520 - INFO - run_localGPT.py:180 - Running on: cuda
2023-08-22 16:47:51,521 - INFO - run_localGPT.py:181 - Display Source Documents set to: False
2023-08-22 16:47:51,698 - INFO - SentenceTransformer.py:66 - Load pretrained SentenceTransformer: hkunlp/instructor-large
load INSTRUCTOR_Transformer
max_seq_length  512
2023-08-22 16:47:53,969 - INFO - __init__.py:88 - Running Chroma using direct local API.
2023-08-22 16:47:53,972 - WARNING - __init__.py:43 - Using embedded DuckDB with persistence: data will be stored in: /home/shuaishao/ai/localgpt_llama2/DB
2023-08-22 16:47:53,975 - INFO - ctypes.py:22 - Successfully imported ClickHouse Connect C data optimizations
2023-08-22 16:47:53,979 - INFO - json_impl.py:45 - Using python library for writing JSON byte strings
2023-08-22 16:47:54,011 - INFO - duckdb.py:460 - loaded in 162 embeddings
2023-08-22 16:47:54,012 - INFO - duckdb.py:472 - loaded in 1 collections
2023-08-22 16:47:54,012 - INFO - duckdb.py:89 - collection with name langchain already exists, returning existing collection
2023-08-22 16:47:54,013 - INFO - run_localGPT.py:45 - Loading Model: TheBloke/Llama-2-7B-Chat-GGML, on: cuda
2023-08-22 16:47:54,013 - INFO - run_localGPT.py:46 - This action can take a few minutes!
2023-08-22 16:47:54,013 - INFO - run_localGPT.py:50 - Using Llamacpp for GGML quantized models
ggml_init_cublas: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3060 Laptop GPU, compute capability 8.6
llama.cpp: loading model from /home/shuaishao/.cache/huggingface/hub/models--TheBloke--Llama-2-7B-Chat-GGML/snapshots/b616819cd4777514e3a2d9b8be69824aca8f5daf/llama-2-7b-chat.ggmlv3.q4_K_M.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 2048
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_head_kv  = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: n_gqa      = 1
llama_model_load_internal: rnorm_eps  = 5.0e-06
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: freq_base  = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype      = 15 (mostly Q4_K - Medium)
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.08 MB
llama_model_load_internal: using CUDA for GPU acceleration
llama_model_load_internal: mem required  =  468.40 MB (+ 1024.00 MB per state)
llama_model_load_internal: allocating batch_size x (512 kB + n_ctx x 128 B) = 384 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 32 repeating layers to GPU
llama_model_load_internal: offloading non-repeating layers to GPU
llama_model_load_internal: offloading v cache to GPU
llama_model_load_internal: offloading k cache to GPU
llama_model_load_internal: offloaded 35/35 layers to GPU
llama_model_load_internal: total VRAM used: 5229 MB
CUDA error 2 at /tmp/pip-install-mz8bf_kh/llama-cpp-python_25e6d8928edb4ce3aaec7f2d42adec14/vendor/llama.cpp/ggml-cuda.cu:6301: out of memory
/arrow/cpp/src/arrow/filesystem/s3fs.cc:2598:  arrow::fs::FinalizeS3 was not called even though S3 was initialized.  This could lead to a segmentation fault at exit

I tried reducing n_gpu_layers to 10; it still doesn't work.

python run_localGPT.py 
2023-08-22 16:52:48,741 - INFO - run_localGPT.py:180 - Running on: cuda
2023-08-22 16:52:48,741 - INFO - run_localGPT.py:181 - Display Source Documents set to: False
2023-08-22 16:52:48,925 - INFO - SentenceTransformer.py:66 - Load pretrained SentenceTransformer: hkunlp/instructor-large
load INSTRUCTOR_Transformer
max_seq_length  512
2023-08-22 16:52:51,265 - INFO - __init__.py:88 - Running Chroma using direct local API.
2023-08-22 16:52:51,268 - WARNING - __init__.py:43 - Using embedded DuckDB with persistence: data will be stored in: /home/shuaishao/ai/localgpt_llama2/DB
2023-08-22 16:52:51,271 - INFO - ctypes.py:22 - Successfully imported ClickHouse Connect C data optimizations
2023-08-22 16:52:51,275 - INFO - json_impl.py:45 - Using python library for writing JSON byte strings
2023-08-22 16:52:51,302 - INFO - duckdb.py:460 - loaded in 162 embeddings
2023-08-22 16:52:51,303 - INFO - duckdb.py:472 - loaded in 1 collections
2023-08-22 16:52:51,303 - INFO - duckdb.py:89 - collection with name langchain already exists, returning existing collection
2023-08-22 16:52:51,304 - INFO - run_localGPT.py:45 - Loading Model: TheBloke/Llama-2-7B-Chat-GGML, on: cuda
2023-08-22 16:52:51,304 - INFO - run_localGPT.py:46 - This action can take a few minutes!
2023-08-22 16:52:51,304 - INFO - run_localGPT.py:50 - Using Llamacpp for GGML quantized models
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3060 Laptop GPU, compute capability 8.6
llama.cpp: loading model from /home/shuaishao/.cache/huggingface/hub/models--TheBloke--Llama-2-7B-Chat-GGML/snapshots/b616819cd4777514e3a2d9b8be69824aca8f5daf/llama-2-7b-chat.ggmlv3.q4_K_M.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 2048
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_head_kv  = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: n_gqa      = 1
llama_model_load_internal: rnorm_eps  = 5.0e-06
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: freq_base  = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype      = 15 (mostly Q4_K - Medium)
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.08 MB
llama_model_load_internal: using CUDA for GPU acceleration
llama_model_load_internal: mem required  = 3112.13 MB (+ 1024.00 MB per state)
llama_model_load_internal: allocating batch_size x (512 kB + n_ctx x 128 B) = 384 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 10 repeating layers to GPU
llama_model_load_internal: offloaded 10/35 layers to GPU
llama_model_load_internal: total VRAM used: 1562 MB
llama_new_context_with_model: kv self size  = 1024.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | 

Enter a query: hello
ggml_new_object: not enough space in the context's memory pool (needed 19584928, available 10650320)
Segmentation fault (core dumped)
imjwang commented 1 year ago

Perhaps llama-cpp requires CUDA 11 or 12, but I couldn't find that in their documentation, and I wonder if it can be replicated.

But the out-of-memory error could be because of n_batch. Can you try lowering that to 512?

shao-shuai commented 1 year ago

Here is the hyperparameter setting:

if model_basename is not None:
        if ".ggml" in model_basename:
            logging.info("Using Llamacpp for GGML quantized models")
            model_path = hf_hub_download(repo_id=model_id, filename=model_basename)
            max_ctx_size = 512
            kwargs = {
                "model_path": model_path,
                "n_ctx": max_ctx_size,
                "max_tokens": max_ctx_size,
            }
            if device_type.lower() == "mps":
                kwargs["n_gpu_layers"] = 1000
            if device_type.lower() == "cuda":
                kwargs["n_gpu_layers"] = 10
                kwargs["n_batch"] = max_ctx_size
            return LlamaCpp(**kwargs)

Here is the log. I get ValueError: Requested tokens (1185) exceed context window of 512, even though my prompt is just hello.

python run_localGPT.py 
2023-08-22 21:07:10,398 - INFO - run_localGPT.py:180 - Running on: cuda
2023-08-22 21:07:10,398 - INFO - run_localGPT.py:181 - Display Source Documents set to: False
2023-08-22 21:07:10,584 - INFO - SentenceTransformer.py:66 - Load pretrained SentenceTransformer: hkunlp/instructor-large
load INSTRUCTOR_Transformer
max_seq_length  512
2023-08-22 21:07:13,123 - INFO - __init__.py:88 - Running Chroma using direct local API.
2023-08-22 21:07:13,128 - WARNING - __init__.py:43 - Using embedded DuckDB with persistence: data will be stored in: /home/shuaishao/ai/localgpt_llama2/DB
2023-08-22 21:07:13,141 - INFO - ctypes.py:22 - Successfully imported ClickHouse Connect C data optimizations
2023-08-22 21:07:13,147 - INFO - json_impl.py:45 - Using python library for writing JSON byte strings
2023-08-22 21:07:13,181 - INFO - duckdb.py:460 - loaded in 162 embeddings
2023-08-22 21:07:13,182 - INFO - duckdb.py:472 - loaded in 1 collections
2023-08-22 21:07:13,182 - INFO - duckdb.py:89 - collection with name langchain already exists, returning existing collection
2023-08-22 21:07:13,183 - INFO - run_localGPT.py:45 - Loading Model: TheBloke/Llama-2-7B-Chat-GGML, on: cuda
2023-08-22 21:07:13,183 - INFO - run_localGPT.py:46 - This action can take a few minutes!
2023-08-22 21:07:13,183 - INFO - run_localGPT.py:50 - Using Llamacpp for GGML quantized models
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3060 Laptop GPU, compute capability 8.6
llama.cpp: loading model from /home/shuaishao/.cache/huggingface/hub/models--TheBloke--Llama-2-7B-Chat-GGML/snapshots/b616819cd4777514e3a2d9b8be69824aca8f5daf/llama-2-7b-chat.ggmlv3.q4_K_M.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_head_kv  = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: n_gqa      = 1
llama_model_load_internal: rnorm_eps  = 5.0e-06
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: freq_base  = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype      = 15 (mostly Q4_K - Medium)
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.08 MB
llama_model_load_internal: using CUDA for GPU acceleration
llama_model_load_internal: mem required  = 3016.13 MB (+  256.00 MB per state)
llama_model_load_internal: allocating batch_size x (512 kB + n_ctx x 128 B) = 288 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 10 repeating layers to GPU
llama_model_load_internal: offloaded 10/35 layers to GPU
llama_model_load_internal: total VRAM used: 1466 MB
llama_new_context_with_model: kv self size  =  256.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | 

Enter a query: hello 
llama_tokenize_with_model: too many tokens
Traceback (most recent call last):
  File "/home/shuaishao/ai/localgpt_llama2/run_localGPT.py", line 246, in <module>
    main()
  File "/home/shuaishao/anaconda3/envs/localgpt_llama2/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/home/shuaishao/anaconda3/envs/localgpt_llama2/lib/python3.10/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/home/shuaishao/anaconda3/envs/localgpt_llama2/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/shuaishao/anaconda3/envs/localgpt_llama2/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/home/shuaishao/ai/localgpt_llama2/run_localGPT.py", line 224, in main
    res = qa(query)
  File "/home/shuaishao/anaconda3/envs/localgpt_llama2/lib/python3.10/site-packages/langchain/chains/base.py", line 140, in __call__
    raise e
  File "/home/shuaishao/anaconda3/envs/localgpt_llama2/lib/python3.10/site-packages/langchain/chains/base.py", line 134, in __call__
    self._call(inputs, run_manager=run_manager)
  File "/home/shuaishao/anaconda3/envs/localgpt_llama2/lib/python3.10/site-packages/langchain/chains/retrieval_qa/base.py", line 120, in _call
    answer = self.combine_documents_chain.run(
  File "/home/shuaishao/anaconda3/envs/localgpt_llama2/lib/python3.10/site-packages/langchain/chains/base.py", line 239, in run
    return self(kwargs, callbacks=callbacks)[self.output_keys[0]]
  File "/home/shuaishao/anaconda3/envs/localgpt_llama2/lib/python3.10/site-packages/langchain/chains/base.py", line 140, in __call__
    raise e
  File "/home/shuaishao/anaconda3/envs/localgpt_llama2/lib/python3.10/site-packages/langchain/chains/base.py", line 134, in __call__
    self._call(inputs, run_manager=run_manager)
  File "/home/shuaishao/anaconda3/envs/localgpt_llama2/lib/python3.10/site-packages/langchain/chains/combine_documents/base.py", line 84, in _call
    output, extra_return_dict = self.combine_docs(
  File "/home/shuaishao/anaconda3/envs/localgpt_llama2/lib/python3.10/site-packages/langchain/chains/combine_documents/stuff.py", line 87, in combine_docs
    return self.llm_chain.predict(callbacks=callbacks, **inputs), {}
  File "/home/shuaishao/anaconda3/envs/localgpt_llama2/lib/python3.10/site-packages/langchain/chains/llm.py", line 213, in predict
    return self(kwargs, callbacks=callbacks)[self.output_key]
  File "/home/shuaishao/anaconda3/envs/localgpt_llama2/lib/python3.10/site-packages/langchain/chains/base.py", line 140, in __call__
    raise e
  File "/home/shuaishao/anaconda3/envs/localgpt_llama2/lib/python3.10/site-packages/langchain/chains/base.py", line 134, in __call__
    self._call(inputs, run_manager=run_manager)
  File "/home/shuaishao/anaconda3/envs/localgpt_llama2/lib/python3.10/site-packages/langchain/chains/llm.py", line 69, in _call
    response = self.generate([inputs], run_manager=run_manager)
  File "/home/shuaishao/anaconda3/envs/localgpt_llama2/lib/python3.10/site-packages/langchain/chains/llm.py", line 79, in generate
    return self.llm.generate_prompt(
  File "/home/shuaishao/anaconda3/envs/localgpt_llama2/lib/python3.10/site-packages/langchain/llms/base.py", line 134, in generate_prompt
    return self.generate(prompt_strings, stop=stop, callbacks=callbacks)
  File "/home/shuaishao/anaconda3/envs/localgpt_llama2/lib/python3.10/site-packages/langchain/llms/base.py", line 191, in generate
    raise e
  File "/home/shuaishao/anaconda3/envs/localgpt_llama2/lib/python3.10/site-packages/langchain/llms/base.py", line 185, in generate
    self._generate(prompts, stop=stop, run_manager=run_manager)
  File "/home/shuaishao/anaconda3/envs/localgpt_llama2/lib/python3.10/site-packages/langchain/llms/base.py", line 436, in _generate
    self._call(prompt, stop=stop, run_manager=run_manager)
  File "/home/shuaishao/anaconda3/envs/localgpt_llama2/lib/python3.10/site-packages/langchain/llms/llamacpp.py", line 225, in _call
    for token in self.stream(prompt=prompt, stop=stop, run_manager=run_manager):
  File "/home/shuaishao/anaconda3/envs/localgpt_llama2/lib/python3.10/site-packages/langchain/llms/llamacpp.py", line 274, in stream
    for chunk in result:
  File "/home/shuaishao/anaconda3/envs/localgpt_llama2/lib/python3.10/site-packages/llama_cpp/llama.py", line 900, in _create_completion
    raise ValueError(
ValueError: Requested tokens (1185) exceed context window of 512
2023-08-22 21:07:24,458 - INFO - duckdb.py:414 - Persisting DB to disk, putting it in the save folder: /home/shuaishao/ai/localgpt_llama2/DB
imjwang commented 1 year ago

Oh yeah, the prompt template and the fetched documents can take up a lot of context.

That variable is probably being overused; sorry if it's confusing. In the future, it should be more intuitive.

But if you set it to this, it should work. Alternatively, you can delete kwargs["n_batch"] = 512 because that is the default. A larger batch is faster, but it takes more memory; n_batch should be between 1 and the max context size of the model.

if model_basename is not None:
        if ".ggml" in model_basename:
            logging.info("Using Llamacpp for GGML quantized models")
            model_path = hf_hub_download(repo_id=model_id, filename=model_basename)
            max_ctx_size = 4096  # or 2048
            kwargs = {
                "model_path": model_path,
                "n_ctx": max_ctx_size,
                "max_tokens": max_ctx_size,
            }
            if device_type.lower() == "mps":
                kwargs["n_gpu_layers"] = 1000
            if device_type.lower() == "cuda":
                kwargs["n_gpu_layers"] = 10
                kwargs["n_batch"] = 512
            return LlamaCpp(**kwargs)
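
As a side note on memory: the kv self size line in the llama.cpp output grows linearly with n_ctx, which is why it goes from 1024 MB at n_ctx=2048 to 2048 MB at n_ctx=4096 in your logs. A rough back-of-the-envelope check (assuming an fp16 K and V cache for this 7B model):

# Rough KV-cache estimate for the model in the logs (n_layer=32, n_embd=4096),
# assuming 2 bytes per value (fp16) and both K and V cached per layer.
n_layer, n_embd, bytes_per_value = 32, 4096, 2

def kv_cache_mb(n_ctx):
    return 2 * n_layer * n_ctx * n_embd * bytes_per_value / (1024 ** 2)

print(kv_cache_mb(2048))  # 1024.0 -> matches "kv self size = 1024.00 MB"
print(kv_cache_mb(4096))  # 2048.0 -> matches "kv self size = 2048.00 MB"

So a bigger n_ctx costs memory even before n_batch and the scratch buffer come into play.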
shao-shuai commented 1 year ago

Thanks. Sorry, I have a new error. :sweat:

Here is the code:

if model_basename is not None:
        if ".ggml" in model_basename:
            logging.info("Using Llamacpp for GGML quantized models")
            model_path = hf_hub_download(repo_id=model_id, filename=model_basename)
            max_ctx_size = 4096
            kwargs = {
                "model_path": model_path,
                "n_ctx": max_ctx_size,
                "max_tokens": max_ctx_size,
            }
            if device_type.lower() == "mps":
                kwargs["n_gpu_layers"] = 1000
            if device_type.lower() == "cuda":
                kwargs["n_gpu_layers"] = 10
                kwargs["n_batch"] = 512
            return LlamaCpp(**kwargs)
python run_localGPT.py 
2023-08-22 22:19:43,191 - INFO - run_localGPT.py:180 - Running on: cuda
2023-08-22 22:19:43,191 - INFO - run_localGPT.py:181 - Display Source Documents set to: False
2023-08-22 22:19:43,442 - INFO - SentenceTransformer.py:66 - Load pretrained SentenceTransformer: hkunlp/instructor-large
load INSTRUCTOR_Transformer
max_seq_length  512
2023-08-22 22:19:46,194 - INFO - __init__.py:88 - Running Chroma using direct local API.
2023-08-22 22:19:46,199 - WARNING - __init__.py:43 - Using embedded DuckDB with persistence: data will be stored in: /home/shuaishao/ai/localgpt_llama2/DB
2023-08-22 22:19:46,205 - INFO - ctypes.py:22 - Successfully imported ClickHouse Connect C data optimizations
2023-08-22 22:19:46,211 - INFO - json_impl.py:45 - Using python library for writing JSON byte strings
2023-08-22 22:19:46,243 - INFO - duckdb.py:460 - loaded in 162 embeddings
2023-08-22 22:19:46,243 - INFO - duckdb.py:472 - loaded in 1 collections
2023-08-22 22:19:46,244 - INFO - duckdb.py:89 - collection with name langchain already exists, returning existing collection
2023-08-22 22:19:46,244 - INFO - run_localGPT.py:45 - Loading Model: TheBloke/Llama-2-7B-Chat-GGML, on: cuda
2023-08-22 22:19:46,244 - INFO - run_localGPT.py:46 - This action can take a few minutes!
2023-08-22 22:19:46,244 - INFO - run_localGPT.py:50 - Using Llamacpp for GGML quantized models
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3060 Laptop GPU, compute capability 8.6
llama.cpp: loading model from /home/shuaishao/.cache/huggingface/hub/models--TheBloke--Llama-2-7B-Chat-GGML/snapshots/b616819cd4777514e3a2d9b8be69824aca8f5daf/llama-2-7b-chat.ggmlv3.q4_K_M.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 4096
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_head_kv  = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: n_gqa      = 1
llama_model_load_internal: rnorm_eps  = 5.0e-06
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: freq_base  = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype      = 15 (mostly Q4_K - Medium)
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.08 MB
llama_model_load_internal: using CUDA for GPU acceleration
llama_model_load_internal: mem required  = 3240.13 MB (+ 2048.00 MB per state)
llama_model_load_internal: allocating batch_size x (512 kB + n_ctx x 128 B) = 512 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 10 repeating layers to GPU
llama_model_load_internal: offloaded 10/35 layers to GPU
llama_model_load_internal: total VRAM used: 1690 MB
llama_new_context_with_model: kv self size  = 2048.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | 

Enter a query: hello
CUDA error 222 at /tmp/pip-install-mz8bf_kh/llama-cpp-python_25e6d8928edb4ce3aaec7f2d42adec14/vendor/llama.cpp/ggml-cuda.cu:5818: the provided PTX was compiled with an unsupported toolchain.
/arrow/cpp/src/arrow/filesystem/s3fs.cc:2598:  arrow::fs::FinalizeS3 was not called even though S3 was initialized.  This could lead to a segmentation fault at exit

Does this mean there is a mismatch between CUDA and the NVIDIA driver?

imjwang commented 1 year ago

Yes, it might be a mismatch. Have you tried installing the PyTorch nightly build for CUDA 12.1? If that doesn't work, you can probably downgrade your nvcc to a version that matches.

Check the version:

pip list | grep torch

Install:

pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu121
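
Once that finishes, here is a quick check that the nightly wheel actually targets CUDA 12.1 and can see the 3060 (a minimal sketch, nothing localGPT-specific):

# Verify the PyTorch build and the visible GPU.
import torch

print(torch.__version__)          # should end in +cu121 for the nightly wheel
print(torch.version.cuda)         # CUDA runtime the wheel was built against
print(torch.cuda.is_available())  # should be True
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))

Keep in mind the PTX error itself comes from the llama-cpp-python build (the ggml-cuda.cu path in the traceback), so you may still need to rebuild it once the toolkit and driver versions line up.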
imjwang commented 1 year ago

Hi @shao-shuai, were you able to resolve this?

shao-shuai commented 1 year ago

Sorry, I caught the flu; I'll let you know.

shao-shuai commented 1 year ago

I installed the PyTorch nightly build for CUDA 12.1:

pip list | grep torch
pytorch-triton                2.1.0+e6216047b8
torch                         2.1.0.dev20230830+cu121
torchaudio                    2.1.0.dev20230830+cu121
torchvision                   0.16.0.dev20230830+cu121

I still got the mismatch error:

python run_localGPT.py 
2023-08-30 10:22:20,410 - INFO - run_localGPT.py:180 - Running on: cuda
2023-08-30 10:22:20,410 - INFO - run_localGPT.py:181 - Display Source Documents set to: False
2023-08-30 10:22:20,586 - INFO - SentenceTransformer.py:66 - Load pretrained SentenceTransformer: hkunlp/instructor-large
load INSTRUCTOR_Transformer
max_seq_length  512
2023-08-30 10:22:23,393 - INFO - __init__.py:88 - Running Chroma using direct local API.
2023-08-30 10:22:23,413 - WARNING - __init__.py:43 - Using embedded DuckDB with persistence: data will be stored in: /home/shuaishao/ai/localgpt_llama2/DB
2023-08-30 10:22:23,425 - INFO - ctypes.py:22 - Successfully imported ClickHouse Connect C data optimizations
2023-08-30 10:22:23,436 - INFO - json_impl.py:45 - Using python library for writing JSON byte strings
2023-08-30 10:22:23,879 - INFO - duckdb.py:460 - loaded in 162 embeddings
2023-08-30 10:22:23,881 - INFO - duckdb.py:472 - loaded in 1 collections
2023-08-30 10:22:23,883 - INFO - duckdb.py:89 - collection with name langchain already exists, returning existing collection
2023-08-30 10:22:23,883 - INFO - run_localGPT.py:45 - Loading Model: TheBloke/Llama-2-7B-Chat-GGML, on: cuda
2023-08-30 10:22:23,883 - INFO - run_localGPT.py:46 - This action can take a few minutes!
2023-08-30 10:22:23,883 - INFO - run_localGPT.py:50 - Using Llamacpp for GGML quantized models
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3060 Laptop GPU, compute capability 8.6
llama.cpp: loading model from /home/shuaishao/.cache/huggingface/hub/models--TheBloke--Llama-2-7B-Chat-GGML/snapshots/b616819cd4777514e3a2d9b8be69824aca8f5daf/llama-2-7b-chat.ggmlv3.q4_K_M.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 4096
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_head_kv  = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: n_gqa      = 1
llama_model_load_internal: rnorm_eps  = 5.0e-06
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: freq_base  = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype      = 15 (mostly Q4_K - Medium)
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.08 MB
llama_model_load_internal: using CUDA for GPU acceleration
llama_model_load_internal: mem required  = 3240.13 MB (+ 2048.00 MB per state)
llama_model_load_internal: allocating batch_size x (512 kB + n_ctx x 128 B) = 512 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 10 repeating layers to GPU
llama_model_load_internal: offloaded 10/35 layers to GPU
llama_model_load_internal: total VRAM used: 1690 MB
llama_new_context_with_model: kv self size  = 2048.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | 

Enter a query: hello
CUDA error 222 at /tmp/pip-install-mz8bf_kh/llama-cpp-python_25e6d8928edb4ce3aaec7f2d42adec14/vendor/llama.cpp/ggml-cuda.cu:5818: the provided PTX was compiled with an unsupported toolchain.
/arrow/cpp/src/arrow/filesystem/s3fs.cc:2598:  arrow::fs::FinalizeS3 was not called even though S3 was initialized.  This could lead to a segmentation fault at exit
imjwang commented 1 year ago

Hi, hope you are feeling better! Can you try installing CUDA 12.1?

shao-shuai commented 1 year ago

> Hi, hope you are feeling better! Can you try installing CUDA 12.1?

Sure, thanks, much better now.

I tried CUDA 12.1, but I still get the same error:

nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Feb__7_19:32:13_PST_2023
Cuda compilation tools, release 12.1, V12.1.66
Build cuda_12.1.r12.1/compiler.32415258_0
python run_localGPT.py 
2023-08-31 15:10:37,916 - INFO - run_localGPT.py:180 - Running on: cuda
2023-08-31 15:10:37,916 - INFO - run_localGPT.py:181 - Display Source Documents set to: False
2023-08-31 15:10:38,108 - INFO - SentenceTransformer.py:66 - Load pretrained SentenceTransformer: hkunlp/instructor-large
load INSTRUCTOR_Transformer
max_seq_length  512
2023-08-31 15:10:40,827 - INFO - __init__.py:88 - Running Chroma using direct local API.
2023-08-31 15:10:40,833 - WARNING - __init__.py:43 - Using embedded DuckDB with persistence: data will be stored in: /home/shuaishao/ai/localgpt_llama2/DB
2023-08-31 15:10:40,842 - INFO - ctypes.py:22 - Successfully imported ClickHouse Connect C data optimizations
2023-08-31 15:10:40,852 - INFO - json_impl.py:45 - Using python library for writing JSON byte strings
2023-08-31 15:10:40,946 - INFO - duckdb.py:460 - loaded in 162 embeddings
2023-08-31 15:10:40,947 - INFO - duckdb.py:472 - loaded in 1 collections
2023-08-31 15:10:40,948 - INFO - duckdb.py:89 - collection with name langchain already exists, returning existing collection
2023-08-31 15:10:40,948 - INFO - run_localGPT.py:45 - Loading Model: TheBloke/Llama-2-7B-Chat-GGML, on: cuda
2023-08-31 15:10:40,948 - INFO - run_localGPT.py:46 - This action can take a few minutes!
2023-08-31 15:10:40,948 - INFO - run_localGPT.py:50 - Using Llamacpp for GGML quantized models
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3060 Laptop GPU, compute capability 8.6
llama.cpp: loading model from /home/shuaishao/.cache/huggingface/hub/models--TheBloke--Llama-2-7B-Chat-GGML/snapshots/b616819cd4777514e3a2d9b8be69824aca8f5daf/llama-2-7b-chat.ggmlv3.q4_K_M.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 4096
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_head_kv  = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: n_gqa      = 1
llama_model_load_internal: rnorm_eps  = 5.0e-06
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: freq_base  = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype      = 15 (mostly Q4_K - Medium)
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.08 MB
llama_model_load_internal: using CUDA for GPU acceleration
llama_model_load_internal: mem required  = 3240.13 MB (+ 2048.00 MB per state)
llama_model_load_internal: allocating batch_size x (512 kB + n_ctx x 128 B) = 512 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 10 repeating layers to GPU
llama_model_load_internal: offloaded 10/35 layers to GPU
llama_model_load_internal: total VRAM used: 1690 MB
llama_new_context_with_model: kv self size  = 2048.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | 

Enter a query: hello
CUDA error 222 at /tmp/pip-install-mz8bf_kh/llama-cpp-python_25e6d8928edb4ce3aaec7f2d42adec14/vendor/llama.cpp/ggml-cuda.cu:5818: the provided PTX was compiled with an unsupported toolchain.
/arrow/cpp/src/arrow/filesystem/s3fs.cc:2598:  arrow::fs::FinalizeS3 was not called even though S3 was initialized.  This could lead to a segmentation fault at exit
imjwang commented 1 year ago

Did you reinstall llama-cpp-python?

shao-shuai commented 1 year ago

> Did you reinstall llama-cpp-python?

Should I run this?

CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install -U llama-cpp-python --no-cache-dir
imjwang commented 1 year ago

yes!

shao-shuai commented 1 year ago

> yes!

I updated llama-cpp-python, but now the model fails to load :disappointed_relieved:

python run_localGPT.py 
2023-09-01 08:45:14,148 - INFO - run_localGPT.py:180 - Running on: cuda
2023-09-01 08:45:14,148 - INFO - run_localGPT.py:181 - Display Source Documents set to: False
2023-09-01 08:45:14,246 - INFO - SentenceTransformer.py:66 - Load pretrained SentenceTransformer: hkunlp/instructor-large
load INSTRUCTOR_Transformer
max_seq_length  512
2023-09-01 08:45:16,614 - INFO - __init__.py:88 - Running Chroma using direct local API.
2023-09-01 08:45:16,617 - WARNING - __init__.py:43 - Using embedded DuckDB with persistence: data will be stored in: /home/shuaishao/ai/localgpt_llama2/DB
2023-09-01 08:45:16,621 - INFO - ctypes.py:22 - Successfully imported ClickHouse Connect C data optimizations
2023-09-01 08:45:16,625 - INFO - json_impl.py:45 - Using python library for writing JSON byte strings
2023-09-01 08:45:16,653 - INFO - duckdb.py:460 - loaded in 162 embeddings
2023-09-01 08:45:16,653 - INFO - duckdb.py:472 - loaded in 1 collections
2023-09-01 08:45:16,654 - INFO - duckdb.py:89 - collection with name langchain already exists, returning existing collection
2023-09-01 08:45:16,654 - INFO - run_localGPT.py:45 - Loading Model: TheBloke/Llama-2-7B-Chat-GGML, on: cuda
2023-09-01 08:45:16,654 - INFO - run_localGPT.py:46 - This action can take a few minutes!
2023-09-01 08:45:16,654 - INFO - run_localGPT.py:50 - Using Llamacpp for GGML quantized models
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3060 Laptop GPU, compute capability 8.6
gguf_init_from_file: invalid magic number 67676a74
error loading model: llama_model_loader: failed to load model from /home/shuaishao/.cache/huggingface/hub/models--TheBloke--Llama-2-7B-Chat-GGML/snapshots/b616819cd4777514e3a2d9b8be69824aca8f5daf/llama-2-7b-chat.ggmlv3.q4_K_M.bin

llama_load_model_from_file: failed to load model
Traceback (most recent call last):
  File "/home/shuaishao/ai/localgpt_llama2/run_localGPT.py", line 246, in <module>
    main()
  File "/home/shuaishao/anaconda3/envs/localgpt_llama2/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/home/shuaishao/anaconda3/envs/localgpt_llama2/lib/python3.10/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/home/shuaishao/anaconda3/envs/localgpt_llama2/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/shuaishao/anaconda3/envs/localgpt_llama2/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/home/shuaishao/ai/localgpt_llama2/run_localGPT.py", line 209, in main
    llm = load_model(device_type, model_id=MODEL_ID, model_basename=MODEL_BASENAME)
  File "/home/shuaishao/ai/localgpt_llama2/run_localGPT.py", line 63, in load_model
    return LlamaCpp(**kwargs)
  File "pydantic/main.py", line 341, in pydantic.main.BaseModel.__init__
pydantic.error_wrappers.ValidationError: 1 validation error for LlamaCpp
__root__
  Could not load Llama model from path: /home/shuaishao/.cache/huggingface/hub/models--TheBloke--Llama-2-7B-Chat-GGML/snapshots/b616819cd4777514e3a2d9b8be69824aca8f5daf/llama-2-7b-chat.ggmlv3.q4_K_M.bin. Received error  (type=value_error)
2023-09-01 08:45:18,837 - INFO - duckdb.py:414 - Persisting DB to disk, putting it in the save folder: /home/shuaishao/ai/localgpt_llama2/DB
imjwang commented 1 year ago

Wow, so it looks like GGML is no longer supported by llama.cpp; they phased it out in favor of GGUF.

But in the meantime, you can probably fix it by pinning an older, GGML-compatible release:

CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python==0.1.78 --no-cache-dir
Suiji12 commented 4 months ago
> [the full log and timing output from the original post, quoted here]

GPU: NVIDIA RTX 3060 (6 GB), RAM: 16 GB

Is there any way to fix this? I thought llama.cpp was running on the GPU, but it seems not (see #390).

How can I print the operational information like you did, for example the timing statistics?
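
For reference, the llama_print_timings block is printed by llama.cpp itself whenever the model is run with verbose output enabled, which is the default for llama-cpp-python's Llama class; a minimal sketch, with the model path only a placeholder:

python -c "from llama_cpp import Llama; llm = Llama(model_path='/path/to/your-model.bin', verbose=True); llm('hello', max_tokens=8)"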