LostRuins / koboldcpp

Run GGUF models easily with a KoboldAI UI. One File. Zero Install.
https://github.com/lostruins/koboldcpp
GNU Affero General Public License v3.0

Using GPU VRAM with useclblast and gpulayers causes much slower speed #248

Open · ZacharyHu0 opened this issue 1 year ago

ZacharyHu0 commented 1 year ago

Problem

When using the wizardlm-30b-uncensored.ggmlv3.q5_K_M.bin model from Hugging Face with koboldcpp, I unexpectedly found that adding --useclblast and --gpulayers results in much slower token output. I would greatly appreciate it if anyone could help explain or track down the glitch.

Platform

CPU: AMD Ryzen 7950X (16C/32T)
GPU: AMD Radeon RX 6800 (16GB VRAM)
MEM: 64GB (DDR5, 6200MHz, 2x32GB)
SYS: Windows 11 22621.1848, using PowerShell 7.3

Using the released binary koboldcpp-1.31.

Commands

without GPU:

.\koboldcpp.exe C:\Users\Hao\AppData\Local\nomic.ai\GPT4All\wizardlm-30b-uncensored.ggmlv3.q5_K_M.bin --stream --launch

with GPU:

.\koboldcpp.exe C:\Users\Hao\AppData\Local\nomic.ai\GPT4All\wizardlm-30b-uncensored.ggmlv3.q5_K_M.bin --stream --launch --useclblast 0 0 --gpulayers 43

Conversation

Both runs were given the same prompt: write a python function to plot a heart shape using matlibplot

Observation

without GPU:

Using ~20GB MEM
Time Taken - Processing:2.4s (108ms/T), Generation:116.7s (362ms/T), Total:119.1s (2.7T/s)

with GPU:

Using ~20GB MEM and ~15.6GB VRAM
Time Taken - Processing:12.6s (575ms/T), Generation:306.4s (952ms/T), Total:319.1s (1.0T/s)

Log

Note: the token-by-token generation output is omitted, and backslashes were manually added before the code fences in the log to avoid breaking the formatting.

without GPU:

(base) PS C:\Users\Hao\Downloads> .\koboldcpp.exe C:\Users\Hao\AppData\Local\nomic.ai\GPT4All\wizardlm-30b-uncensored.ggmlv3.q5_K_M.bin  --stream --launch
Welcome to KoboldCpp - Version 1.31
Attempting to use OpenBLAS library for faster prompt ingestion. A compatible libopenblas will be required.
Initializing dynamic library: koboldcpp_openblas.dll
==========
Loading model: C:\Users\Hao\AppData\Local\nomic.ai\GPT4All\wizardlm-30b-uncensored.ggmlv3.q5_K_M.bin
[Threads: 15, BlasThreads: 15, SmartContext: False]

---
Identified as LLAMA model: (ver 5)
Attempting to Load...
---
System Info: AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
llama.cpp: loading model from C:\Users\Hao\AppData\Local\nomic.ai\GPT4All\wizardlm-30b-uncensored.ggmlv3.q5_K_M.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32001
llama_model_load_internal: n_ctx      = 2048
llama_model_load_internal: n_embd     = 6656
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 52
llama_model_load_internal: n_layer    = 60
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 17 (mostly Q5_K - Medium)
llama_model_load_internal: n_ff       = 17920
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 30B
llama_model_load_internal: ggml ctx size =    0.13 MB
llama_model_load_internal: mem required  = 24255.89 MB (+ 3124.00 MB per state)
....................................................................................................
llama_init_from_file: kv self size  = 3120.00 MB
Load Model OK: True
Embedded Kobold Lite loaded.
Starting Kobold HTTP Server on port 5001
Please connect to custom endpoint at http://localhost:5001
127.0.0.1 - - [18/Jun/2023 11:06:20] "GET / HTTP/1.1" 302 -
Force redirect to streaming mode, as --stream is set.
127.0.0.1 - - [18/Jun/2023 11:06:21] "GET /?streaming=1 HTTP/1.1" 200 -
127.0.0.1 - - [18/Jun/2023 11:06:21] "GET /api/v1/model HTTP/1.1" 200 -
127.0.0.1 - - [18/Jun/2023 11:06:21] "GET /api/v1/info/version HTTP/1.1" 200 -
127.0.0.1 - - [18/Jun/2023 11:06:21] "GET /sw.js HTTP/1.1" 404 -
127.0.0.1 - - [18/Jun/2023 11:06:21] "GET /manifest.json HTTP/1.1" 404 -
127.0.0.1 - - [18/Jun/2023 11:06:22] "GET /api/extra/version HTTP/1.1" 200 -

Input: {"n": 1, "max_context_length": 2048, "max_length": 322, "rep_pen": 1.08, "temperature": 0.44, "top_p": 0.92, "top_k": 0, "top_a": 0, "typical": 1, "tfs": 1, "rep_pen_range": 256, "rep_pen_slope": 0.7, "sampler_order": [6, 0, 1, 2, 3, 4, 5], "prompt": "### Instruction:write a python function to plot a heart shape using matlibplot### Response:", "quiet": true, "stop_sequence": ["### Instruction:", "### Response:"]}

Processing Prompt (8 / 22 tokens)127.0.0.1 - - [18/Jun/2023 11:07:28] "POST /api/extra/generate/check HTTP/1.1" 200 -
Processing Prompt (22 / 22 tokens)127.0.0.1 - - [18/Jun/2023 11:07:29] "POST /api/extra/generate/check HTTP/1.1" 200 -

Generating…      (omitted)

Generating (322 / 322 tokens)
Time Taken - Processing:2.4s (108ms/T), Generation:116.7s (362ms/T), Total:119.1s (2.7T/s)
Output: Here's an example Python function that uses Matplotlib to plot a heart shape:
\```python
import matplotlib.pyplot as plt
def plot_heart():
    # Define the x and y coordinates of the heart shape
    x = [0, 1, 0.5, 1, 0]
    y = [0, 0.5, 0.75, 0.25, 0]

    # Plot the heart shape using Matplotlib
    plt.plot(x, y)
    plt.axis('off')
    plt.show()
\```
To use this function, simply call it with no arguments:
\```python
plot_heart()
\```
This will display a window containing the heart shape plot. You can adjust the x and y coordinates to change the shape of the heart, or modify the function to add additional features such as labels or colors.
Note that you may need to install the Matplotlib library before running this code. You can do this by running `pip install matplotlib` in your terminal or command prompt.
Also note that this function assumes that you are using Python 3.x. If you are using Python 2.x, you may need to modify the syntax slightly (e.g. use `raw_input()` instead of `input()`).
I hope this helps! Let me know if you have any further questions.
\```python
import matplotlib.pyplot as plt
def plot_heart():
    # Define
127.0.0.1 - - [18/Jun/2023 11:09:27] "POST /api/v1/generate/ HTTP/1.1" 200 -
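For reference, the x/y lists in the output above only trace a small polygon rather than a heart; a heart shape is usually drawn with the parametric heart curve, as in this brief comparison sketch (matplotlib and numpy assumed installed; not part of the original log):

```python
# Comparison sketch: the classic parametric heart curve, plotted with
# matplotlib, in contrast to the polygon produced by the model above.
import numpy as np
import matplotlib.pyplot as plt

def plot_heart():
    t = np.linspace(0, 2 * np.pi, 1000)
    x = 16 * np.sin(t) ** 3
    y = 13 * np.cos(t) - 5 * np.cos(2 * t) - 2 * np.cos(3 * t) - np.cos(4 * t)
    plt.plot(x, y, color="red")
    plt.axis("equal")
    plt.axis("off")
    plt.show()

plot_heart()
```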

with GPU:

(base) PS C:\Users\Hao\Downloads> .\koboldcpp.exe C:\Users\Hao\AppData\Local\nomic.ai\GPT4All\wizardlm-30b-uncensored.ggmlv3.q5_K_M.bin  --stream --launch --useclblast 0 0 --gpulayers 43
Welcome to KoboldCpp - Version 1.31
Attempting to use CLBlast library for faster prompt ingestion. A compatible clblast will be required.
Initializing dynamic library: koboldcpp_clblast.dll
==========
Loading model: C:\Users\Hao\AppData\Local\nomic.ai\GPT4All\wizardlm-30b-uncensored.ggmlv3.q5_K_M.bin
[Threads: 15, BlasThreads: 15, SmartContext: False]

---
Identified as LLAMA model: (ver 5)
Attempting to Load...
---
System Info: AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
llama.cpp: loading model from C:\Users\Hao\AppData\Local\nomic.ai\GPT4All\wizardlm-30b-uncensored.ggmlv3.q5_K_M.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32001
llama_model_load_internal: n_ctx      = 2048
llama_model_load_internal: n_embd     = 6656
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 52
llama_model_load_internal: n_layer    = 60
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 17 (mostly Q5_K - Medium)
llama_model_load_internal: n_ff       = 17920
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 30B
llama_model_load_internal: ggml ctx size =    0.13 MB

Platform:0 Device:0  - AMD Accelerated Parallel Processing with gfx1030
Platform:0 Device:1  - AMD Accelerated Parallel Processing with gfx1030

ggml_opencl: selecting platform: 'AMD Accelerated Parallel Processing'
ggml_opencl: selecting device: 'gfx1030'
ggml_opencl: device FP16 support: true
CL FP16 temporarily disabled pending further optimization.
llama_model_load_internal: using OpenCL for GPU acceleration
llama_model_load_internal: mem required  = 8755.06 MB (+ 3124.00 MB per state)
llama_model_load_internal: offloading 43 repeating layers to GPU
llama_model_load_internal: offloaded 43/63 layers to GPU
llama_model_load_internal: total VRAM used: 15501 MB
....................................................................................................
llama_init_from_file: kv self size  = 3120.00 MB
Load Model OK: True
Embedded Kobold Lite loaded.
Starting Kobold HTTP Server on port 5001
Please connect to custom endpoint at http://localhost:5001
127.0.0.1 - - [18/Jun/2023 10:58:53] "GET / HTTP/1.1" 302 -
Force redirect to streaming mode, as --stream is set.
127.0.0.1 - - [18/Jun/2023 10:58:54] "GET /?streaming=1 HTTP/1.1" 200 -
127.0.0.1 - - [18/Jun/2023 10:58:54] "GET /api/v1/model HTTP/1.1" 200 -
127.0.0.1 - - [18/Jun/2023 10:58:54] "GET /api/v1/info/version HTTP/1.1" 200 -
127.0.0.1 - - [18/Jun/2023 10:58:54] "GET /sw.js HTTP/1.1" 404 -
127.0.0.1 - - [18/Jun/2023 10:58:54] "GET /manifest.json HTTP/1.1" 404 -
127.0.0.1 - - [18/Jun/2023 10:58:55] "GET /api/extra/version HTTP/1.1" 200 -

Input: {"n": 1, "max_context_length": 2048, "max_length": 322, "rep_pen": 1.08, "temperature": 0.44, "top_p": 0.92, "top_k": 0, "top_a": 0, "typical": 1, "tfs": 1, "rep_pen_range": 256, "rep_pen_slope": 0.7, "sampler_order": [6, 0, 1, 2, 3, 4, 5], "prompt": "### Instruction:write a python function to plot a heart shape using matlibplot### Response:", "quiet": true, "stop_sequence": ["### Instruction:", "### Response:"]}

Processing Prompt (8 / 22 tokens)127.0.0.1 - - [18/Jun/2023 10:59:26] "POST /api/extra/generate/check HTTP/1.1" 200 -
127.0.0.1 - - [18/Jun/2023 10:59:27] "POST /api/extra/generate/check HTTP/1.1" 200 -
127.0.0.1 - - [18/Jun/2023 10:59:28] "POST /api/extra/generate/check HTTP/1.1" 200 -
127.0.0.1 - - [18/Jun/2023 10:59:29] "POST /api/extra/generate/check HTTP/1.1" 200 -
Processing Prompt (16 / 22 tokens)127.0.0.1 - - [18/Jun/2023 10:59:30] "POST /api/extra/generate/check HTTP/1.1" 200 -
127.0.0.1 - - [18/Jun/2023 10:59:31] "POST /api/extra/generate/check HTTP/1.1" 200 -
127.0.0.1 - - [18/Jun/2023 10:59:32] "POST /api/extra/generate/check HTTP/1.1" 200 -
127.0.0.1 - - [18/Jun/2023 10:59:33] "POST /api/extra/generate/check HTTP/1.1" 200 -
Processing Prompt (22 / 22 tokens)127.0.0.1 - - [18/Jun/2023 10:59:34] "POST /api/extra/generate/check HTTP/1.1" 200 -
127.0.0.1 - - [18/Jun/2023 10:59:35] "POST /api/extra/generate/check HTTP/1.1" 200 -
127.0.0.1 - - [18/Jun/2023 10:59:36] "POST /api/extra/generate/check HTTP/1.1" 200 -
127.0.0.1 - - [18/Jun/2023 10:59:37] "POST /api/extra/generate/check HTTP/1.1" 200 -

Generating…… (omitted)

Generating (322 / 322 tokens)
Time Taken - Processing:12.6s (575ms/T), Generation:306.4s (952ms/T), Total:319.1s (1.0T/s)
Output: Here's an example Python function that uses Matplotlib to plot a heart shape:
\```python
import matplotlib.pyplot as plt
def plot_heart():
    # Define the x and y coordinates of the heart shape
    x = [0, 1, 0.5, 1, 0.5, 0]
    y = [0, 0.5, 0.75, 1, 0.75, 0]

    # Plot the heart shape
    plt.plot(x, y)

    # Add labels and title
    plt.xlabel('X')
    plt.ylabel('Y')
    plt.title('Heart Shape')

    # Show the plot
    plt.show()
\```
You can call this function in your main program to plot the heart shape:
\```python
plot_heart()
\```
This will display a window with the heart shape plot. You can adjust the x and y coordinates to change the shape of the heart. Additionally, you can modify the labels and title to customize the plot. Finally, you can add additional features to the plot using other Matplotlib functions.
Note: To use this function, you need to have Matplotlib installed and import it at the beginning of your Python script. You can install Matplotlib using pip or conda.
\```
pip install matplotlib
\```
or
\```
conda install matplotlib
\```
depending on whether you
127.0.0.1 - - [18/Jun/2023 11:04:44] "POST /api/v1/generate/ HTTP/1.1" 200 -
127.0.0.1 - - [18/Jun/2023 11:04:44] "POST /api/extra/generate/check HTTP/1.1" 200 -
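The Input: lines above show the exact JSON payload Kobold Lite sends to the server. For benchmarking it can be simpler to post a request directly to the API and time it client-side; a minimal sketch, assuming the server is already running on the default port 5001 as in these logs and that only a subset of the sampler fields is needed (the response field access follows the standard KoboldAI API shape, {"results": [{"text": ...}]}):

```python
# Sketch: send the same kind of request shown in the "Input:" log line
# straight to the koboldcpp API and time it from the client side.
import time
import requests

payload = {
    "max_context_length": 2048,
    "max_length": 322,
    "temperature": 0.44,
    "top_p": 0.92,
    "rep_pen": 1.08,
    "prompt": "### Instruction:write a python function to plot a heart shape "
              "using matlibplot### Response:",
    "stop_sequence": ["### Instruction:", "### Response:"],
}

start = time.time()
r = requests.post("http://localhost:5001/api/v1/generate", json=payload, timeout=600)
elapsed = time.time() - start
text = r.json()["results"][0]["text"]

# Rough client-side throughput only; the server console still prints the
# precise "Time Taken - Processing/Generation" breakdown.
print(f"{elapsed:.1f}s wall clock, ~{payload['max_length'] / elapsed:.1f} T/s")
print(text)
```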
SchezoWegey commented 1 year ago

I have the same issue, though I found that the CUDA-only version is the fastest overall. With CLBlast, 12 threads and gpulayers 14 seems to be the fastest for me, but anything higher than that runs way slower.
GPU: RTX 3070 Ti (8GB)
CPU: Ryzen 5600 (6C/12T)
RAM: 4x 8GB DDR4-3200MHz

Nothing is overclocked.

ZacharyHu0 commented 1 year ago

I did several further benchmark tests. Here are my results for the same task, with the 30B model, running on 15 cores:

All 63 layers on CPU: Time Taken - Processing:162.4s (108ms/T), Generation:116.7s (362ms/T), Total:279.1s (2.2T/s)
Offloaded 2/63 layers to GPU: Time Taken - Processing:56.3s (37ms/T), Generation:170.1s (408ms/T), Total:226.4s (1.8T/s) (using ~5GB VRAM)
Offloaded 8/63 layers to GPU: Time Taken - Processing:62.7s (41ms/T), Generation:169.7s (407ms/T), Total:232.4s (1.8T/s) (using ~7GB VRAM)
Offloaded 14/63 layers to GPU: Time Taken - Processing:60.5s (39ms/T), Generation:165.1s (396ms/T), Total:225.5s (1.8T/s) (using ~9GB VRAM)
Offloaded 22/63 layers to GPU: Time Taken - Processing:59.8s (39ms/T), Generation:157.0s (377ms/T), Total:216.8s (1.9T/s) (using ~12GB VRAM)

CPU and MEM usage don't seem to be affected, and there was plenty of unused MEM (20GB+) during the tests.

From these numbers, I think offloading enough GPU layers can accelerate Processing but slows down Generation. More GPU layers could also speed up the Generation step, but that may require far more layers and VRAM than most GPUs can handle and offer (maybe 60+ layers?). My guess is that the GPU-CPU cooperation or conversion during the Processing part costs too much time, while the offloaded layers don't really help in the Generation part.
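Comparing runs like these is easier once the "Time Taken" lines are turned into numbers; a throwaway helper for that (the line format is copied from the logs above, everything else is just illustration):

```python
# Sketch: pull the numbers out of koboldcpp "Time Taken" log lines so
# different gpulayers settings can be compared side by side.
import re

TIME_RE = re.compile(
    r"Processing:([\d.]+)s \((\d+)ms/T\), "
    r"Generation:([\d.]+)s \((\d+)ms/T\), "
    r"Total:([\d.]+)s \(([\d.]+)T/s\)"
)

def parse_time_taken(line):
    m = TIME_RE.search(line)
    if not m:
        return None
    keys = ("proc_s", "proc_ms_per_t", "gen_s", "gen_ms_per_t", "total_s", "t_per_s")
    return dict(zip(keys, map(float, m.groups())))

line = ("Time Taken - Processing:56.3s (37ms/T), "
        "Generation:170.1s (408ms/T), Total:226.4s (1.8T/s)")
print(parse_time_taken(line))
# {'proc_s': 56.3, 'proc_ms_per_t': 37.0, 'gen_s': 170.1, ...}
```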

LostRuins commented 1 year ago

@ZacharyHu0 you may be using the wrong GPU: since you have two GPUs, it looks like it used the Ryzen 7950X's integrated graphics instead of the RX 6800. Can you try replacing --useclblast 0 0 with --useclblast 0 1 and see if there is any difference?

Also try messing around with the number of layers offloaded; reduce it a bit if it doesn't fit.

Kaplas80 commented 1 year ago

I have the same problem. I have an RX 6700 XT, and offloading only part of the layers to the GPU gives slower processing times.

For 13B models, I can offload all the layers to the GPU and it is fast in both processing and generation... but for 30B models that don't fully fit in VRAM, I get the best times using CLBlast with 0 layers offloaded.

It happens on both Windows 10 and Linux.

My setup:
CPU: AMD Ryzen 5700G (8C/16T, iGPU disabled in BIOS)
GPU: AMD Radeon RX 6700 XT (12GB VRAM)
MEM: 80GB (DDR4, 3200MHz, 2x32GB + 2x8GB)

Edit: I've been doing some tests with different settings. These are the results:


Testing environment: Debian 12

Instruction mode with default Kobold Lite settings (generate 80 tokens).

Prompt (73 tokens)

Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
Write an email to my electric company asking for a discount on my tariff. Be polite.
### Response:

Model: wizardLM-13B-Uncensored.ggmlv3.q6_K.bin

No BLAS:

Command: python3 koboldcpp.py --threads 8 --noblas --model /media/kaplas/NVME/koboldcpp/wizardLM-13B-Uncensored.ggmlv3.q6_K.bin
Run 1: Time Taken - Processing:5.5s (75ms/T), Generation:21.2s (265ms/T), Total:26.7s (3.0T/s)
Run 2: Time Taken - Processing:5.5s (76ms/T), Generation:21.4s (268ms/T), Total:27.0s (3.0T/s)

OpenBLAS:

Command: python3 koboldcpp.py --threads 8 --model /media/kaplas/NVME/koboldcpp/wizardLM-13B-Uncensored.ggmlv3.q6_K.bin
Run 1: Time Taken - Processing:31.8s (435ms/T), Generation:21.3s (267ms/T), Total:53.1s (1.5T/s)
Run 2: Time Taken - Processing:31.8s (435ms/T), Generation:21.2s (266ms/T), Total:53.0s (1.5T/s)

CLBlast (0/43 GPU layers):

Command: python3 koboldcpp.py --threads 8 --useclblast 0 0 --model /media/kaplas/NVME/koboldcpp/wizardLM-13B-Uncensored.ggmlv3.q6_K.bin
Run 1: Time Taken - Processing:6.9s (95ms/T), Generation:21.3s (266ms/T), Total:28.2s (2.8T/s)
Run 2: Time Taken - Processing:6.9s (95ms/T), Generation:21.2s (265ms/T), Total:28.1s (2.9T/s)

CLBlast (20/43 GPU layers - total VRAM used: 4964 MB):

Command: python3 koboldcpp.py --threads 8 --useclblast 0 0 --gpulayers 20 --model /media/kaplas/NVME/koboldcpp/wizardLM-13B-Uncensored.ggmlv3.q6_K.bin
Run 1: Time Taken - Processing:5.6s (76ms/T), Generation:13.9s (174ms/T), Total:19.5s (4.1T/s)
Run 2: Time Taken - Processing:5.6s (76ms/T), Generation:13.9s (174ms/T), Total:19.4s (4.1T/s)

CLBlast (43/43 GPU layers - total VRAM used: 11536 MB):

Command: python3 koboldcpp.py --threads 8 --useclblast 0 0 --gpulayers 43 --model /media/kaplas/NVME/koboldcpp/wizardLM-13B-Uncensored.ggmlv3.q6_K.bin
Run 1: Time Taken - Processing:4.6s (63ms/T), Generation:6.1s (77ms/T), Total:10.8s (7.4T/s)
Run 2: Time Taken - Processing:4.6s (63ms/T), Generation:6.2s (78ms/T), Total:10.8s (7.4T/s)

Model: wizardlm-30b-uncensored.ggmlv3.q4_K_M.bin

No BLAS:

Command: python3 koboldcpp.py --threads 8 --noblas --model /media/kaplas/NVME/koboldcpp/wizardlm-30b-uncensored.ggmlv3.q4_K_M.bin
Run 1: Time Taken - Processing:11.0s (151ms/T), Generation:38.8s (485ms/T), Total:49.8s (1.6T/s)
Run 2: Time Taken - Processing:11.1s (152ms/T), Generation:39.0s (488ms/T), Total:50.1s (1.6T/s)

OpenBLAS:

Command: python3 koboldcpp.py --threads 8 --model /media/kaplas/NVME/koboldcpp/wizardlm-30b-uncensored.ggmlv3.q4_K_M.bin
Run 1: Time Taken - Processing:36.7s (503ms/T), Generation:38.8s (485ms/T), Total:75.5s (1.1T/s)
Run 2: Time Taken - Processing:36.7s (502ms/T), Generation:38.7s (484ms/T), Total:75.4s (1.1T/s)

CLBlast (0/63 GPU layers):

Command: python3 koboldcpp.py --threads 8 --useclblast 0 0 --model /media/kaplas/NVME/koboldcpp/wizardlm-30b-uncensored.ggmlv3.q4_K_M.bin
Run 1: Time Taken - Processing:13.7s (188ms/T), Generation:38.7s (483ms/T), Total:52.4s (1.5T/s)
Run 2: Time Taken - Processing:13.7s (188ms/T), Generation:38.7s (484ms/T), Total:52.4s (1.5T/s)

CLBlast (10/63 GPU layers - total VRAM used: 3233 MB):

Command: python3 koboldcpp.py --threads 8 --useclblast 0 0 --gpulayers 10 --model /media/kaplas/NVME/koboldcpp/wizardlm-30b-uncensored.ggmlv3.q4_K_M.bin
Run 1: Time Taken - Processing:12.6s (173ms/T), Generation:34.8s (435ms/T), Total:47.4s (1.7T/s)
Run 2: Time Taken - Processing:12.6s (173ms/T), Generation:34.7s (434ms/T), Total:47.3s (1.7T/s)

CLBlast (20/63 GPU layers - total VRAM used: 6224 MB):

Command: python3 koboldcpp.py --threads 8 --useclblast 0 0 --gpulayers 20 --model /media/kaplas/NVME/koboldcpp/wizardlm-30b-uncensored.ggmlv3.q4_K_M.bin
Run 1: Time Taken - Processing:12.0s (165ms/T), Generation:31.5s (394ms/T), Total:43.6s (1.8T/s)
Run 2: Time Taken - Processing:12.0s (165ms/T), Generation:31.5s (394ms/T), Total:43.6s (1.8T/s)

CLBlast (39/63 GPU layers - total VRAM used: 11960 MB):

Command: python3 koboldcpp.py --threads 8 --useclblast 0 0 --gpulayers 39 --model /media/kaplas/NVME/koboldcpp/wizardlm-30b-uncensored.ggmlv3.q4_K_M.bin
Run 1: Time Taken - Processing:57.2s (784ms/T), Generation:25.8s (323ms/T), Total:83.0s (1.0T/s)
Run 2: Time Taken - Processing:57.6s (789ms/T), Generation:25.8s (322ms/T), Total:83.4s (1.0T/s)


I ran each test twice to make sure the results were consistent, and as you can see, the processing time for the 30B model is much worse with 39 layers offloaded to the GPU.
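The sweep above was run by hand; it can also be scripted by restarting koboldcpp with a different --gpulayers value each time and sending a single request to the server. A rough sketch under the same flags and paths used above (the readiness polling, the request payload, and the wall-clock timing are additions for illustration, not koboldcpp features):

```python
# Sketch: benchmark several --gpulayers values by restarting the server and
# timing one 80-token generation per setting. Flags/paths mirror the commands above.
import subprocess
import time
import requests

MODEL = "/media/kaplas/NVME/koboldcpp/wizardlm-30b-uncensored.ggmlv3.q4_K_M.bin"
URL = "http://localhost:5001"
PROMPT = ("Below is an instruction that describes a task. Write a response that "
          "appropriately completes the request.\n### Instruction:\nWrite an email to "
          "my electric company asking for a discount on my tariff. Be polite.\n"
          "### Response:\n")

for layers in (0, 10, 20, 30, 39):
    proc = subprocess.Popen(
        ["python3", "koboldcpp.py", "--threads", "8", "--useclblast", "0", "0",
         "--gpulayers", str(layers), "--model", MODEL])
    # Poll until the HTTP server answers (endpoint taken from the logs above);
    # loading a 30B model can take a while.
    while True:
        try:
            requests.get(f"{URL}/api/v1/model", timeout=2)
            break
        except requests.exceptions.RequestException:
            time.sleep(2)
    start = time.time()
    requests.post(f"{URL}/api/v1/generate",
                  json={"prompt": PROMPT, "max_length": 80}, timeout=600)
    print(f"gpulayers={layers}: {time.time() - start:.1f}s wall clock")
    proc.terminate()
    proc.wait()
```

The server console still prints the Processing/Generation split for each request, which is the more precise number to compare.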

Kaplas80 commented 1 year ago

I've tested other values, and it seems that 33 layers is the optimal value for my GPU with this model. At higher values, processing times worsen.

CLBlast (30/63 GPU layers - total VRAM used: 9256 MB):

Command: python3 koboldcpp.py --threads 8 --useclblast 0 0 --gpulayers 30 --model /media/kaplas/NVME/koboldcpp/wizardlm-30b-uncensored.ggmlv3.q4_K_M.bin
Run 1: Time Taken - Processing:11.5s (157ms/T), Generation:27.6s (345ms/T), Total:39.0s (2.0T/s)
Run 2: Time Taken - Processing:11.4s (157ms/T), Generation:27.4s (343ms/T), Total:38.9s (2.1T/s)

CLBlast (31/63 GPU layers - total VRAM used: 9543 MB):

Command: python3 koboldcpp.py --threads 8 --useclblast 0 0 --gpulayers 31 --model /media/kaplas/NVME/koboldcpp/wizardlm-30b-uncensored.ggmlv3.q4_K_M.bin
Run 1: Time Taken - Processing:11.6s (158ms/T), Generation:27.2s (340ms/T), Total:38.7s (2.1T/s)
Run 2: Time Taken - Processing:11.6s (158ms/T), Generation:27.2s (340ms/T), Total:38.8s (2.1T/s)

CLBlast (32/63 GPU layers - total VRAM used: 9830 MB):

Command: python3 koboldcpp.py --threads 8 --useclblast 0 0 --gpulayers 32 --model /media/kaplas/NVME/koboldcpp/wizardlm-30b-uncensored.ggmlv3.q4_K_M.bin
Run 1: Time Taken - Processing:11.5s (158ms/T), Generation:26.9s (337ms/T), Total:38.4s (2.1T/s)
Run 2: Time Taken - Processing:11.6s (159ms/T), Generation:26.9s (336ms/T), Total:38.5s (2.1T/s)

CLBlast (33/63 GPU layers - total VRAM used: 10157 MB):

Command: python3 koboldcpp.py --threads 8 --useclblast 0 0 --gpulayers 33 --model /media/kaplas/NVME/koboldcpp/wizardlm-30b-uncensored.ggmlv3.q4_K_M.bin
Run 1: Time Taken - Processing:11.5s (158ms/T), Generation:26.5s (331ms/T), Total:38.0s (2.1T/s)
Run 2: Time Taken - Processing:11.6s (159ms/T), Generation:26.5s (331ms/T), Total:38.1s (2.1T/s)

CLBlast (34/63 GPU layers - total VRAM used: 10444 MB):

Command: python3 koboldcpp.py --threads 8 --useclblast 0 0 --gpulayers 34 --model /media/kaplas/NVME/koboldcpp/wizardlm-30b-uncensored.ggmlv3.q4_K_M.bin
Run 1: Time Taken - Processing:13.4s (183ms/T), Generation:26.3s (328ms/T), Total:39.6s (2.0T/s)
Run 2: Time Taken - Processing:13.0s (178ms/T), Generation:26.2s (328ms/T), Total:39.3s (2.0T/s)

CLBlast (35/63 GPU layers - total VRAM used: 10732 MB):

Command: python3 koboldcpp.py --threads 8 --useclblast 0 0 --gpulayers 35 --model /media/kaplas/NVME/koboldcpp/wizardlm-30b-uncensored.ggmlv3.q4_K_M.bin
Run 1: Time Taken - Processing:15.1s (207ms/T), Generation:25.9s (324ms/T), Total:41.0s (2.0T/s)
Run 2: Time Taken - Processing:14.1s (193ms/T), Generation:26.0s (325ms/T), Total:40.0s (2.0T/s)

Tomorrow, I'll try with other quantizations.
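Plotting the run-1 numbers from the tables above makes the knee easier to see; a quick matplotlib sketch with the values copied from these comments:

```python
# Sketch: plot the reported run-1 timings to see where prompt processing
# starts to blow up as more layers are offloaded (wizardlm-30b q4_K_M, RX 6700 XT).
import matplotlib.pyplot as plt

layers     = [0, 10, 20, 30, 31, 32, 33, 34, 35, 39]
processing = [13.7, 12.6, 12.0, 11.5, 11.6, 11.5, 11.5, 13.4, 15.1, 57.2]
generation = [38.7, 34.8, 31.5, 27.6, 27.2, 26.9, 26.5, 26.3, 25.9, 25.8]

plt.plot(layers, processing, marker="o", label="Processing (s)")
plt.plot(layers, generation, marker="o", label="Generation (s)")
plt.xlabel("--gpulayers")
plt.ylabel("seconds (run 1)")
plt.title("wizardlm-30b q4_K_M, RX 6700 XT (CLBlast)")
plt.legend()
plt.show()
```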

ZacharyHu0 commented 1 year ago

@LostRuins Thanks for your help! Actually, I disabled the iGPU in the BIOS (UEFI), so it's another mystery why I see two gfx1030 devices, since the iGPU would be identified as gfx1036:

Platform:0 Device:0  - AMD Accelerated Parallel Processing with gfx1030
Platform:0 Device:1  - AMD Accelerated Parallel Processing with gfx1030


Anyway, I tried your advice and here are the results. All tests ran with the auto params [Threads: 15, BlasThreads: 15, SmartContext: False]. They simulate a long conversation (1536-token prompt) with a medium response (417 / 512 tokens).

CPU (OpenBLAS, run twice as the baseline):
Command: C:\Users\Hao\Downloads\koboldcpp.exe C:\Users\Hao\AppData\Local\nomic.ai\GPT4All\wizardlm-30b-uncensored.ggmlv3.q5_K_M.bin --stream --launch
Time Taken - Processing:139.6s (91ms/T), Generation:168.5s (404ms/T), Total:308.1s (1.4T/s)
Time Taken - Processing:139.2s (91ms/T), Generation:169.7s (407ms/T), Total:309.0s (1.3T/s)

GPU 0 0 (CLBlast):
Command: C:\Users\Hao\Downloads\koboldcpp.exe C:\Users\Hao\AppData\Local\nomic.ai\GPT4All\wizardlm-30b-uncensored.ggmlv3.q5_K_M.bin --stream --launch --useclblast 0 1 --gpulayers 10
Time Taken - Processing:55.9s (36ms/T), Generation:159.9s (384ms/T), Total:215.9s (1.9T/s)

Command: C:\Users\Hao\Downloads\koboldcpp.exe C:\Users\Hao\AppData\Local\nomic.ai\GPT4All\wizardlm-30b-uncensored.ggmlv3.q5_K_M.bin --stream --launch --useclblast 0 0 --gpulayers 30
Time Taken - Processing:55.5s (36ms/T), Generation:169.4s (406ms/T), Total:224.9s (1.9T/s)

GPU 0 1 (CLBlast):
Command: C:\Users\Hao\Downloads\koboldcpp.exe C:\Users\Hao\AppData\Local\nomic.ai\GPT4All\wizardlm-30b-uncensored.ggmlv3.q5_K_M.bin --stream --launch --useclblast 0 0 --gpulayers 10
Time Taken - Processing:54.5s (35ms/T), Generation:160.4s (385ms/T), Total:214.9s (1.9T/s)

Command: C:\Users\Hao\Downloads\koboldcpp.exe C:\Users\Hao\AppData\Local\nomic.ai\GPT4All\wizardlm-30b-uncensored.ggmlv3.q5_K_M.bin --stream --launch --useclblast 0 1 --gpulayers 30
Time Taken - Processing:55.5s (36ms/T), Generation:173.5s (416ms/T), Total:229.0s (1.8T/s)

Both gfx1030 devices show the same performance, so they might be two views of the same Radeon RX 6800? Changing the number of layers offloaded has very little influence on performance, less than 10% going from 0 to 40 layers.
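To see exactly what OpenCL is exposing (and why two gfx1030 entries show up), the platforms and devices can be enumerated outside koboldcpp; a small sketch using pyopencl, which is an extra dependency and not something koboldcpp itself uses:

```python
# Sketch: list every OpenCL platform/device so the "--useclblast <platform> <device>"
# indices can be checked against what the driver actually reports.
import pyopencl as cl

for p_idx, platform in enumerate(cl.get_platforms()):
    for d_idx, device in enumerate(platform.get_devices()):
        print(f"Platform:{p_idx} Device:{d_idx} - {platform.name} with {device.name}")
        print(f"  global memory: {device.global_mem_size / 1024**3:.1f} GiB")
```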

harakiru commented 1 year ago

Either something is wrong or you are running out of VRAM and it's swapping to regular RAM. I also have an RX 6800 XT (which is also gfx1030) and I'm getting about 8T/s on a 13B model. With everything loaded on the GPU it uses about 11-12 GB of VRAM. I don't know why you have two gfx1030 devices; it might be a Windows thing.

LostRuins commented 1 year ago

It's also possible that the system is bottlenecked somewhere else (e.g. memory transfer)

ZacharyHu0 commented 1 year ago

> Either something is wrong or you are running out of VRAM and it's swapping to regular RAM. I also have an RX 6800 XT (which is also gfx1030) and I'm getting about 8T/s on a 13B model. With everything loaded on the GPU it uses about 11-12 GB of VRAM. I don't know why you have two gfx1030 devices; it might be a Windows thing.

I'm running a 30B model, which should certainly be slower. For 13B models I can get 6-8T/s using VRAM. I monitored VRAM usage during the tests, and offloading 0-30 layers (of 63 in total) certainly wouldn't use up 16 GB of VRAM. May I see your log for comparison? Many thanks!

harakiru commented 1 year ago

I've downloaded Wizard-Vicuna-30B-Uncensored.ggmlv3.q4_K_M and loaded it with 45 layers in VRAM, which takes up about 15GB of VRAM; anything more spills over and gets really slow. With default settings these are my results:
Processing Prompt (1 / 1 tokens)
Generating (80 / 80 tokens)
Time Taken - Processing:0.3s (349ms/T), Generation:27.2s (340ms/T), Total:27.6s (2.9T/s)
So your results seem to be normal; I thought you were trying to run a 13B model earlier.

LostRuins commented 1 year ago

300ms/T is very good for a 30B model btw. I get about double that timing.

FitzWM commented 1 year ago

I'm getting this same behavior across a range of 13B and 30B GGML models. Even with 13B models, where I can fit every layer into VRAM comfortably, actually doing so almost always slows down my response time considerably. The weird thing is that the effect seems to differ between processing and generation. Generation seems to benefit from having more / all layers in VRAM, whereas processing is much, much faster with a lower setting for gpulayers - in my case, around 20-25. The end result is that "optimizing" the value of gpulayers seems to speed things up overall, but I wonder if things could be improved even more by allowing different gpulayers settings for processing and generation. Of course, that could be completely out of scope for the way things work, for all I know.

Edit: I've also noticed that lowering threads from 8 to 6 (I'm using a 5800X with 8 cores and 16 threads) seems to provide a significant boost, as well. Not sure if the two are related.