ggerganov / llama.cpp

LLM inference in C/C++
MIT License

Unable to use Intel UHD GPU acceleration with BLAS #1761

Closed: Foul-Tarnished closed this issue 6 months ago

Foul-Tarnished commented 1 year ago

Expected Behavior

The GPU should be used when inferring.

Current Behavior

Here's how I built the software:

git clone https://github.com/ggerganov/llama.cpp
Then I extracted w64devkit-fortran somewhere, copied the required OpenBLAS files into its folders, ran w64devkit.exe, cd'd into my llama.cpp folder, and ran:
make LLAMA_OPENBLAS=1

After that, I followed the "Intel MKL" section of the README:

mkdir build
cd build
cmake .. -DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=Intel10_64lp -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx
cmake --build . --config Release

Finally, I ran the app with:

.\build\bin\Release\main.exe -m ./models/7B/ggml-model-q4_0.bin -n 128 --interactive-first --color --threads 4 --mlock

But my iGPU sits at 2-3% while my CPU is at 70-80% when inferring. Generation is a few words per second on 7B, which is not bad for a weak Intel laptop CPU.

Environment and Context

Foul-Tarnished commented 1 year ago

When loading main.exe with --mlock, GPU usage goes a bit higher, around 15%, but otherwise it never gets used. I tried with a really long prompt too.

Honestly, I don't even know whether Intel GPU acceleration is supported at all; the README structure is a mess, not gonna lie.

AlphaAtlas commented 1 year ago

Last I checked, Intel MKL is a CPU-only library. It will not use the IGP.

Also, AFAIK the "BLAS" part is only used for prompt processing. The actual text generation uses custom code for CPUs and accelerators.

You could load the IGP with CLBlast, but it might not actually speed things up because of the extra copies. There is not really a backend specifically targeting IGPs yet.

Yeah, the documentation is a bit lacking.

SlyEcho commented 1 year ago

The provided Windows build with CLBlast using OpenCL should work but I wouldn't expect any significant performance gains from integrated graphics.
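
For anyone who wants to try it anyway, here is a rough sketch of the CLBlast route, assuming the build option and environment variables documented in the README at the time:

# build with the OpenCL/CLBlast backend
make LLAMA_CLBLAST=1
# or: cmake .. -DLLAMA_CLBLAST=ON && cmake --build . --config Release

# pick the Intel OpenCL platform/device if several are installed,
# then offload some layers to the iGPU
GGML_OPENCL_PLATFORM=Intel GGML_OPENCL_DEVICE=0 ./main -m ./models/7B/ggml-model-q4_0.bin -n 128 --n-gpu-layers 16

On an iGPU the offloaded layers still live in the same physical RAM, so this mainly changes which device does the matrix multiplications.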

SlyEcho commented 1 year ago

copied the required OpenBLAS files into its folders ... After that, I followed the "Intel MKL" section

Which one did you actually use? Did it actually find the Intel MKL library? Because OpenBLAS doesn't give you Intel MKL. They are completely different.
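
For comparison, the two routes are configured completely separately. A minimal sketch, based on the README of the time (the MKL route also assumes the oneAPI environment has been initialized, e.g. via setvars.bat, so that CMake can find icx and MKL):

# OpenBLAS route (w64devkit + make)
make LLAMA_OPENBLAS=1

# Intel MKL route (CMake, run from an oneAPI shell)
cmake .. -DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=Intel10_64lp -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx
cmake --build . --config Release

Whichever binary you end up running, the system_info line it prints at startup should tell you whether a BLAS backend was compiled in (BLAS = 1).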

sunija-dev commented 1 year ago

👋 TL;DR: CLBlast might not be faster. If there is a way to speed up prompt evaluation on this system, I'd be highly interested.

I got a similar setup, though I use Koboldcpp (which is based on llama.cpp).

In general: generation speed is fine (0.5-1 tok/s maybe), but prompt evaluation is horrible (2 tok/s). With my roleplay starting prompt of 1000 tokens, it easily takes 500s. Afterwards it generates quickly (~1-2 words/s) for 2-3 messages (because apparently it can cache the context). When it deletes older messages from the context so that it stays within the context limit, it suddenly needs to re-evaluate 500-1000 tokens, which again takes minutes.

In Koboldcpp you can just select OpenBLAS or CLBlast (GPU 1). I tested with a 1087-token prompt and 87 generated tokens, using a 13B q4_0 model. CPU/GPU percentages are according to Task Manager (as far as I know those values can be hazy, especially for integrated GPUs).

OpenBLAS will not use the GPU; CPU is at 80%. Needs 300s.
CLBlast will use the GPU at 50-100% (it switches) and 80% CPU at first, then 40% CPU after some seconds. Needs 440s. Processing: 390.8s (360ms/T), Generation: 48.9s (753ms/T).

The next message will be quick (20s for 18 tokens evaluated and 26 generated). Time Taken - Processing: 6.2s (342ms/T), Generation: 14.1s (541ms/T), Total: 20.2s.

But after some messages, the previous prompt will change to accommodate the small context (I use SillyTavern, btw), and then it will re-evaluate much of the prompt, needing 300s again.

Physical hardware
Windows 10, tablet/laptop (Dell Latitude)
i5-8350U, 16 GB RAM, Intel UHD 620

Operating System
Windows 10 Pro

AlphaAtlas commented 1 year ago

👋 TL;DR: CLBlast might not be faster. If there is a way to speed up prompt evaluation on this system, I'd be highly interested.


TBH you should test a Vulkan backend like mlc-llm. There isn't really a good way to leverage UHD 620 in llama.cpp yet, especially with max context prompts like that.

sunija-dev commented 1 year ago

TBH you should test a Vulkan backend like mlc-llm. There isn't really a good way to leverage UHD 620 in llama.cpp yet, especially with max context prompts like that.

Thanks for the tip!

I tried it, but sadly it's slower than llama.cpp. :( But it does use the GPU to 100% (according to the task manager).

mlc-llm takes 220s to evaluate the prompt with their Vicuna 7B. llama.cpp takes 161s to evaluate the prompt with a 7B 4-bit model.

Also, from some llama.cpp tests: evaluating the prompt on a 13B model takes 250s (CPU only) to 380s (with a lot of GPU use). So, two learnings:

1) Evaluating the same prompt on 13B takes longer than on 7B (I don't know why).
2) The more the UHD GPU is used, the slower prompt evaluation gets. I guess because it takes time away from the CPU, and whatever llama.cpp does on the CPU is magic that is way faster than anything the GPU does...? I'd really like to know what's going on here. Maybe the GPU only accelerates 16-bit operations, so the CPU is faster because it can run the 4-bit stuff...? I really don't know.

SlyEcho commented 1 year ago

Maybe the GPU only accelerates 16-bit operations, so the CPU is faster because it can run the 4-bit stuff...?

The OpenCL code in llama.cpp can run 4-bit generation on the GPU now, too, but it requires the model to be loaded to VRAM, which integrated GPUs don't have or have very little.
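
If you want to try it, the offload is controlled by the layer count flag. A sketch, assuming a CLBlast build:

# offload e.g. 20 of the model's layers to the OpenCL device
./main -m ./models/13B/ggml-model-q4_0.bin -ngl 20 -n 128 ...

On an integrated GPU those layers still end up in system RAM, just allocated through the GPU driver.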

sunija-dev commented 1 year ago

The OpenCL code in llama.cpp can run 4-bit generation on the GPU now, too, but it requires the model to be loaded to VRAM, which integrated GPUs don't have or have very little.

According to the task manager there's 8 GB of Shared GPU memory/GPU memory. Does that count as VRAM in that context? Or does the Intel UHD 620 just have no VRAM?

Foul-Tarnished commented 1 year ago

It has no VRAM; it's just system RAM being used as VRAM. The BIOS does allocate some for it, but that's more for legacy purposes AFAIK; it will just use whatever it needs.

The good thing is that you don't need to copy data VRAM->RAM to access it on the CPU; it's always shared by both.
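
If you want to see what the driver actually reports, something like this works (a sketch assuming clinfo is installed):

clinfo | grep -iE "host unified memory|global memory size"

For an iGPU, host unified memory should be reported as yes, and the global memory size is just a slice of system RAM.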


SlyEcho commented 1 year ago

The good thing is that you don't need to copy data VRAM->RAM to access it on the CPU; it's always shared by both.

I don't think llama.cpp is optimized for this yet, so right now it will still copy the data. But the UHD 620 is really slow anyway.

sunija-dev commented 1 year ago

What I'm mostly wondering is: A) is it physically impossible to increase the speed by using the GPU, or B) is this just a software issue, because the current libraries don't exploit the parallelism of the integrated GPU correctly?

And would the speed-up bring the evaluation time down from 250s to ~60s? Anything less would still be almost unusable, so I wouldn't even bother.

I guess I mostly feel confused because I thought generation speed would be the limiting factor (as it seems to be on dedicated GPUs), not prompt evaluation. :/

SlyEcho commented 1 year ago

If you had a dedicated GPU, bringing down the prompt evaluation below 60s @ 1000 tokens is very much doable.
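
For reference, that figure assumes something like the cuBLAS build with most layers offloaded (a sketch, assuming an NVIDIA card and the build options from the README of the time):

make LLAMA_CUBLAS=1
./main -m ./models/13B/ggml-model-q4_0.bin -ngl 40 -n 128 ...

Prompt evaluation batches well on a dedicated GPU, which is exactly the part that is slow on your machine.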

AlphaAtlas commented 1 year ago

What I'm mostly wondering is: A) is it physically impossible to increase the speed by using the GPU, or B) is this just a software issue, because the current libraries don't exploit the parallelism of the integrated GPU correctly?

And would the speed-up bring the evaluation time down from 250s to ~60s? Anything less would still be almost unusable, so I wouldn't even bother.

I guess I mostly feel confused because I thought generation speed would be the limiting factor (as it seems to be on dedicated GPUs), not prompt evaluation. :/

Theoretically, some IGP-specific OpenCL code could be written to "partially" offload the CPU, since IGPs don't have to operate out of a separate memory pool:

https://laude.cloud/post/jupyter/

aseok commented 1 year ago

Built successfully according to the CLBlast section instructions on ubuntu-x64 with an Intel 6402P. Here's the output of running the train-text-from-scratch example:

ggml_opencl: selecting platform: 'Intel(R) OpenCL HD Graphics'
ggml_opencl: selecting device: 'Intel(R) HD Graphics 510'
ggml_opencl: device FP16 support: true
main: init model ...
used_mem model+cache: 1083036416 bytes
main: begin training
GGML_ASSERT: .../llama.cpp/ggml-opencl.cpp:1343: false
Aborted
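
I guess the assert means the training example hits an operation the OpenCL backend does not implement. Rebuilding without CLBlast and running the example on the CPU should work as a workaround (a sketch, same arguments as before):

make clean && make
./train-text-from-scratch ...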

ColonelPhantom commented 1 year ago

I believe that Intel oneMKL should actually run on an Intel GPU: https://www.intel.com/content/www/us/en/docs/oneapi/optimization-guide-gpu/2023-0/offloading-onemkl-computations-onto-the-gpu.html

tikikun commented 10 months ago

I believe that Intel oneMKL should actually run on an Intel GPU: https://www.intel.com/content/www/us/en/docs/oneapi/optimization-guide-gpu/2023-0/offloading-onemkl-computations-onto-the-gpu.html

I think we should bring this issue back; offloading at least the prompt eval to the iGPU would be very valuable.

github-actions[bot] commented 6 months ago

This issue was closed because it has been inactive for 14 days since being marked as stale.