ggerganov / llama.cpp

LLM inference in C/C++
MIT License

When using GPU (OpenCL), the reply speed is slower and all replies are incorrect?? #7661

Closed. QIANXUNZDL123 closed this issue 2 weeks ago.

QIANXUNZDL123 commented 4 months ago

What happened?

I used Termux to compile llama.cpp with GPU (OpenCL) support, and I found that the responses are slower and all of them are wrong.

Example prompt and error response: [screenshots attached in the original issue]

Can anyone tell me what the reason is?

Name and Version

./bin/main -t 8 -ngl 33 -m ../llama-2-7b-chat.Q4_0.gguf --color -n -1 -ins -b 256

Environment: Linux + Termux; GPU: Qualcomm 8 Gen 2

What operating system are you seeing the problem on?

No response

Relevant log output

No response

JohannesGaessler commented 4 months ago

The OpenCL backend is basically abandoned. Use CUDA/HIP for best performance, or Vulkan if you need something that works across platforms.

@slaren @ggerganov should we at this point just remove OpenCL?

slaren commented 4 months ago

I agree, it is too broken to be useful, and any effort to fix it should go into the vulkan backend instead.

ggerganov commented 4 months ago

Yes, let's remove the OpenCL backend

QIANXUNZDL123 commented 4 months ago

> The OpenCL backend is basically abandoned. Use CUDA/HIP for best performance, or Vulkan if you need something that works across platforms.
>
> @slaren @ggerganov should we at this point just remove OpenCL?

Thanks for the answer, I'll try Vulkan in my project

acbits commented 4 months ago

> I used Termux to compile llama.cpp with GPU (OpenCL) support, and I found that the responses are slower and all of them are wrong. [...]

I am using OpenCL and it is working fine here, though I am using a different model.

llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:        CPU buffer size =    85.94 MiB
llm_load_tensors:     OpenCL buffer size =  4474.93 MiB

main -m /opt/mymodels/llama-2-7b.Q5_K_M.gguf -ngl 1024 -c 512 -b 1024 -n 256 --keep 48 --repeat_penalty 1.0 --color -i -t 4 -r User: -f ./llama/prompts/chat-with-bob.txt

User:Who is Elon Musk? Bob: Elon Musk is a South African-born Canadian-American entrepreneur and investor. He is the CEO and CTO of SpaceX, co-founder and CEO of Neuralink, and co-founder and CEO of Tesla, Inc.

acbits commented 4 months ago

> Yes, let's remove the OpenCL backend

What about people like me using an AMD GPU?

I think abandoning an API without identifying the root cause doesn't seem like a good idea.

slaren commented 4 months ago

You can use HIP or Vulkan backends with AMD hardware.

acbits commented 4 months ago

> You can use HIP or Vulkan backends with AMD hardware.

Why abandon an API that was designed for heterogeneous computing? You write your code against one API and it runs everywhere.

slaren commented 4 months ago

The main issue is that the OpenCL backend is not being actively maintained because there aren't any developers with an interest in it, and over time this has caused it to become hopelessly outdated. At this point we have better alternatives for all the use cases of the OpenCL backend, so there is little reason to put any effort into it. However, if you want the OpenCL backend to continue existing in llama.cpp, the best thing you could do is volunteer to maintain it. Asking other people to volunteer is not really going to achieve anything.

acbits commented 4 months ago

> The main issue is that the OpenCL backend is not being actively maintained because there aren't any developers with an interest in it [...] the best thing you could do is volunteer to maintain it.

I don't mind contributing; granted, I am new to ML, but I do have experience in C/C++.

0cc4m commented 4 months ago

> You can use HIP or Vulkan backends with AMD hardware.
>
> Why abandon an API that was designed for heterogeneous computing? You write your code against one API and it runs everywhere.

I wrote the OpenCL backend. OpenCL isn't in a good position anymore, lagging behind in features and vendor support. For example, memory pinning the way CUDA does it, which is very useful for partial offload, isn't possible.

That's why I decided last year to build the Vulkan backend instead of continuing to develop the OpenCL one. Vulkan also means you write your code once and it runs anywhere, and it's more modern and better supported by vendors like Nvidia.
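For readers who haven't hit this: below is a minimal sketch of the pinning distinction, assuming the standard OpenCL 1.2 host API. It is illustrative only and not llama.cpp code. OpenCL can hand you a driver-allocated, mappable buffer, but unlike CUDA's cudaHostRegister it offers no portable way to pin a host allocation that already exists, which is the pattern partial offload relies on.

```c
/* Illustrative sketch, not llama.cpp code: the closest OpenCL gets to
 * CUDA-style pinned memory is asking the driver to allocate the host
 * buffer itself and then mapping it into the host address space. */
#include <CL/cl.h>

void *alloc_pinned_like(cl_context ctx, cl_command_queue q,
                        size_t size, cl_mem *out_buf) {
    cl_int err;
    /* Driver-owned allocation, typically page-locked by the implementation. */
    *out_buf = clCreateBuffer(ctx, CL_MEM_ALLOC_HOST_PTR | CL_MEM_READ_WRITE,
                              size, NULL, &err);
    if (err != CL_SUCCESS) return NULL;
    /* Map it so the host can read/write it directly; transfers from this
     * region can usually be DMA'd without an extra staging copy. */
    void *host_ptr = clEnqueueMapBuffer(q, *out_buf, CL_TRUE,
                                        CL_MAP_READ | CL_MAP_WRITE,
                                        0, size, 0, NULL, NULL, &err);
    return err == CL_SUCCESS ? host_ptr : NULL;
}
/* There is no OpenCL counterpart to cudaHostRegister() for pinning an
 * existing malloc'd buffer, e.g. model weights already loaded in RAM. */
```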

netrunnereve commented 4 months ago

> What about people like me using an AMD GPU?

Even if your AMD devices don't support HIP, the Vulkan implementation should be faster than OpenCL for both prompt processing and inference. Currently the only devices that benefit from OpenCL support are old Arm embedded machines and pre-GCN AMD graphics cards.

shibe2 commented 4 months ago

Interestingly, for one of my use cases, CLBlast back-end can be a little bit faster than Vulkan.

| model | size | params | backend | ngl | n_ubatch | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| llama 70B Q5_K - Medium | 46.51 GiB | 70.55 B | Vulkan | 0 | 256 | pp2048 | 14.23 ± 0.03 |
| llama 70B Q5_K - Medium | 46.51 GiB | 70.55 B | Vulkan | 0 | 512 | pp2048 | 21.04 ± 0.17 |
| llama 70B Q5_K - Medium | 46.51 GiB | 70.55 B | Vulkan | 0 | 1024 | pp2048 | 26.87 ± 0.08 |
| llama 70B Q5_K - Medium | 46.51 GiB | 70.55 B | OpenCL | 0 | 256 | pp2048 | 13.58 ± 0.01 |
| llama 70B Q5_K - Medium | 46.51 GiB | 70.55 B | OpenCL | 0 | 512 | pp2048 | 21.05 ± 0.02 |
| llama 70B Q5_K - Medium | 46.51 GiB | 70.55 B | OpenCL | 0 | 1024 | pp2048 | 28.02 ± 0.02 |

While on default settings the speed is the same, OpenCL seems to benefit more from increased batch size.

0cc4m commented 4 months ago

> Interestingly, for one of my use cases, CLBlast back-end can be a little bit faster than Vulkan.

That's because with no offloaded layers, it's the one case where OpenCL has feature parity with Vulkan, and since it uses CLBlast (a fully optimized BLAS library) it is faster than my matmul shaders. That could be fixed with some optimization, eventually.

QIANXUNZDL123 commented 4 months ago

> Interestingly, for one of my use cases, CLBlast back-end can be a little bit faster than Vulkan. [...]


What is your environment, an AMD GPU or an Arm GPU? I'm using OpenCL in Termux, but it's very slow, and my device is an 8 Gen 2.

shibe2 commented 4 months ago

AMD dGPU

jetro30087 commented 3 months ago

> Yes, let's remove the OpenCL backend

I was looking through the source code for ggml-openCL.cpp. Is there a particular reason models are dequantized to FP32 before performing matmul calculations? Why can't they be dequantized to FP16 or lower?

0cc4m commented 3 months ago

> Yes, let's remove the OpenCL backend
>
> I was looking through the source code for ggml-openCL.cpp. Is there a particular reason models are dequantized to FP32 before performing matmul calculations? Why can't they be dequantized to FP16 or lower?

Only a few GPUs support 16-bit floats in OpenCL; Nvidia GPUs, for example, don't. I tried it once, I think, but it didn't help with speed either. I don't know what you mean by "or lower".
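As a concrete illustration of the FP16 point (a sketch only, not code from ggml-openCL.cpp, and the block layout below is hypothetical): an OpenCL C kernel can only use half as a compute type when the device exposes cl_khr_fp16 and the pragma is enabled, which is exactly what most Nvidia OpenCL drivers don't advertise.

```c
/* Illustrative OpenCL C sketch, not from ggml-openCL.cpp. The block layout
 * (32 signed 8-bit weights sharing one half-precision scale) is hypothetical
 * but Q8_0-like. Requires a device that exposes cl_khr_fp16. */
#pragma OPENCL EXTENSION cl_khr_fp16 : enable

typedef struct {
    half d;       /* per-block scale */
    char qs[32];  /* quantized weights */
} block_q8_0_h;

__kernel void dequantize_q8_0_to_f16(__global const block_q8_0_h *x,
                                     __global half *y) {
    const uint idx = get_global_id(0);  /* one work-item per weight */
    const uint ib  = idx / 32;          /* which block */
    const uint iq  = idx % 32;          /* position inside the block */
    /* Keep the intermediate in half; the existing FP32 path would widen
     * both the scale and the product to float here. */
    y[idx] = x[ib].d * (half) x[ib].qs[iq];
}
```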

tangjinchuan commented 3 months ago

@ggerganov @0cc4m Thank you very much for all your effort in making llama.cpp and OpenCL happen. @0cc4m We were once in contact while tracking down problems on the A770: https://github.com/CNugteren/CLBlast/issues/533

On 16-bit support: NVIDIA told me in a bug report that they are not fully compliant with cl_khr_fp16 for built-in functions such as sine and cosine, hence the cl_khr_fp16 flag is not advertised and cannot be detected. However, they do support #pragma OPENCL EXTENSION cl_khr_fp16 : enable if we want FP16 without those functions: https://us.download.nvidia.com/Windows/522.25/522.25-win11-win10-release-notes.pdf

I would sincerely ask that the OpenCL backend be kept for the moment, as a tuned GEMM is still waiting for new open-source implementations in Metal/Vulkan/HIP. (After some investigation, GEMM is not as easy as it seems. I tried a simple Mutual Information Neural Estimator PyTorch program, which is one of my areas, using ROCm under Linux, and a 7900 XTX can't even compete with a 4060 Ti. I suspect this is mainly due to AMD's poor GEMM implementation, as was once the case with clBLAS.)

I wish I had the time to maintain the OpenCL backend, but I cannot. I have been involved in projects like CLBlast, VkFFT, PyTorch DLPrimitives, as well as Octave OCL. My own project is https://sourceforge.net/projects/octave-ocl-extra/ with no goal other than keeping OpenCL alive until some really good alternative is out. I wish we would support open-source alternatives rather than converging on proprietary libraries and vendor-specific languages, as Blender did before blaming OpenCL: https://cgicoffee.com/blog/2021/11/blender-3-removes-opencl-improves-cuda-optix-support Meanwhile, companies like Adobe, Wolfram Mathematica and Qualcomm are supporting OpenCL in their applications and mobile devices.

Best wishes, Jinchuan
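For completeness, here is the host-side check being discussed (again an illustrative sketch, not project code): the usual way to detect FP16 support is to look for cl_khr_fp16 in CL_DEVICE_EXTENSIONS, and this is the query that comes back empty on Nvidia even though the compiler accepts the pragma.

```c
/* Illustrative host-side check, not llama.cpp code: does this device
 * advertise cl_khr_fp16? On Nvidia this typically returns 0 even though
 * the cl_khr_fp16 pragma is accepted by the kernel compiler. */
#include <CL/cl.h>
#include <string.h>

int device_advertises_fp16(cl_device_id dev) {
    char exts[8192] = {0};  /* large enough for typical extension strings */
    if (clGetDeviceInfo(dev, CL_DEVICE_EXTENSIONS,
                        sizeof(exts) - 1, exts, NULL) != CL_SUCCESS) {
        return 0;
    }
    return strstr(exts, "cl_khr_fp16") != NULL;
}
```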

shibe2 commented 3 months ago

> I wish I had the time to maintain the OpenCL backend, but I cannot.

The trouble is exactly that no one has worked on the CLBlast back-end enough to keep it in shape.

thewh1teagle commented 2 months ago

> Yes, let's remove the OpenCL backend

Now, without OpenCL, there's almost no optimization at all for regular Windows computers with TPUs. OpenBLAS doesn't help much. https://github.com/ggerganov/whisper.cpp/issues/2303

0cc4m commented 2 months ago

> Yes, let's remove the OpenCL backend
>
> Now, without OpenCL, there's almost no optimization at all for regular Windows computers with TPUs. OpenBLAS doesn't help much. ggerganov/whisper.cpp#2303

OpenCL never ran on TPUs, only on GPUs, same as Vulkan.

github-actions[bot] commented 2 weeks ago

This issue was closed because it has been inactive for 14 days since being marked as stale.