Closed QIANXUNZDL123 closed 2 weeks ago
The OpenCL backend is basically abandoned. Use CUDA/HIP for best performance, or Vulkan if you need something that works across platforms.
@slaren @ggerganov should we at this point just remove OpenCL?
I agree, it is too broken to be useful, and any effort to fix it would be better spent on the Vulkan backend instead.
Yes, let's remove the OpenCL backend
Thanks for the answer, I'll try Vulkan in my project
What happened?
I used Termux to compile llama.cpp with GPU support, and I found that responses have become slower and they are all errors.
For example, an error response:
Can anyone tell me what the reason is?
Name and Version
./bin/main -t 8 -ngl 33 -m ../llama-2-7b-chat.Q4_0.gguf --color -n -1 -ins -b 256
environment: linux+termux GPU: Qualcomm 8gen2
What operating system are you seeing the problem on?
No response
Relevant log output
No response
I am using OpenCL and it is working fine here, though I am using a different model.
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors: CPU buffer size = 85.94 MiB
llm_load_tensors: OpenCL buffer size = 4474.93 MiB
main -m /opt/mymodels/llama-2-7b.Q5_K_M.gguf -ngl 1024 -c 512 -b 1024 -n 256 --keep 48 --repeat_penalty 1.0 --color -i -t 4 -r User: -f ./llama/prompts/chat-with-bob.txt
User:Who is Elon Musk? Bob: Elon Musk is a South African-born Canadian-American entrepreneur and investor. He is the CEO and CTO of SpaceX, co-founder and CEO of Neuralink, and co-founder and CEO of Tesla, Inc.
Yes, let's remove the OpenCL backend
What about people like me using an AMD GPU?
I think abandoning an API without identifying the root cause doesn't seem like a good idea.
You can use HIP or Vulkan backends with AMD hardware.
You can use HIP or Vulkan backends with AMD hardware.
Why abandon an API that was designed for heterogeneous computing? You write your code against one API and it runs everywhere.
The main issue is that the OpenCL backend is not being actively maintained because there aren't any developers with an interest in it, and over time this has caused it to become hopelessly outdated. At this point, we have better alternatives for all the use cases of the OpenCL backend, so there is little reason to put any effort into it. However, if you want the OpenCL backend to continue existing in llama.cpp, the best thing you could do is volunteer to maintain it. Asking other people to volunteer is not really going to achieve anything.
The main issue is that the OpenCL backend is not being actively maintained because there aren't any developers with an interest in it, and over time this has caused it to become hopelessly outdated. At this point, we have better alternatives for all the use cases of the OpenCL backend, so there is little reason to put any effort into it. However, if you want the OpenCL backend to continue existing in llama.cpp, the best thing you could do is volunteer to maintain it. Asking other people to volunteer is not really going to achieve anything.
I don't mind contributing, granted I am new to ML, but do have experience in C/C++.
You can use HIP or Vulkan backends with AMD hardware.
Why abandon an API that was designed for heterogeneous computing? You write your code against one API and it runs everywhere.
I wrote the OpenCL backend. OpenCL isn't in a good position anymore, lagging behind in features and vendor support. For example, memory pinning the way CUDA does it, which is very useful for partial offload, isn't possible.
That's why I decided last year to build the Vulkan backend instead of continuing to develop the OpenCL one. Vulkan also means you write your code once and it runs anywhere, and it's more modern and better supported by vendors like Nvidia.
What about people like me using an AMD GPU?
Even if your AMD devices don't support HIP the Vulkan implementation should be faster than OpenCL for both prompt processing and inference. Currently the only devices that benefit from OpenCL support are old Arm embedded machines and pre-GCN AMD graphics cards.
Interestingly, for one of my use cases, CLBlast back-end can be a little bit faster than Vulkan.
model | size | params | backend | ngl | n_ubatch | test | t/s |
---|---|---|---|---|---|---|---|
llama 70B Q5_K - Medium | 46.51 GiB | 70.55 B | Vulkan | 0 | 256 | pp2048 | 14.23 ± 0.03 |
llama 70B Q5_K - Medium | 46.51 GiB | 70.55 B | Vulkan | 0 | 512 | pp2048 | 21.04 ± 0.17 |
llama 70B Q5_K - Medium | 46.51 GiB | 70.55 B | Vulkan | 0 | 1024 | pp2048 | 26.87 ± 0.08 |
model | size | params | backend | ngl | n_ubatch | test | t/s |
---|---|---|---|---|---|---|---|
llama 70B Q5_K - Medium | 46.51 GiB | 70.55 B | OpenCL | 0 | 256 | pp2048 | 13.58 ± 0.01 |
llama 70B Q5_K - Medium | 46.51 GiB | 70.55 B | OpenCL | 0 | 512 | pp2048 | 21.05 ± 0.02 |
llama 70B Q5_K - Medium | 46.51 GiB | 70.55 B | OpenCL | 0 | 1024 | pp2048 | 28.02 ± 0.02 |
While on default settings the speed is the same, OpenCL seems to benefit more from increased batch size.
Interestingly, for one of my use cases, CLBlast back-end can be a little bit faster than Vulkan.
That's because no offloaded layers is the one case where OpenCL has feature parity with Vulkan, and since it uses CLBlast (a fully optimized BLAS library), it is faster than my matmul shaders. That could be fixed with some optimization, eventually.
What is your environment, AMD GPU or ARM GPU? I'm using OpenCL in Termux, but it's very slow; my environment is an 8Gen2.
AMD dGPU
Yes, let's remove the OpenCL backend
I was looking through the source code for ggml-opencl.cpp. Is there a particular reason models are dequantized to FP32 before performing matmul calculations? Why can't they be dequantized to FP16 or lower?
Only a few GPUs support 16-bit floats in OpenCL; for example, Nvidia GPUs don't. I tried it once, I think, but it didn't help with speed either. I don't know what you mean by "or lower".
@ggerganov @0cc4m Thank you very much for all your effort to make llama.cpp and OpenCL happen. @0cc4m We were once in contact while tracking down problems on the A770: https://github.com/CNugteren/CLBlast/issues/533

For 16-bit support, Nvidia told me in a bug report that they are not fully compliant with cl_khr_fp16 for built-in functions such as sine and cosine, hence there is no detectable cl_khr_fp16 flag. But they do support using #pragma OPENCL EXTENSION cl_khr_fp16 : enable if we want FP16 without those functions. https://us.download.nvidia.com/Windows/522.25/522.25-win11-win10-release-notes.pdf

I would sincerely ask to keep the OpenCL backend for the moment, as a tuned GEMM is still waiting on new open-source implementations in Metal/Vulkan/HIP. (After some investigation, GEMM is not as easy as it seems. I tried a simple Mutual Information Neural Estimator PyTorch program (which is one of my areas) using ROCm under Linux, and a 7900 XTX can't even compete with a 4060 Ti, I guess mainly due to AMD's poor GEMM implementation, like they once had with clBLAS.)

I wish I had the time to maintain the OpenCL backend, but I cannot. I was involved in projects like CLBlast, VkFFT, PyTorch dlprimitives, as well as Octave OCL. My own project is https://sourceforge.net/projects/octave-ocl-extra/ with no aim other than keeping OpenCL alive until a really good alternative is out.

I wish we would support open-source alternatives rather than converging on proprietary libraries and vendor-specific languages, like Blender did before blaming OpenCL: https://cgicoffee.com/blog/2021/11/blender-3-removes-opencl-improves-cuda-optix-support Meanwhile, companies like Adobe, Wolfram Mathematica, and Qualcomm are supporting OpenCL in applications and mobile devices.

Best wishes, Jinchaun
I wish I had the time to maintain the OpenCL backend, but I cannot.
The trouble is exactly that no one worked on the CLBlast back-end enough to keep it in shape.
Yes, let's remove the OpenCL backend
Now, without OpenCL, there's almost no optimization at all for regular Windows computers with TPUs. OpenBLAS doesn't help much. https://github.com/ggerganov/whisper.cpp/issues/2303
OpenCL never ran on TPUs, only on GPUs, same as Vulkan.
This issue was closed because it has been inactive for 14 days since being marked as stale.