Closed iamhumanipromise closed 4 months ago
@iamhumanipromise Sorry, the description of the idea isn't clear to me.
My understanding of this issue is that you want to use IPEX-LLM as a backend to support Intel GPUs, much like using TensorFlow/PyTorch as a backend.
If so, would it be faster than TensorFlow/PyTorch? Why not use TensorFlow/PyTorch directly?
IPEX LLM already supports llama.cpp I think: https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/llama_cpp_quickstart.html
Also, PyTorch's IPEX and OpenXLA both use Intel oneAPI SYCL, which is what llama.cpp's SYCL backend uses. So it is already supported.
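For anyone landing here, the upstream SYCL backend is built with the oneAPI compilers. A minimal sketch of a Linux build, assuming the flag and tool names used in README-sycl.md at the time of writing (they may have changed in newer versions):

```bash
# Sketch of an upstream llama.cpp SYCL build; flag names may differ by version.
# Requires the Intel oneAPI Base Toolkit to be installed.
source /opt/intel/oneapi/setvars.sh

cmake -B build \
  -DLLAMA_SYCL=ON \
  -DCMAKE_C_COMPILER=icx \
  -DCMAKE_CXX_COMPILER=icpx
cmake --build build --config Release -j

# List the SYCL devices the build can see (tool name taken from the SYCL docs;
# it may not be present in every version).
./build/bin/ls-sycl-device
```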
What IPEX-LLM has is a fork of llama.cpp and some other projects, with optimizations that have not been upstreamed here for one reason or another. I'm a current user of it, and it typically doubles the speed of upstream for me. However, it can't support mixed GPU + CPU scenarios, which is the main issue, and new model support may take a while to filter over. Hence why I keep both upstream and the fork for my use cases.
The SYCL backend is still focused on the missing functions needed to support more features and models. Performance optimization will be handled next, but we have little spare time to contribute, so I don't think progress will be quick. The SYCL backend has to cover several Intel GPUs: Max, Flex, Arc and the iGPU in MTL, so any performance optimization needs to be verified on all of them to make sure none of them regresses.
which issue/pull would you recommend we follow for latest info about the SYCL branch @NeoZhangJianyu?
EDIT: I take it that [SYCL] Refactor would be it?
They seem to be keeping it reasonably up to date, as their published version of LlamaCPP-IPEX so far uses a week-old upstream version as its baseline. I do hope they provide a bit of clarity about how to actually pull new versions of their IPEX branch, though... Also, interesting about the lack of GPU overflow/partial offload capability; I was not aware of that.
Going to respond to this since the other comment, from another person at Intel, was deleted. I think it should be working, but for some reason it fails: it forces you to fully offload in the case of something like Llama 3 8B, or faults on an illegal instruction for something bigger like Llama 3 70B or Command-R, which had just had support added, from what I tested. I haven't upgraded in a while, so I'll probably recheck this before opening a ticket in the other repository, since upstream works but the fork doesn't in this situation.
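For reference, partial offload in upstream llama.cpp is controlled with `-ngl` / `--n-gpu-layers`; the failure described above is with the IPEX-LLM fork of the same option. A hedged example (the model path and layer count are placeholders, and the binary was called `main` at the time; newer builds name it `llama-cli`):

```bash
# Partial offload sketch: put only some transformer layers on the GPU, rest on the CPU.
# Model path and layer count are illustrative, not taken from this thread.
./build/bin/main \
  -m models/llama-3-8b-instruct.Q6_K.gguf \
  -ngl 20 \
  -p "Hello" \
  -n 64
```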
Do they have a fork of llama.cpp on GitHub? I actually haven't found it; I just installed from the Read the Docs site that I linked to. Hell, I don't actually know how to go about updating the install; I just have a hypothesis about what I need to do.
Most of the stuff for IPEX-LLM has been upstreamed into llama.cpp. IPEX-LLM llama.cpp vs llama.cpp (upstream) is basically the same perf at this point. I think the question shouldn't be for IPEX-LLM support, but for SYCL support using upstream llama.cpp (which the IPEX-LLM team is already upstreaming into llama.cpp)
Also note that this doesn't require IPEX itself. IPEX-LLM does, but the native SYCL support does not.
And yes I work for Intel and yes I'm talking to IPEX-LLM teams and others :)
With a Q6_K quant of Llama 3 that had been quantized from a BF16 GGUF with the correct pre-tokenizer and EOS token, I get 30 tokens per second at the start of context with the IPEX branch, compared to 17 tokens per second with the llama.cpp SYCL build b2885. That's quite a stark difference in performance as I see it, and if it's possible, it'd be awesome to see the performance of the IPEX branch become generally available from the standard SYCL backend of llama.cpp, since installing the IPEX branch was troublesome.
So I'll be waiting with bated breath, I guess.
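For anyone wanting to reproduce this kind of comparison, llama-bench in upstream llama.cpp gives like-for-like numbers when run against both builds; a minimal sketch (the model path is a placeholder and flag defaults may vary by version):

```bash
# Run the same benchmark in the upstream SYCL build and in the IPEX-LLM fork, then compare.
# -p is the prompt length, -n the number of generated tokens, -ngl the offloaded layer count.
./build/bin/llama-bench \
  -m models/llama-3-8b.Q6_K.gguf \
  -ngl 99 \
  -p 512 \
  -n 128
```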
Yeah that’s fair. It definitely depends on model size etc.
Will work with the team to try to upstream it as soon as we can.
I suggest using the latest code on the master branch. There have been no obvious issues in the SYCL backend recently.
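Updating to the latest master is just a pull and a rebuild; a sketch assuming the build directory from the SYCL build steps earlier in the thread (a fresh configure may be needed if the build files changed):

```bash
# Pull the latest master and rebuild the existing SYCL build in place.
git pull origin master
cmake --build build --config Release -j
```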
Can also attest to differences between the SYCL build (as outlined in https://github.com/ggerganov/llama.cpp/blob/master/README-sycl.md) and the IPEX-LLM branch. Intel Arc A770M, Llama 3 8B Q8_0, full offload, with the prompt "Building a website can be done in 10 simple steps:\nStep 1:". Win11 and WSL2 Ubuntu SYCL builds get in the 4.3-4.9 tok/s range, while the WSL2 Ubuntu IPEX build from their branch gets 6.5-7.1 tok/s. Looking forward to upstreamed IPEX support!
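For context, a run like the one above would look roughly like this with the upstream build (binary name was `main` at the time, `llama-cli` in newer builds; the model path and token count are placeholders):

```bash
# Full offload of Llama 3 8B Q8_0 on an Intel Arc GPU via the SYCL build.
# -e makes the \n in the prompt be interpreted as a real newline.
./build/bin/main \
  -m models/llama-3-8b.Q8_0.gguf \
  -ngl 99 \
  -e -p "Building a website can be done in 10 simple steps:\nStep 1:" \
  -n 400
```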
This issue was closed because it has been inactive for 14 days since being marked as stale.
Prerequisites
Please answer the following questions for yourself before submitting an issue.
Feature Description
I have found a closed issue below where someone manually (how?) got IPEX-LLM working. However, I'm looking forward to native IPEX-LLM support for Intel Xe iGPUs and Intel Arc dGPUs on Windows and Linux.
https://github.com/ggerganov/llama.cpp/issues/7042
TL;DR: IPEX-LLM now provides a C++ interface, which can be used as a backend for running llama.cpp on Intel GPUs. Incorporating this interface into llama.cpp would allow it to leverage the optimized performance of IPEX-LLM.
Motivation
Intel Xe graphics launched in 2020. Flex and Max datacenter cards and Arc consumer cards for laptop and desktop launched in 2022. That is a lot of devices in production/circulation.
This would "permit" llama.cpp users to utilize their integrated Xe GPUs and dedicated Arc GPUs, Datacenter Flex and Max cards with llama.cpp on BOTH Windows and Linux natively (without a confusing manual build).
Possible Implementation
The implementation of native Intel IPEX-LLM support would be something like... Integrate --> Test --> Document --> Release.
Full manual/guide: https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/llama_cpp_quickstart.html
Full verified model list: https://ipex-llm.readthedocs.io/en/latest/#verified-models
GitHub: https://github.com/intel-analytics/ipex-llm
The "owners" of this process will be the devs and engineers here; in this Github (simple nerds such as myself do not have the expertise to tackle something like this... even locally)
For example, from the documentation it looks like this would be: create a new conda environment --> set up the environment --> configure oneAPI variables --> update CMakeLists.txt or the Makefile with paths to the IPEX-LLM library and headers --> then (?) map llama.cpp functionality to the IPEX APIs (which Intel has already done).
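To give a concrete feel for the user-side setup today, here is a rough sketch of the IPEX-LLM llama.cpp flow from the quickstart linked above; the package extra and helper command names are from memory and may have changed, so treat this as a sketch rather than a verified recipe:

```bash
# Rough sketch of the IPEX-LLM llama.cpp setup per the quickstart linked above.
# Names may have changed; verify against the guide before using.
conda create -n llm-cpp python=3.11
conda activate llm-cpp
pip install --pre --upgrade "ipex-llm[cpp]"

# Make the IPEX-LLM build of llama.cpp available in a working directory.
mkdir llama-cpp && cd llama-cpp
init-llama-cpp

# Configure oneAPI environment variables, then run as usual (placeholder model path).
source /opt/intel/oneapi/setvars.sh
./main -m ../models/model.gguf -ngl 99 -p "Hello" -n 64
```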
The "owners" of this step would be wide-ranging overall.
Documentation and Examples: someone would have to "own" updating the documentation to guide users on how to enable and use the new IPEX-LLM support. Providing examples and quickstart guides would help significantly, but ultimately it will be up to independent users, and GUI and TUI/CLI frontends will need to update their own documentation.
Release: after all of this has been done, move forward to launch. Woot woot.
I'm sure there are many, many steps I am missing here. Just wanted to "kick off" the process.