InAnYan / jabref

Graphical Java application for managing BibTeX and biblatex (.bib) databases
https://devdocs.jabref.org
MIT License

Let embedding model run on GPU #71

Open ThiloteE opened 3 months ago

ThiloteE commented 3 months ago

Historical "what the fuck" is available at https://github.com/JabRef/jabref/pull/11430#issuecomment-2209278098


Advantages:

Disadvantages:

ThiloteE commented 3 months ago

If implemented, let users choose the backend and hardware (CPU vs. GPU, and which GPU: GPU1, GPU2, GPU3, ...) in the preferences.
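A hypothetical sketch of what such a preference could look like (all names below are invented for illustration, not existing JabRef classes):

```java
// Invented names for illustration only; not part of JabRef's codebase.
public record AiHardwarePreferences(Backend backend, int deviceIndex) {

    public enum Backend { CPU, CUDA, VULKAN }

    public static AiHardwarePreferences defaults() {
        // CPU on device 0 as the safe default; GPU selection is opt-in.
        return new AiHardwarePreferences(Backend.CPU, 0);
    }
}
```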

InAnYan commented 2 months ago

Currently, langchain4j's in-process embedding models (meaning they run locally, inside the JVM) run only on the CPU. There is an open issue in langchain4j about running embedding models on the GPU, but it has not been resolved.
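For reference, a minimal sketch of how an in-process langchain4j embedding model is used today (CPU-only). The class comes from the langchain4j-embeddings-all-minilm-l6-v2 artifact; package paths vary between langchain4j versions:

```java
import dev.langchain4j.data.embedding.Embedding;
import dev.langchain4j.model.embedding.AllMiniLmL6V2EmbeddingModel;
import dev.langchain4j.model.embedding.EmbeddingModel;
import dev.langchain4j.model.output.Response;

public class CpuEmbeddingDemo {
    public static void main(String[] args) {
        // Runs inside the JVM via ONNX Runtime; no GPU option is exposed here.
        EmbeddingModel model = new AllMiniLmL6V2EmbeddingModel();
        Response<Embedding> response = model.embed("JabRef is free reference management software.");
        System.out.println("Dimensions: " + response.content().dimension());
    }
}
```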

In order to implement this, we have these choices:

  1. Wait for the implementation in langchain4j: simpler to develop and better from an architectural point of view.
  2. Write the fix for langchain4j ourselves: good.
  3. Use external modules and write all the support code ourselves in JabRef: the fastest way (see the sketch after this list).
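For option 3, a hedged sketch of what the support code could look like when calling ONNX Runtime's Java API directly. The model path is a placeholder, and addCUDA requires the onnxruntime_gpu artifact instead of the CPU-only one:

```java
import ai.onnxruntime.OrtEnvironment;
import ai.onnxruntime.OrtSession;

public class OnnxGpuSessionDemo {
    public static void main(String[] args) throws Exception {
        OrtEnvironment env = OrtEnvironment.getEnvironment();
        try (OrtSession.SessionOptions options = new OrtSession.SessionOptions()) {
            options.addCUDA(0); // run on GPU 0; throws if the CUDA provider is missing
            // "all-MiniLM-L6-v2.onnx" is a placeholder path to an exported model.
            try (OrtSession session = env.createSession("all-MiniLM-L6-v2.onnx", options)) {
                System.out.println("Model inputs: " + session.getInputNames());
            }
        }
    }
}
```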

It's a very good idea and we should look into it, but probably a bit later, once we finally release the AI chat and, maybe, add summarization.

I'll mark the issue as low-priority, but it's only low priority in this context: week 1 and the first release.

InAnYan commented 2 months ago

Actually, no, I'll remove the low-priority label and won't assign a milestone.

koppor commented 2 months ago

I'll collect it in the final "anything else" milestone, "final polishing" 😅

ThiloteE commented 2 months ago

GPU support (for embedding models) with llama.cpp:

ThiloteE commented 2 months ago

GPU support with the Deep Java Library (DJL): https://docs.djl.ai/engines/onnxruntime/onnxruntime-engine/index.html#install-gpu-package. Unfortunately, they also use Microsoft's ONNX Runtime, which seems to be very slow. I assume models need to be ONNX-compatible too, because not many models on Hugging Face are uploaded in the ONNX file format!
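As a rough sketch of the DJL route, under the assumption that DJL's Hugging Face model zoo exposes an ONNX build of the model (the model URL below is an assumption; check the zoo listing for the exact artifact):

```java
import ai.djl.Device;
import ai.djl.inference.Predictor;
import ai.djl.repository.zoo.Criteria;
import ai.djl.repository.zoo.ZooModel;

public class DjlGpuEmbeddingDemo {
    public static void main(String[] args) throws Exception {
        // ASSUMPTION: model URL follows DJL's Hugging Face zoo naming scheme.
        Criteria<String, float[]> criteria = Criteria.builder()
                .setTypes(String.class, float[].class)
                .optModelUrls("djl://ai.djl.huggingface.onnxruntime/sentence-transformers/all-MiniLM-L6-v2")
                .optEngine("OnnxRuntime")
                // Needs the GPU flavor of the onnxruntime-engine dependency
                // plus working CUDA drivers; otherwise falls over at load time.
                .optDevice(Device.gpu())
                .build();
        try (ZooModel<String, float[]> model = criteria.loadModel();
             Predictor<String, float[]> predictor = model.newPredictor()) {
            float[] vector = predictor.predict("JabRef is free reference management software.");
            System.out.println("Embedding dimensions: " + vector.length);
        }
    }
}
```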

koppor commented 1 month ago

At least, one can paint everything blue in the CPU utilization graph:

[screenshot: CPU utilization graph]

ThiloteE commented 5 days ago

One solution for providing GPU acceleration for LLMs (NOT necessarily for embedding models!) is to provide proper support for the OpenAI API. See issue https://github.com/JabRef/jabref/issues/11872. If users rely on external applications like llama.cpp, GPT4All, LM Studio, Ollama, Jan, KoboldCpp, etc., which already support GPU acceleration, there is no need to add and maintain this feature in JabRef. It would still be nice to have GPU acceleration for embedding models, though. Maybe do it like KoboldCpp and only provide a Vulkan backend, which is much, much smaller than a CUDA backend (~1.5 GB in PyTorch; 200-500 MB in llama.cpp).
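As a rough illustration of that route, a sketch of pointing langchain4j's OpenAI client at a local OpenAI-compatible server. The base URL and model name below are assumptions (Ollama, llama.cpp's server, LM Studio, etc. each expose their own endpoint), and the GPU work happens entirely in the external server:

```java
import dev.langchain4j.model.chat.ChatLanguageModel;
import dev.langchain4j.model.openai.OpenAiChatModel;

public class LocalOpenAiDemo {
    public static void main(String[] args) {
        ChatLanguageModel model = OpenAiChatModel.builder()
                .baseUrl("http://localhost:11434/v1") // ASSUMPTION: local Ollama endpoint
                .apiKey("unused")                      // local servers typically ignore the key
                .modelName("llama3")                   // ASSUMPTION: a locally pulled model
                .build();
        // JabRef would only speak the OpenAI wire protocol; no GPU code needed here.
        System.out.println(model.generate("Say hello to JabRef users."));
    }
}
```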
