Mozilla-Ocho / Memory-Cache

MemoryCache is an experimental development project to turn a local desktop environment into an on-device AI agent
Mozilla Public License 2.0

Update to use GPU-accelerated hardware instead of CPU-bound with gpt4all #6

Open misslivirose opened 8 months ago

misslivirose commented 8 months ago

Memory Cache should use a GPU that is available to do inference in order to speed up performance of queries and deriving insights from documents.

What I tried so far

I spent a few days last week exploring the differences between the primordial privateGPT version and the latest one. One of the major differences is that the newer version supports GPU inference for llama and gpt4all. The challenge I ran into is that moving from the older groovy.ggml model (no longer supported, since privateGPT now uses the .gguf format) to llama doesn't give the same results when ingesting and querying the same local file store.

This might be a matter of how RAG is implemented, something about how I set things up on my local machine, or a function of model choice.

I've lazily tried to resolve this through dependency changes, but I haven't had luck getting to a version that runs and supports both .ggml and GPU acceleration. From what I can tell, Nomic introduced a version of gpt4all that works on GPU in 2.4 (latest is 2.5+), but it's unclear whether there's a way to get this working cleanly with minimal changes to how my fork of privateGPT uses langchain to import the gpt4all package. It's also unclear to me whether this works on Ubuntu or whether it only works through the Vulkan APIs; I need to do some additional investigation.

I did get CUDA installed and verified that my GPU is properly detected and set up to run the sample projects provided by Nvidia.

What's next


I've been using a highly subjective test to evaluate:

Prompt: "What is the meaning of a life well-lived?"

Primordial privateGPT+groovy, augmented with my local files, consistently answers this question with a combination of "technology and community". No other combination of model/project has replicated that consistently.

misslivirose commented 8 months ago

Update: GPT4All 2.5.2 with snoozy fails the "life well-lived" test.

misslivirose commented 8 months ago

My quick attempt at trying to convert groovy from ggml to gguf using the llama.cpp utility did not work - it looks like this is a known impact of the swap to gguf, but I didn't have time today to investigate further.

I did find the model card for GPT4All-J to be helpful in explaining the specific iterations that led to groovy. New idea to test:

tomjorquera commented 6 months ago

Hey @misslivirose, I got curious about making the project work on GPU, so I spent some of my Sunday evening investigating the issue.

I managed to make the whole thing work with GPU by updating some of the dependencies and making some required changes. Feel free to look at my branch and pull from it as you please.

It's not all rosy, sadly; I hit some snags along the road (more on that below).

How to use with GPU

My changes introduce the USE_GPU env variable that controls GPU execution, as well as an additional MODEL_N_GPU_LAYERS that lets you choose the number of layers to run on the GPU (with LlamaCpp only; GPT4All doesn't support such a thing to my knowledge).

I tested the changes with both LlamaCpp and GPT4All models, both with and without GPU, and it seems to work well on my side.
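As a rough illustration of how these two variables could be read and turned into constructor arguments, here is a minimal sketch. It is not the code from the branch: the helper names `gpu_settings` and `llama_cpp_kwargs`, the accepted truthy values for USE_GPU, and the fallback of 0 layers are all assumptions made for the example. The only library-specific detail relied on is that LangChain's LlamaCpp wrapper accepts an `n_gpu_layers` argument.

```python
import os

# Hypothetical set of values treated as "enabled"; the branch may parse USE_GPU differently.
TRUTHY = {"1", "true", "yes", "on"}


def gpu_settings(env=None):
    """Read USE_GPU and MODEL_N_GPU_LAYERS from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    use_gpu = env.get("USE_GPU", "false").strip().lower() in TRUTHY
    raw_layers = env.get("MODEL_N_GPU_LAYERS", "").strip()
    # The layer count only matters when GPU execution is enabled.
    n_gpu_layers = int(raw_layers) if (use_gpu and raw_layers) else 0
    return {"use_gpu": use_gpu, "n_gpu_layers": n_gpu_layers}


def llama_cpp_kwargs(model_path, env=None):
    """Build keyword arguments for a LlamaCpp-style constructor."""
    settings = gpu_settings(env)
    kwargs = {"model_path": model_path}
    if settings["use_gpu"]:
        # Number of model layers to offload to the GPU.
        kwargs["n_gpu_layers"] = settings["n_gpu_layers"]
    return kwargs
```

With this sketch, something like `LlamaCpp(**llama_cpp_kwargs("model.gguf"))` would pick up the GPU settings from the environment, while a CPU-only run simply leaves `n_gpu_layers` unset.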

Installing llama-cpp-python properly

One very important note, however, is that you must set the correct env variable when installing llama-cpp-python the first time. So if you're ok with nuking your virtualenv you can simply do:

CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install -r requirements.txt

That will enable CUDA support.

If you've already installed llama-cpp-python without this env variable, reinstalling it will not work, as the cache will not be rebuilt. If for some reason you do not want to nuke your venv, the magic command to force reinstallation is:

CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python --upgrade --force-reinstall --no-cache-dir

Issue: Chroma DB needs to be recreated

Sadly, I had to update Chroma to a version with breaking changes, so you will need to recreate the DB.

The Chroma API also changed a lot in the meantime, so I had to make some modifications to make it work.

Personally, I would do away with the use of LangChain here and use the Chroma bindings directly. The API is really straightforward, and the LangChain wrapper only adds complexity without any additional value here.
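To give an idea of what using the Chroma bindings directly might look like, here is a minimal sketch, assuming a recent `chromadb` release that provides `PersistentClient`. The collection name `memory_cache`, the helper names, and the id scheme are made up for the example, and it relies on Chroma's default embedding function rather than whatever embeddings the project actually configures.

```python
def build_records(chunks, source):
    """Turn a list of text chunks into the parallel lists that Chroma's add() expects."""
    indices = range(len(chunks))
    return {
        "ids": [f"{source}-{i}" for i in indices],  # hypothetical id scheme
        "documents": list(chunks),
        "metadatas": [{"source": source, "chunk": i} for i in indices],
    }


def ingest_and_query(db_dir, chunks, source, question, n_results=4):
    """Store chunks in a persistent collection, then run a similarity query."""
    import chromadb  # imported lazily so build_records works without chromadb installed

    client = chromadb.PersistentClient(path=db_dir)
    collection = client.get_or_create_collection("memory_cache")  # hypothetical name
    collection.add(**build_records(chunks, source))
    # Never ask for more results than the collection holds.
    limit = min(n_results, collection.count())
    return collection.query(query_texts=[question], n_results=limit)
```

The ingest and query steps are each a single call on the collection, which is the main argument for dropping the wrapper layer.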

Issue: GPT4All streaming is broken in latest LangChain FIXED!

LangChain went through multiple refactorings, and it seems that at some point streaming support for GPT4All broke. I found a way to re-enable it and created an issue with a PR for that, but I'm not optimistic it will be fixed soon, given that:

I don't have a satisfying solution for that (except by doing away with LangChain completely).

EDIT: scratch all that, my proposed change has just been merged. So once it is released, the issue should be fixable by adding streaming=True to the GPT4All constructor. The LangChain dev also shared a nice solution to do away with the callback in the latest version.

EDIT2: My fix has been released with v0.1.3. I updated my branch with latest version (0.1.4) and re-enabled streaming with GPT4All.
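For reference, a minimal sketch of what enabling streaming could look like once the fix is available. Only `streaming=True` on the GPT4All constructor comes from the discussion above; the helper names are hypothetical, and pairing it with the stdout callback handler is an assumption about one way to consume the tokens, not necessarily what the branch does.

```python
def gpt4all_kwargs(model_path, stream=True):
    """Build keyword arguments for LangChain's GPT4All wrapper."""
    kwargs = {"model": model_path}
    if stream:
        # The option re-enabled by the fix mentioned above.
        kwargs["streaming"] = True
    return kwargs


def build_streaming_llm(model_path):
    """Create a GPT4All LLM that prints tokens to stdout as they are generated."""
    # Lazy imports so the kwargs helper can be used without LangChain installed.
    from langchain_community.llms import GPT4All
    from langchain_core.callbacks.streaming_stdout import StreamingStdOutCallbackHandler

    return GPT4All(
        callbacks=[StreamingStdOutCallbackHandler()],
        **gpt4all_kwargs(model_path),
    )
```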

Issue: GPT4All-j still doesn't work

So, as you mentioned previously, GPT4All-j has not been migrated to gguf.

Looking around in the GPT4All-j repo I noticed a promising script gpt4all-backend/script/, but it was badly broken.

I reported the issue and proposed a fix, and I managed to use that to generate a (seemingly) valid gguf file from the original GPT4All-J model (downloaded from HF).

However, trying to quantize the model using gpt4all-backend/llama.cpp-mainline/quantize fails, and the unquantized model is simply too big for me to try.

Potential solution: adding support for HF models directly

So sadly I didn't find a way to make it work. What can be done, however, is adding support for HF models and running the original project directly. This will probably not give you the same results as the model you were using, but if you really want to continue using GPT4All-J specifically, that could be a solution.

Just for fun, I did that using HuggingFacePipeline in my branch, which builds on the previous one and adds support for the "HF" MODEL_TYPE.

For the MODEL_PATH, you can give a local project downloaded from HF, or an HF repo identifier. In the latter case, it will fetch the files from HF on first use (and since nomic-ai/gpt4all-j is public, you don't need to set an auth token). I've set it up so that it does 4bit quantization on the fly if you enable GPU support (this probably needs to be more configurable, however).

Note, however, that it will not do any quantization whatsoever if running on CPU, so it's probably basically useless to try to use it that way.
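A rough sketch of that kind of loader is shown below. It is an illustration under assumptions, not the code from the branch: the helper names are hypothetical, the 4-bit path assumes `bitsandbytes` and `accelerate` are installed alongside `transformers`, and `max_new_tokens=256` is an arbitrary value picked for the example.

```python
def hf_load_plan(use_gpu):
    """Decide how to load the model: 4-bit quantized on GPU, plain weights on CPU."""
    return {
        "quantize_4bit": bool(use_gpu),
        "device_map": "auto" if use_gpu else None,
    }


def load_hf_llm(model_path, use_gpu):
    """Load a local HF project or a repo id such as nomic-ai/gpt4all-j as a LangChain LLM."""
    # Lazy imports so hf_load_plan stays usable without the heavy dependencies.
    from transformers import (
        AutoModelForCausalLM,
        AutoTokenizer,
        BitsAndBytesConfig,
        pipeline,
    )
    from langchain_community.llms import HuggingFacePipeline

    plan = hf_load_plan(use_gpu)
    load_kwargs = {}
    if plan["quantize_4bit"]:
        # On-the-fly 4-bit quantization; only applied when GPU support is enabled.
        load_kwargs["quantization_config"] = BitsAndBytesConfig(load_in_4bit=True)
        load_kwargs["device_map"] = plan["device_map"]

    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForCausalLM.from_pretrained(model_path, **load_kwargs)
    pipe = pipeline(
        "text-generation", model=model, tokenizer=tokenizer, max_new_tokens=256
    )
    return HuggingFacePipeline(pipeline=pipe)
```

On CPU the plan disables quantization entirely, which matches the caveat above: the full-precision model is loaded as-is and will likely be too slow and too large to be practical.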

Let me know if you take it for a spin :smile:

Fun project to work on. Hope that helps!

tomjorquera commented 6 months ago

Update: My fix for GPT4All streaming has been released with langchain v0.1.3. I updated my branch to the latest version (0.1.4) and made use of the relevant option. So now GPT4All streaming works again :slightly_smiling_face: