abgulati / LARS

An application for running LLMs locally on your device, with your documents, facilitating detailed citations in generated responses.
https://www.youtube.com/watch?v=Mam1i86n8sU&ab_channel=AbheekGulati
GNU Affero General Public License v3.0
473 stars 34 forks

Problem with the "setup_for_llama_cpp_response" method #28

Closed jessemcg closed 1 week ago

jessemcg commented 2 weeks ago

Thank you so much for your hard work. I feel extremely close to getting this working. The llama model loads and the PDF processing works great. But when I ask a question, the page at localhost:5000 returns an error saying: "There was an error when setting up the streaming response in the method /setup_for_llama_cpp_response, more details can be viewed in the browser's console." The browser console shows that it received "text/html" instead of JSON (see below). I believe it might be a Flask issue, but I don't understand Flask well enough to figure out the problem.

I am also including a screenshot of the server log, which mentions an unexpected keyword argument "embedding_fn".

I am on Fedora 39/Linux with a CUDA GPU. All dependencies were installed in a virtual environment with Python 3.11.9. I was using gemma-2-9b (and I chose the corresponding prompt template), but I also tried Mistral with the same result. Any suggestions on how to fix this would be much appreciated.

[Screenshot: browser console error] [Screenshot: server log]

abgulati commented 2 weeks ago

Hi, thanks for your supportive comment and for your efforts in getting this running! I've really tried to make it as easy to run as possible: clone, install requirements and launch. Unfortunately I'm getting a bunch of comments about issues with the requirements.txt files. I believe the issue in your screenshot is one of missing dependencies too.

I honestly haven't had a chance to test on Fedora/DNF. Give me some time: I'll investigate this, update the requirements.txt files and refresh the container, which will make deployment easier since you'll be able to build the container yourself. The current Docker build is unfortunately outdated and does not contain the improvements of LARS v2.

I'll drop an update here once I've looked into and resolved this 🍻

jessemcg commented 2 weeks ago

Awesome, thanks.

abgulati commented 2 weeks ago

@jessemcg A new release with updated requirements is now available, please give it a spin and report your experience: https://github.com/abgulati/LARS/releases/tag/v2.0-beta7

Thank you!

abgulati commented 2 weeks ago

@jessemcg I have now verified the requirements installation on Windows and Ubuntu: https://github.com/abgulati/LARS/blob/v2.0-beta8/requirements_linux.txt

YMMV on Fedora/DNF & Mac though!

Several unnecessary requirements have been removed, along with version specs for a couple of others. The file encoding has also been updated to UTF-8, so you can use Nano for any required edits.

A few more refinements have been merged in as well, resulting in v2.0-beta8: a fast follow-up to this morning's emergency v2.0-beta7 update.

Do re-try and drop an update.

jessemcg commented 1 week ago

It worked on Fedora 39 Linux. Thank you so much for the quick update. There was a very minor obstacle at the beginning where it did not automatically create the base directory, but that is easily fixed by creating it manually and updating the JSON config file.

In case this is helpful to others, I found that setting up a virtual environment and installing the dependencies with uv was much faster than standard pip. I am attaching a TOML file with the Linux dependencies for use with uv in case anyone is interested. For some reason, GitHub did not let me upload an actual .toml file, so you will need to rename it and remove the .txt at the end. With uv, you basically just run uv init (new project name), place the LARS project in there, replace the .toml file, run uv sync, then cd to the web_app directory and run uv run app.py.

There was one part where I had to run "export CXX=g++" when syncing dependencies with uv.
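
For anyone following along, the steps above amount to roughly the following. This is just a sketch: it assumes the uv project is named lars and that the attached pyproject.toml has been renamed and dropped in place.

```bash
# Create a new uv-managed project and move the LARS sources into it
uv init lars
cd lars
# (copy the LARS repo contents in here, then replace the generated pyproject.toml
#  with the attached one after renaming it from pyproject.toml.txt)

# Needed on Fedora so native extensions build with g++
export CXX=g++

# Resolve and install all dependencies into the project's virtual environment
uv sync

# Launch the LARS web app
cd web_app
uv run app.py
```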

Thanks again.

pyproject.toml.txt

abgulati commented 1 week ago

Thank you so much for the update and for your excellent contribution @jessemcg !

It's contributions like these that encourage open-source work 🍻

Very curious to hear of your experience: how is LARS running? Is everything working okay?

Sincere thanks again!

jessemcg commented 1 week ago

Everything is working very well. I appreciate how it automatically detects if a llama.cpp server is already running on port 8080, then just uses that if it is. This keeps my VRAM from filling up. My use case is legal transcripts, and the default RAG pipeline is creating very quick and accurate responses. Some of the highlights don't always make sense, but it is still a useful feature. I haven't had time to experiment with the more detailed settings, but I am glad they are there. Great job.
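
For anyone curious, that reuse behaviour boils down to something like the following check. This is a rough sketch, not LARS's actual code: it simply probes port 8080 and reuses whatever llama.cpp server is already listening there.

```bash
# Rough sketch (not LARS's actual code): probe port 8080 and reuse any
# llama.cpp server that is already listening there instead of starting a new one.
if curl --silent --output /dev/null http://localhost:8080/; then
    echo "llama.cpp server already running on port 8080 - reusing it"
else
    echo "nothing listening on port 8080 - a fresh llama.cpp server would be launched"
fi
```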

abgulati commented 1 week ago

That's fantastic to hear, thanks @jessemcg !

Do give HF-Waitress a spin too; it makes running new models off the hub very easy: just copy the model_id and click Add! This way, you can run new models as soon as they're out, without waiting for llama.cpp and GGUF support.

🍻

abgulati commented 1 day ago

@jessemcg Thank you so much for your generous donation today!! Truly, I'm extremely grateful & humbled.

I'd love to give you access to my LARS-Enterprise private repository, which, amongst a host of UI and QoL updates (including deletion and renaming of chats in the sidebar), contains the following major feature updates:

  1. Support for the Llama3.2-Vision LLM to visually analyse documents in any format in a live chat: separate from uploading docs to the VectorDB for RAG, you can attach a document/image in any format directly to your prompt and have Llama3.2-Vision look at it and respond. I've also implemented a full conversational flow for this model, so you can upload files for visual analysis, ask follow-up questions and even use RAG, all with a single instance of the Vision LLM!
  2. Image generation: FLUX is now supported!
  3. The RAG-Citations pipeline has been significantly improved: BM25 indexing is introduced into LARS to augment the existing embeddings + re-ranking pipeline.

And it’s all still 100% local: no data for any of the above leaves your machine! Multi-modal capabilities have been made possible entirely thanks to the flexibility of my HF-Waitress LLM server.

Do let me know if you're interested and I'll add you to my private repo!

Thanks so much once again!

jessemcg commented 20 hours ago

I would love to check out the enterprise version, thank you. I am very impressed with how you are making everything cross-platform. It seems like so much work.

You probably already know this, but lawyers will be a good fit for using LARS, particularly the ones that handle appeals like myself. When I started doing appeals 15 years ago, it was so tedious, but with all of the tools available now (like LARS), it is much more enjoyable. Of course, the vast majority of lawyers are not tech-savvy, so a lot of stuff is still too complicated for them at the moment. Thanks again for your hard work.

abgulati commented 16 hours ago

Jesse, it’s fantastic to make your acquaintance and hear of your extensive background in law! Absolutely, the legal field has very much been a huge area of interest for LARS, and I’ve been on the lookout for collaborators in the space. I’m working with partners in the accounting domain and having conversations in the medical space, and I’ve really been looking forward to collaborating in the legal space, so the timing of this connection couldn’t have been better!

Also thrilled to hear you’d like to try out LARS-Enterprise; I’ll set up your access. The transition should be seamless and simple, and I’m happy to help resolve any hiccups along the way.

All my contact information is in my signature below, so let’s continue this conversation beyond this issue thread!

Best regards, Abheek Gulati | (437) 556-9998 | https://www.linkedin.com/in/abheek-gulati



abgulati commented 11 hours ago

Hi @jessemcg ,

Unfortunately, GitHub does not let me manage access rights in a secure enough fashion when adding collaborators to a private repository. If you're okay with it, I can reach out to you on your mobile number/LinkedIn, where we can exchange emails and I can share a GDrive link (or we can work out any other preferred method) to LARS-Enterprise v2.7; the complete feature set is below. Do let me know, thank you!

Llama3.2-Vision document analysis:
  1. Attach files in any format directly to the user-input field for analysis via Llama3.2-Vision LLMs
  2. Full conversation flow with memory: ask follow-up questions!
  3. RAG-enabled: use the same Vision LLM for the usual RAG queries over the existing VectorDB + BM25-Index + re-ranker pipeline!

Image generation:
  1. Fully local image generation using the black-forest-labs/FLUX.1 dev & schnell models
  2. Includes options for CPU-offloading and quantization (under 16GB VRAM!)
  3. Requires transformers v4.44.0 (below 4.45), which means modifying the hf_waitress.py code to remove the MllamaForConditionalGeneration import from transformers; optimum-quanto is required for quantization (see the pip sketch after this list)

RAG-Citations improvements:
  1. Whoosh-backed BM25F index created alongside the VectorDB chunks when a document is uploaded
  2. Both are queried at response time and the results re-ranked for relevance
  3. Pending: separate indexes for each VectorDB; presently one index is shared by all, so clearing one clears the entire index. This limitation will be addressed in a near-future update!

Code quality:
  1. Modularized several functions for improved code reuse, readability and maintainability
  2. Split the JS into 5 separate files for much improved maintainability and readability
  3. Removed various global variables; the front-end JS is now completely stateless
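
A quick note on the FLUX requirements mentioned in the image-generation item above: the pins would look roughly like the following. This is only a sketch; the exact version bounds may differ.

```bash
# Sketch of the version pins described above (exact bounds may differ):
# transformers must stay at 4.44.x (below 4.45), optimum-quanto is needed for quantization
pip install "transformers>=4.44.0,<4.45" optimum-quanto
```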