abgulati / LARS

An application for running LLMs locally on your device, with your documents, facilitating detailed citations in generated responses.
https://www.youtube.com/watch?v=Mam1i86n8sU&ab_channel=AbheekGulati
GNU Affero General Public License v3.0

LARS - The LLM & Advanced Referencing Solution

LARS is an application that enables you to run LLMs (Large Language Models) locally on your device, upload your own documents, and engage in conversations in which the LLM grounds its responses in your uploaded content. This grounding helps increase accuracy and reduce the common issue of AI-generated inaccuracies, or "hallucinations". This technique is commonly known as "Retrieval Augmented Generation", or RAG.

There are many desktop applications for running LLMs locally, and LARS aims to be the ultimate open-source RAG-centric LLM application. Towards this end, LARS takes the concept of RAG much further by adding detailed citations to every response, supplying you with specific document names, page numbers, text-highlighting, and images relevant to your question, and even presenting a document reader right within the response window. While not every type of citation is present in every response, the aim is for each RAG response to include at least some combination of citations, and that is generally found to be the case.

Here's a list detailing LARS's feature-set as it stands today:

  1. Advanced Citations: The main showcase feature of LARS - LLM-generated responses are appended with detailed citations comprising document names, page numbers, text highlighting and image extraction for any RAG-centric responses, with a document reader presented so the user can scroll through the document right within the response window and download highlighted PDFs
  2. Vast number of supported file-formats:
    • PDFs
    • Word files: doc, docx, odt, rtf, txt
    • Excel files: xls, xlsx, ods, csv
    • PowerPoint presentations: ppt, pptx, odp
    • Image files: bmp, gif, jpg, png, svg, tiff
    • Rich Text Format (RTF)
    • HTML files
  3. Conversation memory: Users can ask follow-up questions, including in prior conversations
  4. Full chat-history: Users can go back and resume prior conversations
  5. Users can force enable or disable RAG at any time via Settings
  6. Users can change the system prompt at any time via Settings
  7. Drag-and-drop in new LLMs - change LLMs via Settings at any time
  8. Built-in prompt-templates for the most popular LLMs and then some: Llama3, Llama2, ChatML, Phi3, Command-R, Deepseek Coder, Vicuna and OpenChat-3.5
  9. Pure llama.cpp backend - No frameworks, no Python-bindings, no abstractions - just pure llama.cpp! Upgrade to newer versions of llama.cpp independent of LARS
  10. GPU-accelerated inferencing: Nvidia CUDA-accelerated inferencing supported
  11. Tweak advanced LLM settings - Change LLM temperature, top-k, top-p, min-p, n-keep, set the number of model layers to be offloaded to the GPU, and enable or disable the use of GPUs, all via Settings at any time
  12. Four embedding models - sentence-transformers/all-mpnet-base-v2, BGE-Base, BGE-Large, OpenAI Text-Ada
  13. Sources UI - A table is displayed for the selected embedding model detailing the documents that have been uploaded to LARS, including vectorization details such as chunk_size and chunk_overlap
  14. A reset button is provided to empty and reset the vectorDB
  15. Three text extraction methods: a purely local text-extraction option and two OCR options via Azure for better accuracy and scanned document support - Azure ComputerVision OCR offers an always-free tier
  16. A custom parser for the Azure AI Document-Intelligence OCR service for enhanced table-data extraction while preventing double-text by accounting for the spatial coordinates of the extracted text
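
Item 13 above mentions chunk_size and chunk_overlap. As a rough illustration of what those two numbers mean, here is a minimal character-based splitter. It is a sketch only: the function name and default values are hypothetical, and LARS may split text differently (for example by tokens or sentences).

```python
def chunk_text(text, chunk_size=1000, chunk_overlap=100):
    """Split text into windows of chunk_size characters.

    Consecutive windows share chunk_overlap characters, so a sentence cut
    at a boundary still appears whole in one of the two chunks.
    The defaults are illustrative, not values taken from LARS.
    """
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap
    # Stop early enough that the final chunk is never just a repeat of
    # the previous chunk's overlap region.
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]


if __name__ == "__main__":
    for piece in chunk_text("abcdefghij", chunk_size=4, chunk_overlap=2):
        print(piece)
```

A larger overlap gives the retriever more context around chunk boundaries, at the cost of storing more vectors per document.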

A demonstration video showcasing these features can be viewed at the link below:

LARS Feature-Demonstration Video

Table of Contents

  1. LARS - The LLM & Advanced Referencing Solution
  2. Dependencies
  3. Installing LARS
  4. Usage - First Run
  5. Optional Dependencies
  6. Troubleshooting Installation Issues
  7. First Run with llama.cpp
  8. General User Guide - Post First-Run Steps
  9. Troubleshooting
  10. Docker - Deploying Containerized LARS
  11. Current Development Roadmap
  12. Support and Donations

Dependencies

  1. Python v3.10.x or above: https://www.python.org/downloads/

  2. PyTorch:

    If you're planning to use your GPU to run LLMs, make sure to install the GPU drivers and CUDA/ROCm toolkits as appropriate for your setup, and only then proceed with PyTorch setup below

    Download and install the PyTorch version appropriate for your system: https://pytorch.org/get-started/locally/
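
To confirm both dependencies before installing LARS, a small check such as the one below may help. It is a sketch rather than part of LARS: the function names are made up for illustration, and it only uses `sys.version_info` and PyTorch's `torch.cuda.is_available()`.

```python
import sys

MIN_PYTHON = (3, 10)  # minimum version from the Dependencies list


def python_ok(version_info=sys.version_info, minimum=MIN_PYTHON):
    """Return True when the interpreter meets the minimum version."""
    return tuple(version_info[:2]) >= minimum


def torch_status():
    """Describe the PyTorch install without failing if it is absent."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed"
    if torch.cuda.is_available():
        return f"PyTorch {torch.__version__} with GPU support"
    return f"PyTorch {torch.__version__} (CPU only)"


if __name__ == "__main__":
    print("Python OK:", python_ok())
    print(torch_status())
```

If the second line reports a CPU-only install on a machine with a supported GPU, the usual cause is a PyTorch build that was installed before, or without, the matching GPU toolkit.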

Installing LARS

  1. Clone the repository:

    git clone https://github.com/abgulati/LARS
    cd LARS
    • If prompted for GitHub authentication, use a Personal Access Token, as password authentication is deprecated. Tokens can be created via:
      GitHub Settings -> Developer settings (located on the bottom left!) -> Personal access tokens
  2. Install Python dependencies:

    • Windows via PIP:

      pip install -r .\requirements.txt
    • Linux via PIP:

      pip3 install -r ./requirements.txt
    • Note on Azure: Some required Azure libraries are NOT available on the MacOS platform! A separate requirements file is therefore included for MacOS excluding these libraries:

    • MacOS:

      pip3 install -r ./requirements_mac.txt

Back to Table of Contents

Usage - First Run

Back to Table of Contents

Optional Dependencies

llama.cpp - Installation Instructions:

1. Build Tools:

2. llama.cpp:

Nvidia CUDA (if supported Nvidia GPU present):

LibreOffice:

Poppler:

PyTesseract (optional):

Back to Table of Contents

Troubleshooting Installation Issues

Python Issues:

  1. Remove version numbers:

    • If a specific package version causes an error, edit the corresponding requirements.txt file to remove the version constraint, that is the ==version.number segment, for example:
      urllib3==2.0.4
      becomes simply:
      urllib3
  2. Create and use a Python virtual environment:

    • It's advisable to use a virtual environment to avoid conflicts with other Python projects

    • Windows:

      • Create a Python virtual environment (venv):

        python -m venv larsenv
      • Activate, and subsequently use, the venv:

        .\larsenv\Scripts\activate
      • Deactivate venv when done:

        deactivate
    • Linux and MacOS:

      • Create a Python virtual environment (venv):

        python3 -m venv larsenv
      • Activate, and subsequently use, the venv:

        source larsenv/bin/activate
      • Deactivate venv when done:

        deactivate
  3. If problems persist, consider opening an issue on the LARS GitHub repository for support.
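
The version-pin removal described in step 1 can also be scripted when several packages fail. The helper below is a hypothetical sketch, not something shipped with LARS: it strips the `==version` segment from a requirements line, optionally only for the package names you list (in lowercase).

```python
import re

# Matches an exact pin such as "==2.0.4", stopping before any
# environment marker (";") or trailing comment ("#").
PIN = re.compile(r"\s*==\s*[^\s;#]+")


def unpin(line, packages=None):
    """Drop the '==version' part of one requirements line.

    If `packages` is given (a set of lowercase names), only those
    packages are unpinned and every other line is returned unchanged.
    """
    name = re.split(r"[=<>!~\[;\s]", line.strip(), maxsplit=1)[0]
    if packages is not None and name.lower() not in packages:
        return line
    return PIN.sub("", line, count=1)


if __name__ == "__main__":
    print(unpin("urllib3==2.0.4"))
```

To apply it to a whole file, read `requirements.txt` line by line, pass each line through `unpin`, and write the result to a copy so the original pins remain available for comparison.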

Other Issues:

If a llama.cpp build fails because CMake cannot find the nmake tool, this typically indicates an issue with your Microsoft Visual Studio build tools, of which nmake is a part. Try the steps below to resolve the issue:

  1. Ensure Visual Studio Build Tools are Installed:

    • Make sure you have the Visual Studio build tools installed, including nmake. You can install these tools through the Visual Studio Installer by selecting the Desktop development with C++ workload, along with the MSVC and C++ CMake optional components

    • Check Step 0 of the Dependencies section, specifically the screenshot therein

  2. Check Environment Variables:

    • Ensure that the paths to the Visual Studio tools are included in your system's PATH environment variable. Typically, this includes paths like:
      C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Auxiliary\Build
      C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\Common7\IDE
      C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\Common7\Tools
  3. Use Developer Command Prompt:

    • Open a "Developer Command Prompt for Visual Studio" which sets up the necessary environment variables for you

    • You can find this prompt from the Start menu under Visual Studio

  4. Set CMake Generator:

    • When running CMake, specify the generator explicitly to use NMake Makefiles. You can do this by adding the -G option:
      cmake -G "NMake Makefiles" -B build -DLLAMA_CUDA=ON
  5. If problems persist, consider opening an issue on the LARS GitHub repository for support.
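
Before re-running the build, it can help to check whether the tools named in the steps above are visible on PATH from your current shell. A minimal sketch using Python's standard `shutil.which` follows; the function name and the default tool list are illustrative choices.

```python
import shutil


def missing_tools(tools=("cmake", "nmake")):
    """Return the tools from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All build tools found")
```

Running this once in a regular terminal and once in a Developer Command Prompt shows whether the latter is what makes nmake available.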

First Run with llama.cpp

Back to Table of Contents

General User Guide - Post First-Run Steps

  1. Document Formats Supported:

    • If LibreOffice is installed and added to PATH as detailed in Step 4 of the Dependencies section, the following formats are supported:

      • PDFs
      • Word files: doc, docx, odt, rtf, txt
      • Excel files: xls, xlsx, ods, csv
      • PowerPoint presentations: ppt, pptx, odp
      • Image files: bmp, gif, jpg, png, svg, tiff
      • Rich Text Format (RTF)
      • HTML files
    • If LibreOffice is not set up, only PDFs are supported
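
The rule above (PDFs always, everything else only with LibreOffice) can be expressed as a small pre-upload check. This is a sketch under assumptions: how LARS itself detects LibreOffice is not described here, so the lookup of the `soffice` and `libreoffice` launchers on PATH and the helper names are illustrative.

```python
import shutil
from pathlib import Path

# Extensions copied from the format list above.
ALWAYS_SUPPORTED = {"pdf"}
NEEDS_LIBREOFFICE = {
    "doc", "docx", "odt", "rtf", "txt",
    "xls", "xlsx", "ods", "csv",
    "ppt", "pptx", "odp",
    "bmp", "gif", "jpg", "png", "svg", "tiff",
    "html",
}


def libreoffice_on_path():
    """True if a LibreOffice launcher can be found on PATH."""
    return any(shutil.which(name) for name in ("soffice", "libreoffice"))


def is_supported(filename, libreoffice_found=None):
    """Check a file name against the formats available on this machine."""
    if libreoffice_found is None:
        libreoffice_found = libreoffice_on_path()
    extension = Path(filename).suffix.lower().lstrip(".")
    allowed = ALWAYS_SUPPORTED | (NEEDS_LIBREOFFICE if libreoffice_found else set())
    return extension in allowed


if __name__ == "__main__":
    print("LibreOffice found:", libreoffice_on_path())
    print("report.docx supported:", is_supported("report.docx"))
```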

  2. OCR Options for Text Extraction:

    • LARS provides three methods for extracting text from documents, accommodating various document types and quality:

      • Local Text Extraction: Uses PyPDF2 for efficient text extraction from non-scanned PDFs. Ideal for quick processing when high accuracy is not critical, or entirely local processing is a necessity.

      • Azure ComputerVision OCR: Enhances text extraction accuracy and supports scanned documents. Useful for handling standard document layouts. Offers a free tier suitable for initial trials and low-volume use, capped at 5000 transactions/month at 20 transactions/minute.

      • Azure AI Document Intelligence OCR: Best for documents with complex structures like tables. A custom parser in LARS optimizes the extraction process.

      • NOTES:

        • Azure OCR options incur API-costs in most cases and are not bundled with LARS.

        • A limited free tier for ComputerVision OCR is available, as noted above. This service is cheaper overall but slower, and may not work for non-standard page layouts (sizes other than A4 and the like).

        • Consider the document types and your accuracy needs when selecting an OCR option.
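
The free-tier limits quoted above work out to one transaction every 3 seconds (60 / 20) and about 250 minutes of continuous use to exhaust the monthly 5000. If you script your own calls against such a limit, a client-side throttle along the following lines keeps you under it. This is a hypothetical sketch: the class is not part of LARS or any Azure SDK, it does not reset the monthly counter, and Azure's own accounting remains authoritative.

```python
import time


class FreeTierBudget:
    """Client-side guard for a per-minute and per-month transaction cap."""

    def __init__(self, per_minute=20, per_month=5000,
                 clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 60.0 / per_minute  # seconds between calls
        self.remaining = per_month
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous transaction

    def acquire(self):
        """Wait until the next call is allowed, then count it."""
        if self.remaining <= 0:
            raise RuntimeError("monthly free-tier quota used up")
        now = self._clock()
        if self._last is not None:
            wait = self.min_interval - (now - self._last)
            if wait > 0:
                self._sleep(wait)
                now = self._clock()
        self._last = now
        self.remaining -= 1


if __name__ == "__main__":
    budget = FreeTierBudget()
    budget.acquire()  # call this before each OCR request
    print("Transactions left this month:", budget.remaining)
```

The `clock` and `sleep` parameters exist only so the behaviour can be tested without real waiting.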

  3. LLMs:

    • Only local-LLMs are presently supported

    • The Settings menu provides many options for the power-user to configure and change the LLM via the LLM Selection tab

    • Very important when using llama.cpp: select the appropriate prompt-template format for the LLM you're running

      • LLMs trained for the following prompt-template formats are presently supported via llama.cpp:

        • Meta Llama-3
        • Meta Llama-2
        • Mistral & Mixtral MoE LLMs
        • Microsoft Phi-3
        • OpenHermes-2.5-Mistral
        • Nous-Capybara
        • OpenChat-3.5
        • Cohere Command-R and Command-R+
        • DeepSeek Coder
    • Tweak Core-configuration settings via Advanced Settings (triggers LLM-reload and page-refresh):

      • Number of layers offloaded to the GPU
      • Context-size of the LLM
      • Maximum number of tokens to be generated per response
    • Tweak settings to change response behavior at any time:

      • Temperature – randomness of the response
      • Top-p – Limit to a subset of tokens with a cumulative probability above P
      • Min-p – Minimum probability for considering a token, relative to the probability of the most likely token
      • Top-k – Limit to K most probable tokens
      • N-keep – Prompt-tokens retained when context-size exceeded (-1 to retain all)
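
To make the response-behavior settings above more concrete, the sketch below applies temperature, top-k, top-p and min-p to a toy set of token scores. It is illustrative only: the function names are made up, the order of the filters is chosen for readability, and it does not reproduce llama.cpp's actual sampler chain.

```python
import math


def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities.

    A temperature below 1 sharpens the distribution and a temperature
    above 1 flattens it. Temperature must be greater than 0 here.
    """
    scaled = [score / temperature for score in logits.values()]
    peak = max(scaled)  # subtract the peak for numerical stability
    exps = [math.exp(score - peak) for score in scaled]
    total = sum(exps)
    return {tok: e / total for tok, e in zip(logits, exps)}


def filter_tokens(probs, top_k=0, top_p=1.0, min_p=0.0):
    """Apply top-k, top-p and min-p in turn, then renormalise."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    if top_k > 0:  # keep only the K most probable tokens
        ranked = ranked[:top_k]
    if top_p < 1.0:  # keep tokens until their cumulative probability reaches P
        kept, cumulative = [], 0.0
        for tok, p in ranked:
            kept.append((tok, p))
            cumulative += p
            if cumulative >= top_p:
                break
        ranked = kept
    if min_p > 0.0:  # drop tokens far less likely than the best one
        floor = min_p * ranked[0][1]
        ranked = [(tok, p) for tok, p in ranked if p >= floor]
    total = sum(p for _, p in ranked)
    return {tok: p / total for tok, p in ranked}


if __name__ == "__main__":
    probs = softmax({"cat": 2.0, "dog": 1.0, "eel": 0.0}, temperature=0.8)
    print(filter_tokens(probs, top_k=2))
```

The next token is then drawn at random from whatever survives the filters, which is why tighter settings give more predictable output.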
  4. Embedding models and Vector Database:

    • Four embedding models are provided in LARS:

      • sentence-transformers/all-mpnet-base-v2 (default)
      • bge-base-en-v1.5
      • bge-large-en-v1.5 (highest MTEB ranked model available in LARS)
      • Azure-OpenAI Text-Ada (incurs API cost, not bundled with LARS)
    • With the exception of the Azure-OpenAI embeddings, all other models run entirely locally and for free. On first run, these models will be downloaded from the HuggingFace Hub. This is a one-time download and they'll subsequently be present locally.

    • The user may switch between these embedding models at any time via the VectorDB & Embedding Models tab in the Settings menu

    • Docs-Loaded Table: In the Settings menu, a table for the selected embedding model lists the documents embedded into the associated vector database. If a document is loaded multiple times, it’ll have multiple entries in this table, which can be useful for debugging.

    • Clearing the VectorDB: Use the Reset button and provide confirmation to clear the selected vector database. This creates a new vectorDB on-disk for the selected embedding model. The old vectorDB is still preserved and may be reverted to by manually modifying the config.json file.

  5. Edit System-Prompt:

    • The System-Prompt serves as an instruction to the LLM for the entire conversation

    • LARS provides the user with the ability to edit the System-Prompt via the Settings menu by selecting the Custom option from the dropdown in the System Prompt tab

    • Changes to the System-Prompt will start a new chat

  6. Force Enable/Disable RAG:

    • Via the Settings menu, the user may force enable or disable RAG (Retrieval Augmented Generation – the use of content from your documents to improve LLM-generated responses) whenever required

    • This is often useful for the purposes of evaluating LLM responses in both scenarios

    • Force disabling will also turn off attribution features

    • The default setting, which uses NLP to determine when RAG should and shouldn’t be performed, is the recommended option

    • This setting can be changed at any time

  7. Chat History:

    • Use the chat history menu on the top-left to browse and resume prior conversations

    • Very-Important: Be mindful of prompt-template mismatches when resuming prior conversations! Use the Information icon on the top-right to ensure the LLM used in the prior-conversation, and the LLM presently in use, are both based on the same prompt-template formats!

  8. User rating:

    • Each response may be rated on a 5-point scale by the user at any time

    • Ratings data is stored in the chat-history.db SQLite3 database located in the app directory:

      • Windows: C:/web_app_storage
      • Linux: /app/storage
      • MacOS: /app
    • Ratings data is very valuable for evaluation and refinement of the tool for your workflows
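
Because the ratings live in an ordinary SQLite3 file, they can be inspected with Python's built-in `sqlite3` module. The sketch below runs against an in-memory database with a made-up `ratings` table, since the real schema of chat-history.db is not documented here. Use `list_tables` on the real file first to discover the actual table and column names.

```python
import sqlite3


def list_tables(conn):
    """Names of all tables in the database, useful for finding the schema."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
    return [row[0] for row in rows]


def average_rating(conn, table="ratings", column="rating"):
    """Mean rating, or None when nothing has been rated yet.

    The table and column names are hypothetical placeholders.
    """
    row = conn.execute(f'SELECT AVG("{column}") FROM "{table}"').fetchone()
    return row[0]


# Self-contained demo on an in-memory database with an invented table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ratings (response_id INTEGER, rating INTEGER)")
conn.executemany("INSERT INTO ratings VALUES (?, ?)", [(1, 5), (2, 3), (3, 4)])
print(list_tables(conn))
print(average_rating(conn))
```

For the real data, replace `":memory:"` with the path to chat-history.db in the app directory for your platform, ideally on a copy of the file.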

  9. Dos and Don’ts:

    • Do NOT tweak any settings or submit additional queries while a response to a query is already being generated! Wait for any ongoing response generation to complete.

Back to Table of Contents

Troubleshooting

Back to Table of Contents

Docker - Deploying Containerized LARS

Background and Setup

Building & Running the CPU-Inferencing Container

Building & Running the Nvidia-CUDA GPU-Enabled Container

Special Note for Containers - Troubleshooting Networking Issues and Errors on First Run

Special Note for Containers - Updating the Container Image Post-First-Run

Back to Table of Contents

Current Development Roadmap

| Category | Tasks | Status |
|---|---|---|
| Bug fixes | Zero-Byte text-file creation hazard - Sometimes if OCR/Text-Extraction of the input document fails, a 0B .txt file may be left over which causes further retry attempts to believe the file has already been loaded | Future Task |
| Practical Features: Ease-of-use centric | Azure CV-OCR free-tier UI toggle | Done on 8th June 2024 |
| | Delete Chats | Future Task |
| | Rename Chats | Future Task |
| | PowerShell Installation Script | Future Task |
| | Linux Installation Script | Future Task |
| | Ollama LLM-inferencing backend as an alternative to llama.cpp | Future Task |
| | Integration of OCR services from other cloud providers (GCP, AWS, OCI, etc.) | Future Task |
| | UI toggle to ignore prior text-extracts when uploading a document | Future Task |
| | Modal-popup for file uploads: mirror text-extraction options from settings, global over-write on submissions, toggle to persist settings | Future Task |
| Practical Features: Performance-centric | Nvidia TensorRT-LLM AWQ Support | Future Task |
| Research Tasks | Investigate Nvidia TensorRT-LLM: Necessitates building AWQ-LLM TRT-engines specific to the target GPU, NvTensorRT-LLM is its own ecosystem and only works on Python v3.10. | Done on 13th June 2024 |
| | Local OCR with Vision LLMs: MS-TrOCR (done), Kosmos-2.5 (high priority), Llava, Florence-2 | In-Progress (5th July 2024 Update) |
| | RAG Improvements: Re-ranker, RAPTOR, T-RAG | Future Task |
| | Investigate GraphDB integration: using LLMs to extract entity-relationship data from documents and populate, update & maintain a GraphDB | Future Task |
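
Until the zero-byte text-file issue listed under bug fixes is resolved, leftover empty extracts can be found by hand. The sketch below lists 0-byte .txt files under a folder so they can be reviewed and deleted before retrying an upload. The function name is hypothetical, and which folder LARS writes its text extracts to is not stated here, so pass whichever directory applies to your setup.

```python
from pathlib import Path


def zero_byte_text_files(folder):
    """Return all empty .txt files under `folder`, sorted by path."""
    return sorted(
        path
        for path in Path(folder).rglob("*.txt")
        if path.is_file() and path.stat().st_size == 0
    )


if __name__ == "__main__":
    for path in zero_byte_text_files("."):
        print(path)
```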

Back to Table of Contents

Support and Donations

I hope that LARS has been valuable in your work, and I invite you to support its ongoing development! If you appreciate the tool and would like to contribute to its future enhancements, consider making a donation. Your support helps me to continue improving LARS and adding new features.

How to Donate

To make a donation, please use the following link to my PayPal:

Donate via PayPal

Your contributions are greatly appreciated and will be used to fund further development efforts.

Back to Table of Contents