Start date 22.08.2024
This project aims to develop an easy-to-use automated system for semantic search through files. The two main scenarios it targets are personal semantic search (searching for information in books, documents, and other files) and serving as an interface for advanced grounding methods in Large Language Model (LLM) projects, like Retrieval Augmented Generation (RAG) or LoRA/QLoRA (Low-Rank Adaptation). As the main goal is availability, the system will support various ways of interaction: a GUI, a CLI built with Click, a REST API built with FastAPI, and gRPC with protobuf.
The Context Search functionality searches the given directories for documents and reads them for further processing. When a scanned document occurs in the set, Tesseract 5 OCR is used to extract the text from the image. The data is then transformed, preprocessed, and submitted to the Neo4j database. During the retrieval phase, the exact chunk of text is returned together with its similarity score.
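To make the retrieval step more concrete, here is a minimal sketch of the "return the chunk together with its score" idea. It is not Ragger's implementation: the function names (chunk_text, embed, cosine_similarity, retrieve) are hypothetical, the word-count "embedding" is a toy stand-in for a real embedding model, and an in-memory list stands in for the Neo4j database.

```python
import math
import re
from collections import Counter


def chunk_text(text: str, max_words: int = 40) -> list[str]:
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(text: str) -> Counter:
    """Toy embedding (assumption): lowercase word counts instead of a real model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors; 0.0 if either is empty."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query: str, chunks: list[str], top_k: int = 1) -> list[tuple[str, float]]:
    """Return the top_k chunks most similar to the query, each with its score."""
    query_vec = embed(query)
    scored = [(chunk, cosine_similarity(query_vec, embed(chunk))) for chunk in chunks]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


if __name__ == "__main__":
    document = (
        "Neo4j is a graph database that stores nodes and relationships. "
        "Tesseract is an OCR engine that extracts text from scanned images."
    )
    chunks = chunk_text(document, max_words=10)
    for chunk, score in retrieve("which tool extracts text from images", chunks):
        print(f"{score:.3f}  {chunk}")
```

In a real setup the embeddings would come from a model and the comparison would run inside the database, but the shape of the result (a chunk plus a similarity score) stays the same.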
First graph created with automatic generation: three AI-generated articles (pink nodes), text chunks with embeddings (orange nodes), and keywords/tags with embeddings (blue nodes).
Python 3.10+
Neo4j Database
Tesseract OCR (for OCR functionality)
Available:
Windows -> https://github.com/UB-Mannheim/tesseract/wiki
Linux -> https://tesseract-ocr.github.io/tessdoc/Installation.html
Poppler (for OCR)
Available:
Docs -> https://poppler.freedesktop.org/
The project is not yet mature enough to be published on PyPI, so the preferred way of running Ragger is to clone the repository, run pip install . from the cloned directory, and modify the config to match your system. If Tesseract and Poppler are on your PATH, Ragger will detect them automatically.
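Before editing the config, it can help to check whether the OCR tools are visible on the PATH. The snippet below is a small sketch using only the standard library; the helper names (find_ocr_tools, missing_tools) are hypothetical, and the executable names are an assumption: tesseract is the Tesseract CLI and pdftoppm is one of the utilities shipped with Poppler. It does not reproduce how Ragger itself performs the lookup.

```python
import shutil


def find_ocr_tools(names=("tesseract", "pdftoppm")) -> dict:
    """Map each executable name to its full path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


def missing_tools(found: dict) -> list:
    """Return the names of the executables that were not found."""
    return [name for name, path in found.items() if path is None]


if __name__ == "__main__":
    found = find_ocr_tools()
    for name, path in found.items():
        print(f"{name}: {path or 'not found on PATH'}")
    if missing_tools(found):
        print("Set the paths to the missing tools in the config file.")
```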
Alternatively, use the provided docker-compose files, which are ready to use without any changes to the config file.
Refer to examples/example_submit.py for how to upload files to the database and to examples/example_retrieve.py for how to retrieve data. An extensive tutorial is in progress.
Please hold off on contributions until the first major release. Feel free to fork the project and start a discussion if you want to. Always happy to hear some voices of reason!
This project is licensed under the GPL-3.0 License. See the LICENSE file for details.
For any questions or suggestions, please contact me directly, open an issue, or e-mail me at r2.acumen@gmail.com.