ArturOle / ContextSearch

Semantic search tool that will take care of all the technicalities for you! (WIP)
GNU General Public License v3.0

ContextSearch - from raw files to efficient Semantic Search (WIP)

Start date 22.08.2024

Overview

This project aims to develop an easy-to-use, automated system for semantic search over files. It targets two main scenarios: personal semantic search (finding information in books, documents, and other files), and serving as an interface for advanced grounding methods in Large Language Model (LLM) projects, like Retrieval Augmented Generation (RAG) or LoRA/QLoRA (Low-Rank Adaptation). Because the main goal is availability, the system will support several ways of interacting with it: a GUI, a CLI built with Click, a REST API built with FastAPI, and gRPC with protobuf.

The Context Search functionality scans the given directories for documents and reads them for further processing. When a scanned document occurs in the set, Tesseract 5 OCR is used to extract the text from the image. The data is then transformed, preprocessed, and submitted to the Neo4j database. During the retrieval phase, the exact chunk of text is returned together with its similarity score.

Figure: First graph created with automatic generation. It shows three AI-generated articles (pink nodes), text chunks with embeddings (orange nodes), and keywords/tags with embeddings (blue nodes).
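To make the chunk-and-retrieve flow described above more concrete, here is a minimal, standard-library-only sketch. It is not the project's implementation: the function names (chunk_text, embed, cosine_similarity, retrieve) are hypothetical, the "embedding" is a toy bag-of-words vector standing in for a real embedding model, and the chunks live in a plain list rather than in Neo4j. It only illustrates the idea of returning a text chunk together with its similarity score.

```python
import math
import re
from collections import Counter


def chunk_text(text: str, max_words: int = 50) -> list[str]:
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors (0.0 if either is empty)."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query: str, chunks: list[str], top_k: int = 3) -> list[tuple[str, float]]:
    """Return the top_k chunks most similar to the query, with their scores."""
    query_vector = embed(query)
    scored = [(chunk, cosine_similarity(query_vector, embed(chunk))) for chunk in chunks]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


if __name__ == "__main__":
    # Hypothetical document text, used only for illustration.
    document = (
        "Graph databases store data as nodes and relationships. "
        "Optical character recognition extracts text from scanned images. "
        "Semantic search ranks text chunks by similarity to a query."
    )
    chunks = chunk_text(document, max_words=9)
    for chunk, score in retrieve("extract text from scanned images", chunks, top_k=1):
        print(f"{score:.3f}  {chunk}")
```

In a real setup the toy embed function would be replaced by an actual embedding model and the list of chunks by a database query; the ranking step stays conceptually the same.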

Features

Future Plans

Getting Started

Prerequisites

Build the project

The project is not yet mature enough to be published on PyPI, so the preferred way of running Ragger is to clone the repository, run pip install . and modify the config to match your system. If tesseract and poppler are on your PATH, Ragger will find them by itself.
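Before editing the config, it can help to check whether the external tools are actually visible on your PATH. The snippet below is a small sketch using only the Python standard library; it is not how Ragger itself locates the tools. The helper name find_external_tools is hypothetical, and using pdftoppm (one of the command-line utilities shipped with poppler) as the probe for poppler is an assumption.

```python
import shutil
from typing import Optional


def find_external_tools(executables: dict[str, str]) -> dict[str, Optional[str]]:
    """Map each tool name to the full path of its executable, or None if not on PATH."""
    return {name: shutil.which(executable) for name, executable in executables.items()}


if __name__ == "__main__":
    # Assumption: poppler is detected through its pdftoppm utility.
    tools = find_external_tools({"tesseract": "tesseract", "poppler": "pdftoppm"})
    for name, path in tools.items():
        print(f"{name}: {path or 'not found on PATH'}")
```

If a tool is reported as not found, either add its install directory to PATH or point to it explicitly in the config.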

Alternative build (Docker)

Use the provided docker-compose files, which are ready to use without any changes to the config file.

Run semantic search

Refer to examples/example_submit.py for how to upload files to the database and to examples/example_retrieve.py for how to retrieve data. An extensive tutorial is in progress.

Contributing

Please hold off on contributing until the first major release. Feel free to fork the project and start a discussion if you want to. Always happy to hear some voices of reason!

License

This project is licensed under the GPL-3.0 License. See the LICENSE file for details.

Contact

For any questions or suggestions, please approach me directly, open an issue, or e-mail me at r2.acumen@gmail.com.