Opinionated RAG for integrating GenAI in your apps 🧠 Focus on your product rather than the RAG. Easy integration in existing products with customisation! Any LLM: GPT4, Groq, Llama. Any Vectorstore: PGVector, Faiss. Any Files. Any way you want.
This pull request introduces the SpaCyProcessor class to handle various text file types (PDF, DOCX, TXT, and CSV) and perform NLP processing using spaCy. This addition includes:
Key Features:
File Extraction: Extracts text asynchronously from PDFs using fitz (PyMuPDF) and from DOCX files via python-docx, and also handles TXT and CSV files.
NLP Processing: Integrates spaCy's NLP pipeline for entity recognition and sentence tokenization, adding metadata on entities and sentences in each document chunk.
Document Chunking: Uses RecursiveCharacterTextSplitter to divide documents into chunks of bounded size with a specified overlap between consecutive chunks.
Error Handling and Logging: Provides robust logging for extraction errors and validation checks, improving traceability.
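For reviewers, the extract, chunk, annotate flow described above can be sketched roughly as follows. This is not the SpaCyProcessor code from this PR: the names `extract_text`, `chunk_text`, `annotate`, and `EXTRACTORS` are hypothetical, only TXT and CSV are covered so the sketch runs on the standard library alone, and `chunk_text` is a plain fixed-size window with overlap standing in for RecursiveCharacterTextSplitter. `annotate` assumes `nlp` is an already loaded spaCy pipeline that sets sentence boundaries.

```python
import asyncio
import csv
from pathlib import Path


def _read_txt(path: Path) -> str:
    return path.read_text(encoding="utf-8")


def _read_csv(path: Path) -> str:
    # Flatten each CSV row into one comma-separated line of text.
    with path.open(newline="", encoding="utf-8") as fh:
        return "\n".join(", ".join(row) for row in csv.reader(fh))


# Hypothetical registry keyed by file extension. PDF (PyMuPDF) and DOCX
# (python-docx) readers would be registered here in the same way.
EXTRACTORS = {".txt": _read_txt, ".csv": _read_csv}


async def extract_text(path: Path) -> str:
    extractor = EXTRACTORS.get(path.suffix.lower())
    if extractor is None:
        raise ValueError(f"Unsupported file type: {path.suffix!r}")
    # Run the blocking read in a worker thread so the event loop stays free.
    return await asyncio.to_thread(extractor, path)


def chunk_text(text: str, chunk_size: int = 200, overlap: int = 20) -> list[str]:
    """Simplified stand-in for RecursiveCharacterTextSplitter:
    fixed-size character windows that share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
        start += step
    return chunks


def annotate(chunk: str, nlp) -> dict:
    """Attach entity and sentence metadata to one chunk.

    Assumption: `nlp` is a loaded spaCy pipeline (for example the result
    of spacy.load(...)) whose components set sentence boundaries.
    """
    doc = nlp(chunk)
    return {
        "text": chunk,
        "entities": [(ent.text, ent.label_) for ent in doc.ents],
        "sentences": [sent.text for sent in doc.sents],
    }
```

A caller would typically run `text = asyncio.run(extract_text(Path("notes.txt")))`, then build the metadata with `[annotate(c, nlp) for c in chunk_text(text)]`. Keeping the spaCy pipeline as a parameter means the model is loaded once and reused across chunks.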
Motivation:
This feature adds spaCy NLP processing to enable richer text analysis across file types. The processor extracts text from PDF, DOCX, TXT, and CSV files, splits it into chunks, and annotates each chunk with entities and sentences, making the resulting document data easier to work with in downstream applications.
Checklist before requesting a review
Please delete options that are not relevant.
[ ] My code follows the style guidelines of this project
[ ] I have performed a self-review of my code
[ ] I have commented hard-to-understand areas
[ ] New and existing unit tests pass locally with my changes