Welcome to the Chatbot - ML Repository!
This repository contains a machine learning project designed to process PDF documents, extract text, split it into smaller chunks, generate embeddings using Google’s Generative AI, and store them in a FAISS vector store for fast retrieval. The system enables question answering based on the contents of the document.
PDF Text Extraction:
PDF text extraction is fully functional using PyMuPDF.
Text Splitting:
Text splitting with Langchain's RecursiveCharacterTextSplitter operates seamlessly.
Embedding Generation:
Integration with Google’s Generative AI API for embedding generation is fully operational.
Vector Storage & Retrieval:
Embeddings are stored in a FAISS vector store, enabling fast similarity-based retrieval.
Question Answering:
Basic question-answering using Langchain’s QA chain is successfully working.
Enhanced Error Handling:
Added error checks for invalid PDF inputs and missing API keys to improve reliability.
Improved Code Modularity:
Refactored code into modular components to simplify maintenance and future expansion.
Unit Testing:
Initial unit tests for key functionalities (e.g., PDF extraction and text splitting) have been implemented; a sample test sketch follows this list.
Performance Optimizations:
Integrated Intel’s scikit-learn extension (scikit-learn-intelex) to boost performance.
Documentation and Comments:
Expanded inline comments and documentation have been added to facilitate onboarding and troubleshooting.
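A minimal pytest-style sketch of such a unit test, exercising the text splitter (the repository's actual tests may be organized differently):
from langchain.text_splitter import RecursiveCharacterTextSplitter
def test_split_text_respects_chunk_size():
    # Split a long string and check that the configured size limit is respected
    splitter = RecursiveCharacterTextSplitter(chunk_size=100, chunk_overlap=0)
    chunks = splitter.split_text("word " * 200)
    assert chunks, "expected at least one chunk"
    assert all(len(chunk) <= 100 for chunk in chunks)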
The project’s main functionalities are structured into several distinct components:
PDF Text Extraction
Purpose:
Load a PDF file and extract its text content.
Implementation:
Utilizes PyMuPDF (fitz) to open a PDF and iterate through each page, concatenating the text.
Example Code:
import fitz  # PyMuPDF
pdf = fitz.open("document.pdf")
text = "\n".join(page.get_text() for page in pdf)
pdf.close()
Text Splitting
Purpose:
Divide large text documents into smaller, more manageable chunks to enhance embedding generation.
Implementation:
Uses Langchain's RecursiveCharacterTextSplitter with configurable chunk size and overlap.
Example Code:
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
chunks = text_splitter.split_text(text)
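Here, chunk_size caps each chunk at 1000 characters, while chunk_overlap controls how many characters neighboring chunks share to preserve context across boundaries (0 means no overlap).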
Embedding Generation
Purpose:
Transform text chunks into numerical embeddings that represent semantic content.
Implementation:
Integrates with Google’s Generative AI via Langchain's Google Generative AI embeddings module.
Example Code:
from langchain_google_genai import GoogleGenerativeAIEmbeddings
from langchain_community.vectorstores import FAISS
embeddings = GoogleGenerativeAIEmbeddings(model="models/embedding-001", google_api_key="YOUR_API_KEY")
vector_store = FAISS.from_texts(chunks, embeddings)
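Note that FAISS.from_texts embeds every chunk with the supplied embedding model and indexes the results in a single call, so this line also performs the storage step described in the next component.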
Vector Storage & Retrieval
Purpose:
Efficiently store the generated embeddings and retrieve them based on similarity.
Implementation:
Utilizes FAISS as the vector store, enabling quick similarity-based searches.
Functionality:
When a user poses a query, the system compares the query’s embedding to those stored, retrieving the most relevant document sections.
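Example Code:
A minimal retrieval sketch, assuming the vector_store created in the embedding step above:
docs = vector_store.similarity_search("Your question here", k=4)  # returns the 4 most similar chunks as Document objects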
Question Answering
Purpose:
Provide answers to user queries by leveraging the stored embeddings and document content.
Implementation:
Employs Langchain’s question-answering chain (e.g., the map_reduce chain type) to generate answers.
Example Code:
from langchain.chains.question_answering import load_qa_chain
from langchain_google_genai import ChatGoogleGenerativeAI
qa_chain = load_qa_chain(ChatGoogleGenerativeAI(model="gemini-pro"), chain_type="map_reduce")
# docs: the relevant Document chunks, e.g. from vector_store.similarity_search(query)
result = qa_chain.run(input_documents=docs, question="Your question here")
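With chain_type="map_reduce", the chain answers the question against each retrieved chunk separately and then combines the intermediate answers, which keeps each individual prompt within the model's context window even for long documents.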
To set up the project, follow these steps:
Clone the repository:
git clone https://github.com/Tech-Society-SEC/Chatbot_ML.git
Navigate to the project directory:
cd Chatbot_ML
Install the necessary libraries:
pip install scikit-learn-intelex pymupdf langchain-google-genai langchain-community python-dotenv faiss-cpu
Mount Google Drive (if needed):
from google.colab import drive
drive.mount('/content/drive')
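Once mounted, PDFs stored on Drive can be opened by path (the path below is illustrative; adjust it to your Drive layout):
import fitz  # PyMuPDF
pdf = fitz.open("/content/drive/MyDrive/document.pdf")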
Configure the API Key:
Create a .env file containing your Google API key (GOOGLE_API_KEY=your_key_here), then load it:
import os
from dotenv import load_dotenv
load_dotenv()
api_key = os.getenv('GOOGLE_API_KEY')
Optimize scikit-learn:
from sklearnex import patch_sklearn
patch_sklearn()  # must be called before importing scikit-learn modules
We welcome contributions! Here are some beginner-friendly tasks:
Improve Documentation:
Enhance inline comments and expand the README with more examples and troubleshooting tips.
Add Unit Tests:
Develop tests for functionalities like text extraction, text splitting, and embedding generation.
Enhance Error Handling:
Implement checks and meaningful error messages for cases such as invalid PDFs or absent API keys (see the sketch after this list).
Create a Simple CLI:
Build a command-line interface for loading PDFs, submitting queries, and displaying results.
Optimize Chunk Size:
Experiment with different text chunk sizes to find the optimal balance for embedding quality and performance.
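As a starting point for the error-handling task above, here is a minimal sketch; the function name and messages are illustrative, not the project's existing API:
import os
import fitz  # PyMuPDF

def load_pdf_text(path):
    # Fail early with a clear message for missing files
    if not os.path.isfile(path):
        raise FileNotFoundError(f"PDF not found: {path}")
    try:
        pdf = fitz.open(path)
    except Exception as exc:  # PyMuPDF raises if the file is corrupt or not a PDF
        raise ValueError(f"Could not open {path} as a PDF: {exc}") from exc
    text = "\n".join(page.get_text() for page in pdf)
    pdf.close()
    return text

# Likewise, check for the API key before calling the embedding service
if not os.getenv("GOOGLE_API_KEY"):
    raise EnvironmentError("GOOGLE_API_KEY is not set; add it to your .env file.")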
For more details and to access the code, visit the GitHub Repository: https://github.com/Tech-Society-SEC/Chatbot_ML