Chatbot on Govt Documents

Gautam-Rajeev commented 8 months ago

Goal

Create a bot that answers users' questions about government data, using a RAG framework over content parsed from PDFs.

Description

We have a number of PDFs in Hindi and English. They include officially typed docs as well as scanned documents.
A user should be able to ask questions that can be answered from the content of the docs, and the bot should retrieve the relevant content from the PDFs and answer the question in a cohesive, accurate fashion.

Implementation details

This will involve:

Collecting data from sources like upvidhai.gov.in/Act.aspx and shasanadesh.up.gov.in.
Parsing documents and extracting English text (from the text layer of the PDFs).
Parsing documents using OCR and extracting Hindi text from PDFs.
Structuring the extracted text into sensible chunks and storing them in a DB.
Translating the text to the required languages.
Understanding natural language queries and searching for related content (possible through vector DBs and representations of the content).
Using an LLM to give a cohesive and relevant answer based on the question and the retrieved content.
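
As a rough illustration of the "chunk and store" step above, here is a minimal sketch using only the Python standard library. The function names (`chunk_text`, `store_chunks`), the table layout, and the chunk size and overlap values are all assumptions made for illustration; a real pipeline would likely split on sentence or section boundaries and use a vector DB rather than SQLite.

```python
import sqlite3


def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into fixed-size character windows that overlap.

    The sizes are illustrative defaults, not tuned values.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        piece = text[start:start + chunk_size]
        if piece.strip():
            chunks.append(piece)
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


def store_chunks(conn, source, chunks):
    """Store chunks with their source name and position (hypothetical schema)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS chunks "
        "(source TEXT, position INTEGER, body TEXT)"
    )
    conn.executemany(
        "INSERT INTO chunks VALUES (?, ?, ?)",
        [(source, i, chunk) for i, chunk in enumerate(chunks)],
    )
    conn.commit()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    sample = "Extracted text of one government order. " * 20
    pieces = chunk_text(sample)
    store_chunks(conn, "sample.pdf", pieces)
    print(len(pieces), "chunks stored")
```

The overlap means a sentence cut at a chunk boundary still appears whole in the neighbouring chunk, which tends to help retrieval.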

Students to be selected

2 students

Requirements

Python, PyTorch, Vector DBs, LLMs, NLP

sarthak13agr commented 8 months ago

Should we also add scraping of websites like https://www.upvidhai.gov.in/Act.aspx and https://shasanadesh.up.gov.in to the implementation details?

sarthak13agr commented 8 months ago

https://electiongpt.idinsight.io/auth/login ^ This is what we built with IDinsight for election work. I had something similar in mind for this project (if not better!).

NishantSatere commented 7 months ago

@GautamR-Samagra I have previously developed a chatbot trained on legal documents that includes the capability to interact with PDF files. Additionally, I integrated various vector databases and utilized PyTorch in my hackathon projects. I would like to discuss this project further.

ItshMoh commented 7 months ago

Hello @GautamR-Samagra, I am a student at IIT BHU. I am passionate about generative AI and its uses in making life easier. I have done projects in NLP and have past experience using LLMs. I did an internship at TEXTR- AI, a startup for SEO automation using AI. There, my work was to build a chatbot that answers customer queries, design prompts, and integrate security to prevent prompt injection. I have experience with frameworks like PyTorch, TensorFlow, and LangChain, and with libraries like OpenCV and Transformers.

I have made a similar chatbot for asking questions about a PDF, with voice input and voice output integrated. Here is the link to the repo. I used "FAISS-cpu" as the vector database, "Whisper" for audio transcription, "gTTS" for text-to-speech conversion, and "GoogleGenerativeAIEmbeddings" for converting the text into embeddings. I used LangChain as the framework for accessing these tools and hosted the app on Gradio.

From my past experience, LangChain would be very useful for this task, as it provides access to many tools. In my view, these are the technologies to use:

We can do the scraping with BeautifulSoup and other scrapers such as Selenium.
For OCR we can use TrOCR, a transformer-based OCR model. It gives very good results and is also fine-tuned for handwritten documents, so handwritten text in our scanned documents could be recognized as well. I have used it personally; it is open source and available on Hugging Face.
For reading and parsing the PDFs we can use libraries like PyPDF, PDFMiner, etc.
For splitting the text into chunks, we can use a text splitter. LangChain offers many of them, such as RecursiveCharacterTextSplitter. I have used them, and they gave very good results.
We would then convert the chunks into embeddings. We can use GoogleGenerativeAI embeddings, which are free, or OpenAI embeddings. We can also use other open-source embedding models available on Hugging Face.
We can use vector databases like Chroma DB, FAISS-cpu, Qdrant, etc. for storing the embeddings. FAISS-cpu would be a very good choice.
We can choose Gemini or OpenAI as the LLM. We can also use open-source LLMs like Mistral 7B, Llama 2 7B, Gemma, etc., depending on our computational resources.
The workflow would be: when a user submits a query, a similarity search runs against the vector DB and returns the top k documents (we choose k). The model then goes through these retrieved documents, which serve as the context for the LLM, and answers the query.
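
To make the retrieval step of this workflow concrete, here is a minimal sketch of top-k similarity search in plain Python. The `embed` function is a toy word-count stand-in for a real embedding model, and the names `embed`, `cosine`, and `top_k` are hypothetical; in practice the embeddings and the search would come from the embedding model and the vector DB (for example FAISS) mentioned above.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: lowercase word counts (stand-in for a real model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query, chunks, k=2):
    """Return the k chunks most similar to the query, best first."""
    query_vec = embed(query)
    ranked = sorted(
        chunks,
        key=lambda chunk: cosine(query_vec, embed(chunk)),
        reverse=True,
    )
    return ranked[:k]


if __name__ == "__main__":
    docs = [
        "land revenue act applies to rural land",
        "motor vehicle tax rules",
        "school admission guidelines",
    ]
    print(top_k("land revenue", docs, k=1))
```

The chunks returned by `top_k` would then be passed to the LLM as context. A word-count embedding only matches exact words, so it will not handle paraphrases or cross-language queries the way a trained embedding model can.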

I am open to any suggestions and discussions in the comments. I am looking forward to collaborating with amazing folks and contributing to this project.

Mohan Kumar mohan.kumar.min22@itbhu.ac.in

Shashankss1205 commented 7 months ago

Dear @GautamR-Samagra ,

I am Shashank Shekhar Singh, a sophomore at Indian Institute of Technology (BHU), Varanasi, India. I'm thrilled to express my keen interest in contributing to the development of the Government Document Chatbot project. With my diverse skill set and extensive experience in machine learning, natural language processing (NLP), and web development, I believe I can make significant contributions to this endeavor.

Having worked on projects like NLP-based Question Answering System for Tabular Data and Visual Question Answering System, I have hands-on experience in developing systems that involve parsing and understanding textual information.

Additionally, my project on Dark Pattern Detection with Transformer-based Models, which made me a finalist for Round 3 of the DPBH hackathon conducted by the Government of India in 2024, demonstrates my proficiency in leveraging advanced machine learning techniques for text analysis and classification, which could be invaluable for identifying relevant content within government documents.

Furthermore, my involvement in projects such as Text Detection and Recognition with CRAFT and NLP-based Coding Automation with OpenAI GPT-3.5 showcases my expertise in handling text extraction from various sources and integrating cutting-edge NLP models into practical applications.

Online courses like the Machine Learning Specialization by Andrew Ng and HTML-CSS-JAVASCRIPT by Free Code Camp have equipped me with a solid foundation in both technical and theoretical aspects relevant to this project.

Moreover, my achievements in national-level competitions and my active participation in various technical clubs and organizations demonstrate my passion for technology and my ability to thrive in collaborative environments.

I am genuinely excited about the prospect of contributing to the mission of streamlining access to government information through innovative technology solutions. I look forward to working with the team and leveraging my skills to drive the success of this project.

Best regards, Shashank Shekhar Singh

ItshMoh commented 7 months ago

Hello @GautamR-Samagra, I have implemented a small version of the tasks and uploaded it to this repo. I took an image of one page of a PDF that I downloaded from https://www.upvidhai.gov.in/Act.aspx. I used EasyOCR for text detection in Devanagari script, split the text using RecursiveCharacterTextSplitter from LangChain, and converted the chunks into embeddings with Google Generative AI embeddings. I stored the embeddings in FAISS-cpu. It works fine on Hindi, as it has been highly trained on Hindi texts. I have set a prompt template for the LLM; its instructions say to always answer according to the context and otherwise say "I don't know". I used Gemini Pro as the LLM. I ran a query related to the document, and the model gave the correct answer.
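
For readers following along, a prompt template of the kind described here might look roughly like the sketch below. The exact wording, the `PROMPT_TEMPLATE` constant, and the `build_prompt` helper are assumptions for illustration; they are not the template used in the repo.

```python
# Hypothetical template: answer only from the retrieved context,
# otherwise fall back to "I don't know".
PROMPT_TEMPLATE = """Answer the question using only the context below.
If the answer is not in the context, reply exactly: I don't know.

Context:
{context}

Question: {question}
Answer:"""


def build_prompt(question, retrieved_chunks):
    """Fill the template with the retrieved chunks and the user question."""
    context = "\n\n".join(chunk.strip() for chunk in retrieved_chunks)
    return PROMPT_TEMPLATE.format(context=context, question=question.strip())


if __name__ == "__main__":
    print(build_prompt(
        "Which act covers land revenue?",
        ["Chunk one of the retrieved text.", "Chunk two of the retrieved text."],
    ))
```

The resulting string would be sent to the LLM as-is. Keeping the fallback phrase fixed makes it easy to detect unanswered questions downstream.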

My next goal is to write a function that scrapes whole documents from https://www.upvidhai.gov.in/Act.aspx and https://shasanadesh.up.gov.in/ and converts each page into an image that can be fed into OCR, after which the remaining steps proceed as above. The other main issue will be splitting the text, as it plays an important role. I am trying other splitters that can understand Indic-language semantics much better. I will update you on my progress.
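
One possible starting point for the scraping function is sketched below: it collects PDF links from a page's HTML using only the standard library. It assumes the documents are exposed as plain `<a href="...pdf">` links, which may not hold for these sites (they may rely on form postbacks or scripts), and the names `PdfLinkParser` and `extract_pdf_links` are hypothetical. Downloading the files and converting pages to images are left out.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PdfLinkParser(HTMLParser):
    """Collect absolute URLs of anchor tags that point at PDF files."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.pdf_links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Ignore any query string when checking the file extension.
        if href and href.lower().split("?")[0].endswith(".pdf"):
            self.pdf_links.append(urljoin(self.base_url, href))


def extract_pdf_links(html, base_url):
    """Return all PDF links found in the given HTML, in page order."""
    parser = PdfLinkParser(base_url)
    parser.feed(html)
    return parser.pdf_links


if __name__ == "__main__":
    page = '<a href="/docs/act1.pdf">Act 1</a> <a href="about.html">About</a>'
    print(extract_pdf_links(page, "https://example.org/list/"))
```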

What is your review of this? Am I going in the right direction? If not, please guide me. I am open to suggestions in the comments. 🙌

Mohan Kumar mohan.kumar.min22@itbhu.ac.in

Adii2202 commented 7 months ago

Hello @GautamR-Samagra,

I am Aditya Ningule from SPIT Mumbai, currently in my third year. I am deeply interested in open source projects and hackathons, and I have a track record of winning several hackathons. My experience spans working with LLMs, AI models, NLP, and RAG. Recently, I developed a voice-to-voice RAG model, which I think best suits your requirements, that allows conversational interaction using a PDF as input. You can find the project at this link: RAG-AI-Voice-assistant. Additionally, I have experience working with vector databases such as Qdrant. To explain the project in more detail, I have included block diagrams. Have a look at my GitHub profile: Adii2202

Considering your need for a chatbot for government documents, I'd like to highlight my past work on chatbots, which I have developed during numerous hackathons.

I believe that my experience and skills could significantly contribute to your project. I am eager to discuss this opportunity further.

Looking forward to the possibility of collaborating with you.

Regards, Aditya Ningule

mrNitesh14 commented 7 months ago

Hi @GautamR-Samagra ,

My name is Nitesh Rathod, a third-year student at SPIT Mumbai. I'm passionate about contributing to open-source projects and have a strong interest in hackathons.

I'm proficient in various areas, including Large Language Models (LLMs), AI models, and Natural Language Processing (NLP). I recently developed a chatbot application designed to assist doctors in retrieving patient reports. This chatbot utilizes Optical Character Recognition (OCR) technology to extract data from PDFs and images, which is then used to facilitate information retrieval through a chatbot interface.

I came across your project seeking a chatbot for government documents, and I would love to contribute to it.

I've attached my GitHub profile for your reference: mrNitesh14: https://github.com/mrNitesh14.

I'm confident that my skills and experience could significantly contribute to your project. I'd be eager to discuss this opportunity further and explore potential collaboration possibilities.

Thank you for your time and consideration.

Sincerely,

Nitesh Rathod

vipul6042 commented 7 months ago

Hello @GautamR-Samagra, I am a student at IIT BHU. I am excited about LLMs and have done projects exploring their power. I have experience with OCR tools like Tesseract and EasyOCR, with LangChain, LlamaIndex, and RAG, and with vector databases like Pinecone and Qdrant. I have started working on this project and am trying to build a prototype. I have done the OCR and scraping parts and am now working on the prompts for the model and on storing the embeddings. I will update you on my progress.

Vipul vipulkumar2426@gmail.com

jhshreya commented 6 months ago

Hey Admin, I can resolve this issue. Please assign it to me.

District-Administration-Varanasi / document-chatbot