District-Administration-Varanasi / document-chatbot


Government Document Chatbot: Streamlining Access and Assistance #2

Open amit-s19 opened 3 months ago

amit-s19 commented 3 months ago

Ticket Contents

Goal

Create a bot that answers user questions using a retrieval-augmented generation (RAG) framework over government data extracted from PDFs.

Description

The project aims to develop a chatbot capable of retrieving relevant information from government documents, including both officially typed and scanned documents in Hindi and English. Users should be able to ask questions, and the bot will extract and present cohesive, accurate answers from the PDFs.

Goals & Mid-Point Milestone

Goals

Technical Tasks

  1. Data Collection and Integration

    • [ ] Collect government data from sources such as upvidhai.gov.in/Act.aspx and shasanadesh.up.gov.in.
    • [ ] Integrate collected data into the chatbot's knowledge base for retrieval.
  2. Language Processing Capability

    • [ ] Develop algorithms for parsing English text from the text layer of PDFs.
    • [ ] Implement OCR algorithms for extracting Hindi text from scanned PDFs.
  3. Natural Language Understanding (NLU)

    • [ ] Implement NLU techniques to understand and interpret user queries accurately.
    • [ ] Develop algorithms to search for relevant content based on user queries.
  4. Content Structuring and Storage

    • [ ] Structure extracted text into cohesive chunks for efficient storage and retrieval.
    • [ ] Store structured content in a database for easy access and management.
  5. Multi-Language Support

    • [ ] Develop translation algorithms to support multiple languages, including Hindi and English.
    • [ ] Ensure seamless translation of content to meet user language preferences.
  6. LLM Integration and Training

    • [ ] Integrate a large language model (LLM) for generating cohesive answers.
    • [ ] Train the LLM using relevant datasets to align with the style of government documents.
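
As an illustration of task 4 (structuring extracted text into chunks), here is a minimal sketch, not a prescribed implementation. It packs paragraphs greedily into chunks and falls back to overlapping character windows for paragraphs that are too long. The function name `chunk_text` and the `max_chars` and `overlap` defaults are assumptions chosen for the example.

```python
def chunk_text(text: str, max_chars: int = 500, overlap: int = 50) -> list[str]:
    """Split text into chunks of at most max_chars characters.

    Paragraphs (separated by blank lines) are kept together where possible.
    A paragraph longer than max_chars is cut into overlapping windows.
    """
    if overlap >= max_chars:
        raise ValueError("overlap must be smaller than max_chars")

    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current = ""

    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars:
            # The paragraph still fits into the chunk being built.
            current = candidate
            continue

        # The paragraph does not fit: close the chunk built so far.
        if current:
            chunks.append(current)
            current = ""

        if len(para) <= max_chars:
            current = para
        else:
            # Oversized paragraph: slide a window with some overlap.
            step = max_chars - overlap
            start = 0
            while start < len(para):
                chunks.append(para[start:start + max_chars])
                if start + max_chars >= len(para):
                    break
                start += step

    if current:
        chunks.append(current)
    return chunks
```

Character-based limits are only a rough proxy for the token limits of an embedding model, so the sizes would need tuning against whichever model is chosen.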

Setup/Installation

No response

Expected Outcome

Acceptance Criteria

No response

Implementation Details

The implementation involves:

Mockups/Wireframes

No response

Product Name

Government Document Chatbot

Organisation Name

SamagraX

Domain

Service Delivery

Tech Skills Needed

JavaScript, Machine Learning, Node.js, Python

Mentor(s)

@ChakshuGautam

Category

Machine Learning

New-dev0 commented 2 months ago

Can LangChain with OpenAI be used? It already implements various LLMs and services for parsing PDFs and text documents.

ItshMoh commented 2 months ago

Hello @amit-s19, I am a student at IIT BHU and passionate about generative AI. I have previous experience with LLMs: I worked as an ML developer intern at TEXTR-AI, a startup for SEO automation, where I designed the chatbot that handles their customer queries, wrote prompts, and integrated security features to prevent prompt injection. I have also worked with frameworks such as PyTorch, TensorFlow, and LangChain.

I have built a chatbot similar to this project, where a user can upload a PDF, ask questions about it by voice message, and get the answer as both chat text and voice. Here is the link to the repo. It uses PyPDF as the PDF parser, RecursiveCharacterTextSplitter to split the text into chunks, Google Generative AI embeddings to convert the chunks into embeddings, FAISS (faiss-cpu) as the vector store, Whisper for audio transcription, gTTS for text-to-speech, and LangChain to access all of these tools.

Based on that experience and some research, here is the tech that could be used:

  1. For collecting the data, we can use web scrapers.
  2. For the OCR task, we can use Indic TrOCR, which is meant for local languages.
  3. For parsing the text, we can use PyPDF, PyMuPDF, PDFMiner, etc.
  4. If the text contains Hindi, it will be translated to English using the Google Translate API or Hugging Face transformer models. The translated text will be stored.
  5. For splitting the text into chunks, we can use text splitters. Many are available in LangChain, such as RecursiveCharacterTextSplitter.
  6. For converting the chunks into embeddings, we can use OpenAI embeddings or Google Generative AI embeddings. We can also use an open-source embedding model available on Hugging Face.
  7. For storing the embeddings, we can use a vector store such as FAISS (faiss-cpu), Chroma DB, or Qdrant. FAISS is a very good choice for handling these embeddings.
  8. For the LLM, we would use Gemini or OpenAI models. We can also use open-source models such as Llama 2 7B, Mistral 7B, or Gemma, depending on our computational resources.
  9. The normal workflow of the bot would be: the user asks a query, and a similarity search runs against the vector DB. The top k chunks most relevant to the query are picked and collected together. The LLM operating under the hood uses the selected chunks as context and answers based on that context.
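
The workflow in step 9 can be sketched in a few lines. This is a toy illustration only: a bag-of-words `Counter` stands in for a real embedding model, a sorted list stands in for a vector store such as FAISS, and the function names (`embed`, `top_k`, `build_prompt`) are made up for the example. The simple `\w+` tokenizer is English-oriented and would not split Hindi text correctly.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


def build_prompt(query: str, context_chunks: list[str]) -> str:
    """Assemble the text that would be sent to the LLM."""
    context = "\n---\n".join(context_chunks)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )


chunks = [
    "The land revenue act governs land records.",
    "Pension rules for retired employees.",
    "School admission guidelines.",
]
question = "pension rules employees"
prompt = build_prompt(question, top_k(question, chunks, k=1))
```

In a real pipeline, `embed` would call the chosen embedding model and `top_k` would be a query against the vector store; the prompt assembly step stays much the same.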

I am open to suggestions in the comments. I am looking forward to collaborating with amazing folks and contributing to this project. Mohan Kumar mlwishperer1@gmail.com

Shashankss1205 commented 2 months ago

Dear @amit-s19 ,

I am Shashank Shekhar Singh, a sophomore at Indian Institute of Technology (BHU), Varanasi, India. I'm thrilled to express my keen interest in contributing to the development of the Government Document Chatbot project. With my diverse skill set and extensive experience in machine learning, natural language processing (NLP), and web development, I believe I can make significant contributions to this endeavor.

Having worked on projects like NLP-based Question Answering System for Tabular Data and Visual Question Answering System, I have hands-on experience in developing systems that involve parsing and understanding textual information.

Additionally, my project on Dark Pattern Detection with Transformer-based Models, which made me a finalist for Round 3 of the DPBH hackathon conducted by the Government of India in 2024, demonstrates my proficiency in leveraging advanced machine learning techniques for text analysis and classification, which could be invaluable for identifying relevant content within government documents.

Furthermore, my involvement in projects such as Text Detection and Recognition with CRAFT and NLP-based Coding Automation with OpenAI GPT-3.5 showcases my expertise in handling text extraction from various sources and integrating cutting-edge NLP models into practical applications.

Online courses like the Machine Learning Specialization by Andrew Ng and HTML-CSS-JAVASCRIPT by Free Code Camp have equipped me with a solid foundation in both technical and theoretical aspects relevant to this project.

Moreover, my achievements in national-level competitions and my active participation in various technical clubs and organizations demonstrate my passion for technology and my ability to thrive in collaborative environments.

I am genuinely excited about the prospect of contributing to SamagraX's mission of streamlining access to government information through innovative technology solutions. I look forward to working with the team and leveraging my skills to drive the success of this project.

Best regards, Shashank Shekhar Singh

AkanshuAich commented 2 months ago

Hi @amit-s19,

I am Akanshu Aich, a third-year student at the International Institute of Information Technology, Bhubaneswar. I am writing to express my interest in contributing to this project as a part of DSP 2024. Having thoroughly reviewed the project, I am impressed by its objectives and its potential for great impact.

With my background in backend development using Django and MERN, along with hands-on practice in machine learning, I believe I can make valuable contributions to both the backend and the frontend. My experience includes several projects, such as a Society-Expenditure Manager using Django, a Real Estate app using MERN, and an Info-Finding Tool using machine learning (LLM), which I believe align well with the goals of your project.

I am particularly interested in fulfilling the requirements of the project and have some ideas on how to approach it effectively. I am committed to adhering to best practices, contributing high-quality code, and actively collaborating with the project maintainers and community.

I am excited about the opportunity to contribute to Government Document Chatbot by SamagraX and help further its mission. I look forward to discussing potential contributions and how I can best support the project.

Please guide me on the procedure, drawing on your knowledge and experience.

Srik9703 commented 2 months ago

Hello @amit-s19 ,

I am a student at VNR VJIET with a fervent interest in GEN-AI and a proven track record with Large Language Models (LLMs). During my internship at Acuration, a forward-thinking startup specializing in AI-driven collaboration solutions, I had the opportunity to work extensively with their flagship LLM, Acuration IQ. My role involved data collection through web scraping, data cleaning, and the development of innovative tools such as a chatbot capable of answering queries related to uploaded PDFs and a comprehensive financial analysis tool.

One of my notable projects involved creating a PDF-based chatbot that enabled users to ask questions and receive accurate responses. This endeavor required the integration of various technologies, including PyPDF for PDF parsing, RecursiveCharacterTextSplitter for text segmentation, and Google Generative AI embeddings for converting text into embeddings. To efficiently store and retrieve embeddings, I used ChromaDB, a vector database. Additionally, I incorporated the Google Translate API and Hugging Face transformer models for language translation.

This experience has equipped me with a strong foundation in AI technologies and a deep understanding of how LLMs can be harnessed to address real-world challenges effectively. I am excited about the prospect of contributing my skills and expertise to projects that push the boundaries of AI innovation.

I am enthusiastic about the opportunity to contribute to this project and believe that my skills and experiences align well with the requirements. Best regards, Konakalla Srija.

Aditya-132 commented 2 months ago

I am interested in working with you.

ChakshuGautam commented 2 months ago

@Aditya-132 @Srik9703 @AkanshuAich @Shashankss1205 @ItshMoh

Comments on a ticket are reserved for discussion of the ticket itself. Please don't share your resume, your intent to work, or anything else that doesn't progress the solution to the problem statement. The perfect response to a ticket is either a PR or some breakthrough that leads to the solution.

I would recommend breaking this problem down into individual sections, taking a crack at each, and raising a PR for it. If it were me, I would create a new ticket for scraping (point 1 - Collect government data from sources such as upvidhai.gov.in/Act.aspx and shasanadesh.up.gov.in.) and raise a PR that downloads all documents. The second step would be to raise a PR on parsing those documents into text, and so on...

For now, I am hiding all comments.
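
As a starting point for the scraping step suggested above, here is a minimal sketch that pulls PDF links out of a page's HTML using only the Python standard library. It assumes the documents are exposed as ordinary `<a href="...pdf">` links, which has not been verified for either site (ASPX pages may serve files through form postbacks instead). `PdfLinkParser` and `extract_pdf_links` are names invented for the example, and fetching and downloading are left out.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class PdfLinkParser(HTMLParser):
    """Collect absolute URLs of links whose path ends in .pdf."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative links against the page URL.
        absolute = urljoin(self.base_url, href)
        is_pdf = urlparse(absolute).path.lower().endswith(".pdf")
        if is_pdf and absolute not in self.links:
            self.links.append(absolute)


def extract_pdf_links(html: str, base_url: str) -> list[str]:
    """Return de-duplicated PDF URLs found in the HTML, in page order."""
    parser = PdfLinkParser(base_url)
    parser.feed(html)
    return parser.links


sample_html = """
<a href="docs/a.pdf">Act A</a>
<a href="/files/B.PDF?x=1">Order B</a>
<a href="about.html">About</a>
<a href="docs/a.pdf">Act A again</a>
"""
links = extract_pdf_links(sample_html, "https://example.org/acts/index.html")
```

Each URL in `links` could then be downloaded with `urllib.request` or a similar client, ideally with rate limiting so the government servers are not overloaded.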

ChakshuGautam commented 2 months ago

Can LangChain with OpenAI be used? It already implements various LLMs and services for parsing PDFs and text documents.

LangChain is fine. (Sorry, I incorrectly marked this as spam.)

SarveshAtawane commented 2 months ago

Hello @ChakshuGautam, I wanted to update you on the progress I've made with the document extraction and processing project. I recently submitted a pull request that covers extracting documents from the websites. After obtaining the documents, I applied OCR to one of them to extract its text (this needs some more work) and translated the text into English. I had previously developed a question answering (QA) application using the open-source Mistral model, which is on my GitHub. I took the extracted and translated document and integrated it into my existing RAG application. I have attached the original document and the converted document, along with screenshots of some sample queries and a reference summary obtained from Gemini. I would appreciate your feedback on the approach, the PR, and the results.

Attachments: output1, output2, Summary by gemini, Original_document.pdf, transalted_Document.pdf
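
One small piece of an OCR-and-translate step like this is deciding which extracted text is Hindi and therefore needs translation. The sketch below is an assumption about how that routing could be done, not what the PR does: it measures the share of characters in the Devanagari Unicode block (U+0900 to U+097F). The function names and the 0.5 threshold are illustrative.

```python
def _is_devanagari(ch: str) -> bool:
    """True if the character lies in the Devanagari Unicode block."""
    return "\u0900" <= ch <= "\u097f"


def devanagari_ratio(text: str) -> float:
    """Fraction of letter-like characters that are Devanagari."""
    # Count alphabetic characters plus Devanagari signs (vowel marks are
    # not alphabetic according to str.isalpha, so include them explicitly).
    letters = [ch for ch in text if ch.isalpha() or _is_devanagari(ch)]
    if not letters:
        return 0.0
    return sum(1 for ch in letters if _is_devanagari(ch)) / len(letters)


def needs_translation(text: str, threshold: float = 0.5) -> bool:
    """Route text to Hindi-to-English translation if it is mostly Devanagari."""
    return devanagari_ratio(text) >= threshold
```

Mixed-language pages would need a finer-grained check, for example running the same test per paragraph instead of per page.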

shashi-sah2003 commented 2 months ago

Hello @ChakshuGautam, I wanted to update you on my progress on this project. I have used BeautifulSoup to scrape all the PDFs from the two government websites, https://upvidhai.gov.in and https://shasanadesh.up.gov.in/, and, as you suggested, I have also created a PR. I would love to hear your suggestions. Thank you.

Azazel0203 commented 2 months ago

Hello @amit-s19 | @ChakshuGautam,

I want to provide an update on the current status of my take on this project. I have implemented the foundational components, which include:

  1. Extracting text from stored PDFs.
  2. Translating text from Hindi to English.
  3. Storing data in an unstructured manner in a vector database using Pinecone.
  4. Developing a pipeline to ingest PDFs into the vector database.
  5. Creating a basic chat frontend.
  6. Utilizing Gemini to retrieve context data from the database and formulate responses with built-in memory functionality.

As part of this update, I have attached images showcasing the chat interface for your reference.

Example 1 Example 2 Example 3

Additionally, I have created a repository of my implementation, accessible here.

Moving forward, I seek your guidance on the next steps for the project. Specifically, I am considering the following options:

  1. Implementing a scheduled scraper to continuously expand the database with new data.
  2. Exploring the possibility of caching chat history for each user to enhance user experience and improve response accuracy.
  3. Or adding language translation of the responses.
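
For option 1, a scheduled scraper mainly needs a way to skip documents it has already ingested. Here is a minimal sketch of one possible approach, hashing file contents with SHA-256. The names `content_hash` and `select_new_documents` are hypothetical, and the in-memory `set` stands in for whatever persistent store the pipeline would actually use.

```python
import hashlib


def content_hash(data: bytes) -> str:
    """SHA-256 hex digest of a document's raw bytes."""
    return hashlib.sha256(data).hexdigest()


def select_new_documents(
    fetched: dict[str, bytes], seen_hashes: set[str]
) -> list[str]:
    """Return URLs whose content has not been ingested before.

    seen_hashes is updated in place so the next scheduled run skips
    everything returned here, even if it reappears under a new URL.
    """
    new_urls = []
    for url, data in fetched.items():
        digest = content_hash(data)
        if digest in seen_hashes:
            continue
        seen_hashes.add(digest)
        new_urls.append(url)
    return new_urls
```

Hashing content instead of comparing URLs also catches the same PDF being published at more than one address, at the cost of downloading each file before deciding whether to ingest it.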

Your feedback and further guidance on prioritizing these options or suggesting alternative paths would be greatly appreciated.

Thank you.

RSN601KRI commented 2 months ago

I hope you're doing well. I wanted to express my keen interest in contributing to the Government Data Chatbot project. After reviewing the project details, I'm excited about the opportunity to work on tasks such as data collection, natural language processing, and content structuring. I believe my skills in [relevant skills or experience] make me well-suited for this project. I'm eager to collaborate with the team, learn from your guidance, and explore ways to enhance the chatbot's functionality. @ChakshuGautam Please let me know how I can formally join the project team and start contributing.

Looking forward to your response.

AbhimanyuSamagra commented 2 months ago

Do not ask process-related questions about how to apply or whom to contact in the above ticket. The only questions allowed are about technical aspects of the project itself. If you want help with the process, you can refer to the instructions listed on Unstop, and any further queries can be taken up on our Discord channel titled DMP queries. Here's a Video Tutorial on how to submit a proposal for a project.