ai4eic / EIC-RAG-Project

A RAG based Chatbot for the Electron Ion Collider
https://rags4eic-ai4eic.streamlit.app/

Extend the vector database. #2

Open karthik18495 opened 1 month ago

karthik18495 commented 1 month ago

GitHub Issue: Extend Vector Database with Public and Indico Pages for EIC Information

Issue Title:
Extend the Vector Database to Include Information from Public and Indico Pages for EIC


Description:

The Electron-Ion Collider (EIC) project would benefit from expanding the existing vector database to include data from public and Indico pages. This will enhance the system's ability to retrieve relevant documents and presentations for users. By incorporating these sources, we can provide a more comprehensive dataset for retrieval and improve the contextual quality of the responses.

This issue proposes:

Use Case:

Users can query the extended vector database to retrieve specific EIC-related documents, presentations, or meeting notes, allowing them to discover both internal and public information from Indico and other public sources.


Tasks:

  1. Data Collection:

    • Set up scripts to fetch data from Indico using its API, extracting event details, presentations, and associated documents.
    • Scrape public pages relevant to EIC (e.g., documentation pages, wikis) for documents, presentations, and other useful content.
  2. Preprocessing:

    • Convert documents from multiple formats (PDFs, Word, HTML) into plain text, using libraries like PyMuPDF or pdfminer for PDFs and equivalent parsers for Word and HTML.
    • Apply NLP preprocessing steps: tokenization, stop-word removal, lemmatization.
  3. Vectorization:

    • Use the existing transformer-based model (e.g., text-embedding-ada-002) to generate vector embeddings for the text data.
    • Ensure that each embedding is stored with metadata, such as title, source (Indico/public), and date.
  4. Indexing in Vector Database:

    • Update the vector database schema to include these new data sources.
    • Insert the new embeddings and metadata into the Pinecone database.
  5. Testing and Validation:

    • Test queries to ensure proper retrieval of relevant documents from both internal and newly added public sources.
    • Validate accuracy and relevance of the results to ensure the system is functioning as expected.
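
Tasks 2 to 4 above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the project's ingestion code: `chunk_text`, `build_records`, the chunk sizes, and the `doc_id` naming are hypothetical, and `embed` is a placeholder for whatever embedding call the project actually uses. The records are shaped as `id` / `values` / `metadata` dictionaries, which is the general form a Pinecone upsert accepts.

```python
from typing import Callable, Dict, List


def chunk_text(text: str, max_chars: int = 1000, overlap: int = 100) -> List[str]:
    """Split text into overlapping character windows (sizes are assumptions)."""
    if overlap >= max_chars:
        raise ValueError("overlap must be smaller than max_chars")
    # Collapse whitespace runs left over from PDF/HTML extraction.
    text = " ".join(text.split())
    if not text:
        return []
    chunks = []
    start = 0
    while True:
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break
        start += max_chars - overlap
    return chunks


def build_records(
    doc_id: str,
    text: str,
    title: str,
    source: str,
    date: str,
    embed: Callable[[str], List[float]],
) -> List[Dict]:
    """Pair each chunk's embedding with the metadata named in task 3."""
    records = []
    for i, chunk in enumerate(chunk_text(text)):
        records.append(
            {
                "id": f"{doc_id}-{i}",  # hypothetical ID scheme: document id + chunk index
                "values": embed(chunk),
                "metadata": {
                    "title": title,
                    "source": source,  # e.g. "indico" or "public"
                    "date": date,
                    "text": chunk,
                },
            }
        )
    return records
```

Keeping `embed` as a parameter means the chunking and metadata logic can be tested without network access or API keys.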

Proposed Code Changes:


References:


Priority:

Medium - Enhancing the vector database with these sources will greatly improve the overall retrieval quality and allow users to access a broader range of documents and presentations.

panta-123 commented 1 month ago

I am interested in working on the Indico part.

panta-123 commented 1 month ago

Indico API doc: https://docs.getindico.io/en/stable/http-api/

To get info about an event:

curl --header "Authorization: Bearer <token>" 'https://indico.bnl.gov/export/event/24778.json?detail=contributions&pretty=yes'
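
The same request could be issued from Python during ingestion. A minimal sketch using only the standard library; `build_event_request` and `fetch_event` are hypothetical helper names, and the token is assumed to be an Indico API token with read access:

```python
import json
import urllib.request
from urllib.parse import urlencode

INDICO_BASE = "https://indico.bnl.gov"


def build_event_request(event_id, token, detail="contributions"):
    """Build the export request for one event without sending it."""
    query = urlencode({"detail": detail})
    url = f"{INDICO_BASE}/export/event/{event_id}.json?{query}"
    # Header name and value are separate; no quotes inside the header itself.
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})


def fetch_event(event_id, token):
    """Send the request and return the decoded JSON (needs network access)."""
    request = build_event_request(event_id, token)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)
```

Splitting request construction from the network call keeps the URL and header logic easy to check in isolation.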

Example output is shown below. From it we can get the PDFs/slides uploaded to the event, if any exist, as well as the meeting notes. The file format can vary (e.g., someone may attach a pptx), so this should be handled. The event number can serve as the unique key for documents indexed into the vector DB. The category number can be used to link documents that belong to the same working group's meetings.

{
    "count":1,
    "additionalInfo":{},
    "ts":1729866972,
    "url":"https:\/\/indico.bnl.gov\/export\/event\/24778.json?detail=contributions&pretty=yes",
    "results":[
        {
            "_type":"Conference",
            "id":"24778",
            "title":"ePIC Streaming WG meeting: Review preparation",
            "description":"<p>ePIC streaming computing model meetings are organized by the joint DAQ \/ S&amp;C Streaming Computing Model Working Group. Meetings are\u00a0nominally weekly, Tuesdays at 9am eastern. They address streaming readout (SRO) and all aspects of the ePIC streaming computing model.\u00a0<\/p>\r\n<p>See the <a href=\"https:\/\/docs.google.com\/document\/d\/1t5vBfgro8Kb6MKc-bz2Y67u3cOCpHK4dfepbJX-nEbE\/edit?usp=sharing\">meeting notes page<\/a> for agendas prepared in advance and live notes.<\/p>\r\n<ul>\r\n<li><a href=\"https:\/\/jlab-org.zoomgov.com\/j\/1614875218?pwd=RFRPcGlNM3BaS0pQaDhxS3JURkdJZz09\">https:\/\/jlab-org.zoomgov.com\/j\/1614875218?pwd=RFRPcGlNM3BaS0pQaDhxS3JURkdJZz09<\/a>\u00a0<\/li>\r\n<li>Meeting ID: 1614875218<\/li>\r\n<li>Password: 925723<\/li>\r\n<\/ul>",
            "startDate":{
                "date":"2024-09-10",
                "time":"09:00:00",
                "tz":"America\/New_York"
            },
            "timezone":"US\/Eastern",
            "endDate":{
                "date":"2024-09-10",
                "time":"10:00:00",
                "tz":"America\/New_York"
            },
            "room":"",
            "location":"",
            "address":"",
            "type":"meeting",
            "references":[],
            "_fossil":"conferenceMetadataWithContribs",
            "categoryId":463,
            "category":"Working Group Meetings",
            "note":{},
            "roomFullname":"",
            "url":"https:\/\/indico.bnl.gov\/event\/24778\/",
            "creationDate":{
                "date":"2024-09-06",
                "time":"06:59:54.129918",
                "tz":"America\/New_York"
            },
            "creator":{
                "_type":"Avatar",
                "_fossil":"conferenceChairMetadata",
                "first_name":"Torre",
                "last_name":"Wenaus",
                "fullName":"Wenaus, Torre",
                "id":"591",
                "affiliation":"BNL",
                "emailHash":"849637192af92a0f322682b2abc1e859"
            },
            "hasAnyProtection":false,
            "roomMapURL":"",
            "folders":[],
            "chairs":[],
            "material":[],
            "keywords":[],
            "organizer":"",
            "language":null,
            "label":null,
            "visibility":{
                "id":"",
                "name":"Everywhere"
            },
            "contributions":[
                {
                    "_type":"Contribution",
                    "_fossil":"contributionMetadata",
                    "id":"1",
                    "db_id":96379,
                    "friendly_id":1,
                    "title":"Review preparation topics",
                    "startDate":{
                        "date":"2024-09-10",
                        "time":"09:05:00",
                        "tz":"America\/New_York"
                    },
                    "endDate":{
                        "date":"2024-09-10",
                        "time":"10:00:00",
                        "tz":"America\/New_York"
                    },
                    "duration":55,
                    "roomFullname":"",
                    "room":"",
                    "note":{},
                    "location":"",
                    "type":null,
                    "description":"See the meeting notes",
                    "folders":[
                        {
                            "_type":"folder",
                            "id":57080,
                            "title":null,
                            "description":"",
                            "attachments":[
                                {
                                    "_type":"attachment",
                                    "id":97957,
                                    "download_url":"https:\/\/indico.bnl.gov\/event\/24778\/contributions\/96379\/attachments\/57080\/97957\/16-Computing-Aug-16.pdf",
                                    "title":"16-Computing-Aug-16.pdf",
                                    "description":"",
                                    "modified_dt":"2024-09-06T11:44:46.927900+00:00",
                                    "type":"file",
                                    "is_protected":false,
                                    "filename":"16-Computing-Aug-16.pdf",
                                    "content_type":"application\/pdf",
                                    "size":1786845,
                                    "checksum":"011b8525108f15f8e034117c1851220a"
                                },
                                {
                                    "_type":"attachment",
                                    "id":97956,
                                    "download_url":"https:\/\/indico.bnl.gov\/event\/24778\/contributions\/96379\/attachments\/57080\/97956\/Jeff%20-%20Plan%20for%20integration-test-installation%20of%20DAQ.pdf",
                                    "title":"Jeff - Plan for integration-test-installation of DAQ.pdf",
                                    "description":"",
                                    "modified_dt":"2024-09-06T11:43:03.290100+00:00",
                                    "type":"file",
                                    "is_protected":false,
                                    "filename":"Jeff - Plan for integration-test-installation of DAQ.pdf",
                                    "content_type":"application\/pdf",
                                    "size":1628066,
                                    "checksum":"c74d81e89d20a9ec945f75f60133ad89"
                                }
                            ],
                            "default_folder":true,
                            "is_protected":false
                        }
                    ],
                    "url":"https:\/\/indico.bnl.gov\/event\/24778\/contributions\/96379\/",
                    "material":[],
                    "speakers":[],
                    "primaryauthors":[],
                    "coauthors":[],
                    "keywords":[],
                    "track":null,
                    "session":null,
                    "references":[],
                    "board_number":"",
                    "code":""
                },
                {
                    "_type":"Contribution",
                    "_fossil":"contributionMetadata",
                    "id":"2",
                    "db_id":96378,
                    "friendly_id":2,
                    "title":"Top of the meeting",
                    "startDate":{
                        "date":"2024-09-10",
                        "time":"09:00:00",
                        "tz":"America\/New_York"
                    },
                    "endDate":{
                        "date":"2024-09-10",
                        "time":"09:05:00",
                        "tz":"America\/New_York"
                    },
                    "duration":5,
                    "roomFullname":"",
                    "room":"",
                    "note":{},
                    "location":"",
                    "type":null,
                    "description":"",
                    "folders":[],
                    "url":"https:\/\/indico.bnl.gov\/event\/24778\/contributions\/96378\/",
                    "material":[],
                    "speakers":[
                        {
                            "_type":"ContributionParticipation",
                            "_fossil":"contributionParticipationMetadata",
                            "first_name":"Jeff",
                            "last_name":"Landgraf",
                            "fullName":"Landgraf, Jeff",
                            "id":"132875",
                            "affiliation":"Brookhaven National Laboratory",
                            "emailHash":"b9e019dd3a816b634573c9195148248e",
                            "db_id":132875,
                            "person_id":120009
                        },
                        {
                            "_type":"ContributionParticipation",
                            "_fossil":"contributionParticipationMetadata",
                            "first_name":"Jin",
                            "last_name":"Huang",
                            "fullName":"Huang, Jin",
                            "id":"132876",
                            "affiliation":"Brookhaven National Lab",
                            "emailHash":"ed1a76b6b45733398a0cf74409943bbe",
                            "db_id":132876,
                            "person_id":120010
                        },
                        {
                            "_type":"ContributionParticipation",
                            "_fossil":"contributionParticipationMetadata",
                            "first_name":"Marco",
                            "last_name":"Battaglieri",
                            "fullName":"Battaglieri, Marco",
                            "id":"132877",
                            "affiliation":"Jefferson Lab",
                            "emailHash":"074082959c419e8690b95665f93b7d44",
                            "db_id":132877,
                            "person_id":120011
                        },
                        {
                            "_type":"ContributionParticipation",
                            "_fossil":"contributionParticipationMetadata",
                            "first_name":"Markus",
                            "last_name":"Diefenthaler",
                            "fullName":"Diefenthaler, Markus",
                            "id":"132998",
                            "affiliation":"Jefferson Lab",
                            "emailHash":"af834e0ec8ce37c7bc6e53f554561c99",
                            "db_id":132998,
                            "person_id":120110
                        },
                        {
                            "_type":"ContributionParticipation",
                            "_fossil":"contributionParticipationMetadata",
                            "first_name":"Torre",
                            "last_name":"Wenaus",
                            "fullName":"Wenaus, Torre",
                            "id":"132879",
                            "affiliation":"BNL",
                            "emailHash":"849637192af92a0f322682b2abc1e859",
                            "db_id":132879,
                            "person_id":120013
                        }
                    ],
                    "primaryauthors":[],
                    "coauthors":[],
                    "keywords":[],
                    "track":null,
                    "session":null,
                    "references":[],
                    "board_number":"",
                    "code":""
                }
            ]
        }
    ],
    "_type":"HTTPAPIResult"
}

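
Given output like the above, the attachments could be collected along with the event and category numbers. A rough sketch: the JSON keys are the ones visible in the example output, while `list_attachments`, the output field names, and the choice to skip protected files are assumptions for illustration.

```python
from typing import Any, Dict, List


def list_attachments(export: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Flatten an Indico export result into one row per downloadable file."""
    rows = []
    for event in export.get("results", []):
        # Folders can hang off the event itself or off each contribution.
        holders = [event] + list(event.get("contributions", []))
        for holder in holders:
            for folder in holder.get("folders", []):
                for att in folder.get("attachments", []):
                    # Skip entries with nothing to download and protected files.
                    if not att.get("download_url") or att.get("is_protected"):
                        continue
                    rows.append(
                        {
                            "event_id": event["id"],  # candidate unique key
                            "category_id": event.get("categoryId"),  # groups WG meetings
                            "parent_title": holder.get("title"),
                            "filename": att.get("filename"),
                            "content_type": att.get("content_type"),
                            "download_url": att["download_url"],
                            "checksum": att.get("checksum"),
                        }
                    )
    return rows
```

The `content_type` field lets a later step pick a parser per format (PDF, pptx, and so on), and `checksum` could be used to skip files that were already ingested.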
karthik18495 commented 1 month ago

Could we include this as a utility during ingestion? I am thinking of reorganizing the ingestion code around a utils folder where scrapers like this one are built in.

panta-123 commented 1 month ago

Yes. Let me think about how to add this.