Open karthik18495 opened 1 month ago
I am interested in the Indico part.
Indico API doc: https://docs.getindico.io/en/stable/http-api/
To get information about an event:
curl --header "Authorization: Bearer <token>" 'https://indico.bnl.gov/export/event/24778.json?detail=contributions&pretty=yes'
Example output is shown below. From it we can get the PDFs/slides, if any were uploaded to the event, as well as the meeting notes. The file format can vary (e.g. someone may attach or link a pptx), so this should be handled. The event number can serve as the unique key for documents indexed into the vector DB. The category number can be used to link documents that belong to the same working group's meetings.
{
"count":1,
"additionalInfo":{},
"ts":1729866972,
"url":"https:\/\/indico.bnl.gov\/export\/event\/24778.json?detail=contributions&pretty=yes",
"results":[
{
"_type":"Conference",
"id":"24778",
"title":"ePIC Streaming WG meeting: Review preparation",
"description":"<p>ePIC streaming computing model meetings are organized by the joint DAQ \/ S&C Streaming Computing Model Working Group. Meetings are\u00a0nominally weekly, Tuesdays at 9am eastern. They address streaming readout (SRO) and all aspects of the ePIC streaming computing model.\u00a0<\/p>\r\n<p>See the <a href=\"https:\/\/docs.google.com\/document\/d\/1t5vBfgro8Kb6MKc-bz2Y67u3cOCpHK4dfepbJX-nEbE\/edit?usp=sharing\">meeting notes page<\/a> for agendas prepared in advance and live notes.<\/p>\r\n<ul>\r\n<li><a href=\"https:\/\/jlab-org.zoomgov.com\/j\/1614875218?pwd=RFRPcGlNM3BaS0pQaDhxS3JURkdJZz09\">https:\/\/jlab-org.zoomgov.com\/j\/1614875218?pwd=RFRPcGlNM3BaS0pQaDhxS3JURkdJZz09<\/a>\u00a0<\/li>\r\n<li>Meeting ID: 1614875218<\/li>\r\n<li>Password: 925723<\/li>\r\n<\/ul>",
"startDate":{
"date":"2024-09-10",
"time":"09:00:00",
"tz":"America\/New_York"
},
"timezone":"US\/Eastern",
"endDate":{
"date":"2024-09-10",
"time":"10:00:00",
"tz":"America\/New_York"
},
"room":"",
"location":"",
"address":"",
"type":"meeting",
"references":[],
"_fossil":"conferenceMetadataWithContribs",
"categoryId":463,
"category":"Working Group Meetings",
"note":{},
"roomFullname":"",
"url":"https:\/\/indico.bnl.gov\/event\/24778\/",
"creationDate":{
"date":"2024-09-06",
"time":"06:59:54.129918",
"tz":"America\/New_York"
},
"creator":{
"_type":"Avatar",
"_fossil":"conferenceChairMetadata",
"first_name":"Torre",
"last_name":"Wenaus",
"fullName":"Wenaus, Torre",
"id":"591",
"affiliation":"BNL",
"emailHash":"849637192af92a0f322682b2abc1e859"
},
"hasAnyProtection":false,
"roomMapURL":"",
"folders":[],
"chairs":[],
"material":[],
"keywords":[],
"organizer":"",
"language":null,
"label":null,
"visibility":{
"id":"",
"name":"Everywhere"
},
"contributions":[
{
"_type":"Contribution",
"_fossil":"contributionMetadata",
"id":"1",
"db_id":96379,
"friendly_id":1,
"title":"Review preparation topics",
"startDate":{
"date":"2024-09-10",
"time":"09:05:00",
"tz":"America\/New_York"
},
"endDate":{
"date":"2024-09-10",
"time":"10:00:00",
"tz":"America\/New_York"
},
"duration":55,
"roomFullname":"",
"room":"",
"note":{},
"location":"",
"type":null,
"description":"See the meeting notes",
"folders":[
{
"_type":"folder",
"id":57080,
"title":null,
"description":"",
"attachments":[
{
"_type":"attachment",
"id":97957,
"download_url":"https:\/\/indico.bnl.gov\/event\/24778\/contributions\/96379\/attachments\/57080\/97957\/16-Computing-Aug-16.pdf",
"title":"16-Computing-Aug-16.pdf",
"description":"",
"modified_dt":"2024-09-06T11:44:46.927900+00:00",
"type":"file",
"is_protected":false,
"filename":"16-Computing-Aug-16.pdf",
"content_type":"application\/pdf",
"size":1786845,
"checksum":"011b8525108f15f8e034117c1851220a"
},
{
"_type":"attachment",
"id":97956,
"download_url":"https:\/\/indico.bnl.gov\/event\/24778\/contributions\/96379\/attachments\/57080\/97956\/Jeff%20-%20Plan%20for%20integration-test-installation%20of%20DAQ.pdf",
"title":"Jeff - Plan for integration-test-installation of DAQ.pdf",
"description":"",
"modified_dt":"2024-09-06T11:43:03.290100+00:00",
"type":"file",
"is_protected":false,
"filename":"Jeff - Plan for integration-test-installation of DAQ.pdf",
"content_type":"application\/pdf",
"size":1628066,
"checksum":"c74d81e89d20a9ec945f75f60133ad89"
}
],
"default_folder":true,
"is_protected":false
}
],
"url":"https:\/\/indico.bnl.gov\/event\/24778\/contributions\/96379\/",
"material":[],
"speakers":[],
"primaryauthors":[],
"coauthors":[],
"keywords":[],
"track":null,
"session":null,
"references":[],
"board_number":"",
"code":""
},
{
"_type":"Contribution",
"_fossil":"contributionMetadata",
"id":"2",
"db_id":96378,
"friendly_id":2,
"title":"Top of the meeting",
"startDate":{
"date":"2024-09-10",
"time":"09:00:00",
"tz":"America\/New_York"
},
"endDate":{
"date":"2024-09-10",
"time":"09:05:00",
"tz":"America\/New_York"
},
"duration":5,
"roomFullname":"",
"room":"",
"note":{},
"location":"",
"type":null,
"description":"",
"folders":[],
"url":"https:\/\/indico.bnl.gov\/event\/24778\/contributions\/96378\/",
"material":[],
"speakers":[
{
"_type":"ContributionParticipation",
"_fossil":"contributionParticipationMetadata",
"first_name":"Jeff",
"last_name":"Landgraf",
"fullName":"Landgraf, Jeff",
"id":"132875",
"affiliation":"Brookhaven National Laboratory",
"emailHash":"b9e019dd3a816b634573c9195148248e",
"db_id":132875,
"person_id":120009
},
{
"_type":"ContributionParticipation",
"_fossil":"contributionParticipationMetadata",
"first_name":"Jin",
"last_name":"Huang",
"fullName":"Huang, Jin",
"id":"132876",
"affiliation":"Brookhaven National Lab",
"emailHash":"ed1a76b6b45733398a0cf74409943bbe",
"db_id":132876,
"person_id":120010
},
{
"_type":"ContributionParticipation",
"_fossil":"contributionParticipationMetadata",
"first_name":"Marco",
"last_name":"Battaglieri",
"fullName":"Battaglieri, Marco",
"id":"132877",
"affiliation":"Jefferson Lab",
"emailHash":"074082959c419e8690b95665f93b7d44",
"db_id":132877,
"person_id":120011
},
{
"_type":"ContributionParticipation",
"_fossil":"contributionParticipationMetadata",
"first_name":"Markus",
"last_name":"Diefenthaler",
"fullName":"Diefenthaler, Markus",
"id":"132998",
"affiliation":"Jefferson Lab",
"emailHash":"af834e0ec8ce37c7bc6e53f554561c99",
"db_id":132998,
"person_id":120110
},
{
"_type":"ContributionParticipation",
"_fossil":"contributionParticipationMetadata",
"first_name":"Torre",
"last_name":"Wenaus",
"fullName":"Wenaus, Torre",
"id":"132879",
"affiliation":"BNL",
"emailHash":"849637192af92a0f322682b2abc1e859",
"db_id":132879,
"person_id":120013
}
],
"primaryauthors":[],
"coauthors":[],
"keywords":[],
"track":null,
"session":null,
"references":[],
"board_number":"",
"code":""
}
]
}
],
"_type":"HTTPAPIResult"
}
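As a rough sketch of how the response above could be consumed during ingestion, the snippet below walks results -> contributions -> folders -> attachments and yields one flat record per attachment. The function names `fetch_event` and `iter_attachments` and the record layout are hypothetical, not existing code in this repository; the host and field names are taken from the example output, and fields are read with `.get` on the assumption that some attachments (for example links) may not carry all of them.

```python
import json
import urllib.request
from typing import Iterator

INDICO_BASE = "https://indico.bnl.gov"  # host taken from the example above


def fetch_event(event_id: int, token: str) -> dict:
    """Fetch one event, including its contributions, from the export API."""
    url = f"{INDICO_BASE}/export/event/{event_id}.json?detail=contributions"
    request = urllib.request.Request(
        url, headers={"Authorization": f"Bearer {token}"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def iter_attachments(export: dict) -> Iterator[dict]:
    """Yield one flat record per attachment found in any contribution."""
    for event in export.get("results", []):
        for contribution in event.get("contributions", []):
            for folder in contribution.get("folders", []):
                for attachment in folder.get("attachments", []):
                    yield {
                        # Event id: candidate unique key in the vector DB.
                        "event_id": event.get("id"),
                        # Category id: groups meetings of one working group.
                        "category_id": event.get("categoryId"),
                        "contribution_title": contribution.get("title"),
                        "filename": attachment.get("filename"),
                        # Used later to pick a parser (pdf, pptx, ...).
                        "content_type": attachment.get("content_type"),
                        "download_url": attachment.get("download_url"),
                    }
```

Usage would look like `records = list(iter_attachments(fetch_event(24778, token)))`. Keeping the parsing separate from the HTTP call means the same function can be run against saved JSON files while testing.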
Could we include this as a utils tool during ingestion? I am thinking of reorganizing the ingestion code around a utils folder where these scrapers are built in.
Yes. Let me think about how to add this.
GitHub Issue: Extend Vector Database with Public and Indico Pages for EIC Information
Issue Title:
Extend the Vector Database to Include Information from Public and Indico Pages for EIC
Description:
The Electron-Ion Collider (EIC) project would benefit from expanding the existing vector database to include data from public and Indico pages. This will enhance the system's ability to retrieve relevant documents and presentations for users. By incorporating these sources, we can provide a more comprehensive dataset for retrieval and improve the contextual quality of the responses.
This issue proposes:
Use Case:
Users can query the extended vector database to retrieve specific EIC-related documents, presentations, or meeting notes, allowing them to discover both internal and public information from Indico and other public sources.
Tasks:
Data Collection:
Preprocessing: [...] PyMuPDF or pdfminer.
Vectorization: [...] (text-embedding-ada-002) to generate vector embeddings for the text data.
Indexing in Vector Database:
Testing and Validation:
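The preprocessing and vectorization tasks above could be sketched roughly as follows. This is only an illustration under stated assumptions: the helper names are hypothetical, the fixed-size overlapping chunking is not something the issue prescribes, and the embedding model is passed in as a plain callable so that any backend (such as text-embedding-ada-002) can be plugged in without this sketch depending on a particular client library. The PDF step uses PyMuPDF, imported lazily because it is a third-party package.

```python
from typing import Callable, List, Sequence, Tuple


def extract_pdf_text(path: str) -> str:
    """Extract plain text from a PDF with PyMuPDF (third-party package)."""
    import fitz  # PyMuPDF; imported here so the rest works without it

    with fitz.open(path) as document:
        return "\n".join(page.get_text() for page in document)


def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> List[str]:
    """Split text into overlapping character windows (assumed strategy)."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    text = " ".join(text.split())  # collapse runs of whitespace
    chunks: List[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
        start += chunk_size - overlap
    return chunks


def embed_chunks(
    chunks: Sequence[str],
    embed: Callable[[Sequence[str]], List[List[float]]],
) -> List[Tuple[str, List[float]]]:
    """Pair each chunk with its vector; `embed` wraps the chosen model."""
    vectors = embed(chunks)
    if len(vectors) != len(chunks):
        raise ValueError("embedding backend returned a different item count")
    return list(zip(chunks, vectors))
```

Passing the embedding function in as an argument keeps the pipeline testable with a stub and leaves the choice of model and client to the existing ingestion code.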
Proposed Code Changes:
API Integration: Extend the current codebase for ingestion to integrate with the Indico API for fetching relevant events and document data.
Vectorization Pipeline: Modify the existing preprocessing and vectorization pipeline to handle documents from both public and Indico sources.
Database Update: Adjust the database schema to accommodate new metadata fields such as source and event_date.
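One possible shape for the new metadata is sketched below. The `source` and `event_date` fields come from this issue; the `DocumentMetadata` class, the `metadata_from_event` helper, and the extra `event_id` and `category_id` fields are assumptions based on the Indico event JSON shown earlier in the thread, not an existing schema.

```python
from dataclasses import asdict, dataclass
from datetime import date


@dataclass(frozen=True)
class DocumentMetadata:
    source: str        # e.g. "indico" or "public" (labels are assumptions)
    event_date: date   # start date of the event the document belongs to
    event_id: str      # Indico event number, candidate unique key
    category_id: int   # Indico category, groups a working group's meetings


def metadata_from_event(event: dict, source: str = "indico") -> DocumentMetadata:
    """Build metadata from one entry of the export API's `results` list."""
    return DocumentMetadata(
        source=source,
        event_date=date.fromisoformat(event["startDate"]["date"]),
        event_id=str(event["id"]),
        category_id=int(event["categoryId"]),
    )


def to_record(metadata: DocumentMetadata) -> dict:
    """Flatten to JSON-friendly values for storage alongside a vector."""
    record = asdict(metadata)
    record["event_date"] = metadata.event_date.isoformat()
    return record
```

Storing the date as an ISO string in the flattened record is a design choice made here so that it can be filtered or sorted lexicographically in stores that only accept primitive metadata values.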
References:
Priority:
Medium - Enhancing the vector database with these sources will greatly improve the overall retrieval quality and allow users to access a broader range of documents and presentations.