Open pvzhelnov opened 1 month ago
Hi Peiwen, a finding aid here refers to FINDAID in the FINDAID:FINDAIDLINK:FINDAID_URL column of the original DESCRIPTION.CSV. The "Data Structure Documentation" folder in the cloud drive contains some additional insights about these fields.
The specific issue that the AO team raised here is that the contents of that field are diverse, and a set of different rules will need to be developed and applied in order to parse FINDAID comprehensively. Data-wise, we could provide the AO team with an overview of values present in that column. I think a large language model would be in the best position to extract these patterns from the data. Let’s discuss this issue more at the next meeting.
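As a rough starting point for that overview, here is a minimal Python sketch that counts how many FINDAID values fall into a few coarse shapes. It assumes that `FINDAID:FINDAIDLINK:FINDAID_URL` is the literal column header in DESCRIPTION.CSV. The shape names and the `shape_counts` helper are hypothetical, made up for illustration and not taken from the data documentation.

```python
import csv
import re
from collections import Counter

# Assumption: anything starting with http:// or https:// counts as a URL.
URL_RE = re.compile(r"https?://\S+", re.IGNORECASE)


def classify(value):
    """Assign a FINDAID value to one of four coarse, hypothetical shapes."""
    value = (value or "").strip()
    if not value:
        return "empty"
    if URL_RE.fullmatch(value):
        return "url_only"
    if URL_RE.search(value):
        return "text_with_url"
    return "text_only"


def shape_counts(csv_file, column="FINDAID:FINDAIDLINK:FINDAID_URL"):
    """Count how many rows of an open CSV file fall into each shape."""
    counts = Counter()
    for row in csv.DictReader(csv_file):
        counts[classify(row.get(column))] += 1
    return counts
```

Calling `shape_counts(open("DESCRIPTION.CSV", newline="", encoding="utf-8"))` would give a first picture of how diverse the column is before any parsing rules are written; the encoding is a guess and may need adjusting.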
Some things could be done here already, such as simply parsing out URLs, with or without natural language processing of the content, to get a better idea of whether online resources exist for a given record. This is probably a good fit for SPARQL with regex processing for starters. It is not clear at this point, however, how many useful links we can extract, so we'll take care of it a bit later.
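To illustrate the simple URL parsing step, here is a minimal sketch in plain Python rather than SPARQL (a swapped-in technique, chosen only because it is easy to try on a CSV export). The pattern and the `extract_urls` helper are assumptions for illustration; real FINDAID values may need looser or stricter rules.

```python
import re

# Assumption: URLs start with http:// or https:// and run until whitespace,
# a quote, or an angle bracket.
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+", re.IGNORECASE)


def extract_urls(findaid_value):
    """Return all URLs found in one FINDAID value, in order of appearance."""
    urls = []
    for match in URL_PATTERN.findall(findaid_value or ""):
        # Drop punctuation that usually belongs to the surrounding sentence.
        urls.append(match.rstrip(".,;:)"))
    return urls
```

A record for which `extract_urls` returns an empty list has no obvious link in FINDAID, which could serve as a first signal of whether an online resource exists for it.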
Hi Pavel, could you specify which finding aid is meant here? What does "finding aid" represent? P.S. I have a feeling that this could be resolved by the AO team, since it relates to an issue with the original raw data.