gbad-project / gbad-project.github.io

Graph-Based Archival Description Project’s website, a fork of Démonstrateur Sparnatural des Archives nationales de France.
https://gbad-project.github.io/

Subseries of Walkerton Inquiry: finding aid is a website, but for some other records finding aid is something else #3

Open pvzhelnov opened 1 month ago

PeiwenZhang commented 1 week ago

Hi Pavel, could you specify what the finding aid is here? What does a finding aid represent? P.S. I have a feeling that this could be resolved by the AO team, since it relates to an issue in the original raw data.

pvzhelnov commented 1 week ago

Hi Peiwen, a finding aid here refers to FINDAID in the FINDAID:FINDAIDLINK:FINDAID_URL column of the original DESCRIPTION.CSV. The "Data Structure Documentation" folder in the cloud drive contains some additional insights about these fields.

The specific issue that the AO team raised here is that the contents of that field are diverse, and a set of different rules will need to be developed and applied in order to parse FINDAID comprehensively. Data-wise, we could provide the AO team with an overview of values present in that column. I think a large language model would be in the best position to extract these patterns from the data. Let’s discuss this issue more at the next meeting.
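As a starting point for such an overview, here is a minimal Python sketch that tallies the values of the column by coarse shape. It assumes DESCRIPTION.CSV has a header row containing the FINDAID:FINDAIDLINK:FINDAID_URL column name exactly as written above. The category names, the `REFCODE` column, and the sample rows are all made up for illustration and are not taken from the real data.

```python
import csv
import io
import re
from collections import Counter

# Column name as quoted in this thread; assumed to appear verbatim in the header.
COLUMN = "FINDAID:FINDAIDLINK:FINDAID_URL"
URL_RE = re.compile(r"https?://\S+", re.IGNORECASE)


def classify(value):
    """Assign a coarse, hypothetical category to one raw cell value."""
    text = (value or "").strip()
    if not text:
        return "empty"
    if URL_RE.fullmatch(text):
        return "url_only"
    if URL_RE.search(text):
        return "text_with_url"
    return "text_only"


def summarize(csv_file, column=COLUMN):
    """Count how many rows fall into each category."""
    reader = csv.DictReader(csv_file)
    return Counter(classify(row.get(column)) for row in reader)


# Made-up sample rows, for illustration only (not real AO data).
SAMPLE = io.StringIO(
    "REFCODE,FINDAID:FINDAIDLINK:FINDAID_URL\n"
    "A1,http://example.org/inquiry\n"
    "A2,See paper inventory in the reading room\n"
    "A3,Online listing at http://example.org/list and on site\n"
    "A4,\n"
)
counts = summarize(SAMPLE)
for category, n in counts.most_common():
    print(category, n)
```

To run this against the real export, the in-memory sample could be replaced with `open("DESCRIPTION.CSV", newline="", encoding="utf-8")`, with the encoding being a guess. The "text_only" bucket is where the diverse cases would need closer inspection.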

pvzhelnov commented 4 days ago

Some work could be done here, such as simply parsing out URLs, with or without natural language processing of the content, to get a better idea of whether online resources exist for a given record. This is probably a good fit for SPARQL with regex processing for starters. It is not yet clear, however, how many useful links we can extract, so we’ll take care of it a bit later.
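For a rough idea of what the URL parsing step might look like, here is a small sketch in Python rather than SPARQL, purely for illustration. The pattern is a deliberately loose assumption (anything starting with `http://`, `https://` or `www.`) and has not been checked against the actual FINDAID values. The function name `extract_urls` is hypothetical.

```python
import re

# Loose pattern: a scheme or "www." followed by non-space, non-quote characters.
URL_RE = re.compile(r"(?:https?://|www\.)[^\s<>\"']+", re.IGNORECASE)
# Punctuation that often trails a URL in running text and is not part of it.
TRAILING = ".,;:!?)]}"


def extract_urls(text):
    """Return the distinct URL-like substrings of text, in order of appearance."""
    urls = []
    for match in URL_RE.findall(text or ""):
        url = match.rstrip(TRAILING)
        if url and url not in urls:
            urls.append(url)
    return urls


# Made-up example value, not taken from DESCRIPTION.CSV.
example = "See http://example.org/a, and (www.example.org/b)."
print(extract_urls(example))
```

Counting how many records yield a non-empty list would give a first estimate of how many useful links there are. Stripping trailing punctuation is a heuristic and would wrongly shorten a URL that really ends in one of those characters.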