ELVIS-Project / simssadb

New version of the ELVIS database. A database of files containing searchable symbolic representations of scores. See staging at db.staging.simmsa.ca.
https://db.simssa.ca
GNU General Public License v3.0

Add back wikidata #395

Open rebmizrahi opened 1 year ago

rebmizrahi commented 1 year ago

There should be data in the DB for geographic area, genres, etc. from wikidata/VIAF. This was implemented in 2019 but is not in the current repo, and I don't think it was ever completed (the latest commit messages are 'WIP', to look into). I think this would be very useful to avoid duplicates with the upload form and ensure a base amount of metadata exists for each newly uploaded file. Upload form should probably have a dropdown select menu, or an autocomplete with choices?

fujinaga commented 1 year ago

Yes, that would be great. It will be a nice segue to LinkedMusic as well.

codaich commented 1 year ago

One of the issues that came up in the early discussions with Julie and other musicologists is that expert annotators might want to be able to use regional and genre annotations that might not conform to fixed vocabularies, and in particular might need to incorporate uncertainty or non-standard vocabularies. With respect to geographic area, for example, annotations like "the court of Francis I" (which could move) or "probably Bruges but maybe Ghent" might be appropriate, rather than specific geopolitical entities in space-time. With respect to dates, ranges must be possible, like "1525 to 1540", not just references to specific dates. Genre might be more tractable, but please be sure to differentiate between "Genre (Type of Work)" and "Genre (Style)," and to reference the extended discussions on genre vocabularies and ontological structuring documented in Teamwork. So, overall, references to standard vocabularies should be made possible and encouraged (e.g. via text auto-complete), but the additional possibility of free-text entries was emphasized as essential during our consultations with musicologists.
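
As a rough illustration of "encourage standard vocabularies via auto-complete, but still accept free text", here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `GENRE_VOCAB` entries, the `example.org` URIs, and the `suggest` / `annotate` helpers are hypothetical and do not come from the SIMSSA DB code base.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

# Hypothetical controlled vocabulary: label -> URI (placeholder URIs).
GENRE_VOCAB: Dict[str, str] = {
    "Madrigal": "https://example.org/genres/madrigal",
    "Mass": "https://example.org/genres/mass",
    "Motet": "https://example.org/genres/motet",
}


def suggest(prefix: str, vocabulary: Dict[str, str], limit: int = 5) -> List[str]:
    """Return vocabulary labels starting with the typed prefix (case-insensitive)."""
    needle = prefix.strip().casefold()
    if not needle:
        return []
    matches = sorted(
        label for label in vocabulary if label.casefold().startswith(needle)
    )
    return matches[:limit]


@dataclass(frozen=True)
class Annotation:
    text: str            # what the contributor entered (or the matched label)
    uri: Optional[str]   # set only when the entry matches the vocabulary


def annotate(text: str, vocabulary: Dict[str, str]) -> Annotation:
    """Link the entry to a vocabulary URI if it matches; otherwise keep free text."""
    cleaned = text.strip()
    for label, uri in vocabulary.items():
        if label.casefold() == cleaned.casefold():
            return Annotation(label, uri)
    return Annotation(cleaned, None)
```

With this shape, `annotate("motet", GENRE_VOCAB)` yields a linked entry, while `annotate("the court of Francis I", GENRE_VOCAB)` is stored as free text with `uri=None`, so nothing the contributor typed is lost.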

fujinaga commented 1 year ago

Yes, exactly!! These are some of the core challenges in the LinkedMusic project.

ahankinson commented 1 year ago

Free text entries are fine for humans to read, but it can be hard to do computation on them if they aren't formalized somehow. This can include linking records together, particularly for places. If you have "probably Bruges but maybe Ghent", your users, and definitely other computers, would want them to appear as links to both a "Bruges" authority and a "Ghent" authority, possibly with some indicator that qualifies or quantifies the relationship as questionable. Otherwise, all you have is a text string that doesn't actually link the source record to either, making reverse lookups and prosopography / provenance queries very difficult.

Questionable dates are definitely necessary, but formalizing them somehow is useful. I would recommend looking into supporting the Extended Date-Time Format (EDTF). https://www.loc.gov/standards/datetime/
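
To give a feel for what formalized date input with validation could look like, here is a minimal Python sketch. It handles only a tiny, year-only subset of EDTF as I understand the specification: a four-digit year, an optional qualifier (`?` uncertain, `~` approximate, `%` both), and a `start/end` interval. The names `YearValue`, `YearRange`, `parse_year` and `parse_year_or_range` are hypothetical; a real implementation would want to cover the full standard.

```python
import re
from dataclasses import dataclass

# Four-digit year plus an optional qualifier:
# "?" uncertain, "~" approximate, "%" uncertain and approximate.
_YEAR = re.compile(r"(\d{4})([?~%]?)")


@dataclass(frozen=True)
class YearValue:
    year: int
    uncertain: bool
    approximate: bool


@dataclass(frozen=True)
class YearRange:
    start: YearValue
    end: YearValue


def parse_year(text: str) -> YearValue:
    """Parse a single year such as '1530', '1530?' or '1530~'."""
    match = _YEAR.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"not a supported year expression: {text!r}")
    year, flag = match.groups()
    return YearValue(int(year), flag in ("?", "%"), flag in ("~", "%"))


def parse_year_or_range(text: str) -> YearRange:
    """Parse 'YYYY' or 'YYYY/YYYY'; a single year becomes a one-year range."""
    parts = text.split("/")
    if len(parts) == 1:
        value = parse_year(parts[0])
        return YearRange(value, value)
    if len(parts) == 2:
        start, end = parse_year(parts[0]), parse_year(parts[1])
        if start.year > end.year:
            raise ValueError(f"range starts after it ends: {text!r}")
        return YearRange(start, end)
    raise ValueError(f"too many '/' separators: {text!r}")
```

Under these assumptions, "1525 to 1540" would be entered as `1525/1540`, and an input such as "around 1530" would be rejected at validation time instead of being stored as an unparseable string.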

The way to do this can be very simple, but you do need to signal it to your users. The method I use is to make formalized statements required (e.g., "hard" links to places or formalized methods of date input with validation), while making human-readable statements optional as a typed form of note. This way users have the ability to add extra context for other humans to read, but the formalized statements take precedence.
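
Sketched in Python, that approach might look something like the following. This is only an illustration under stated assumptions: `Certainty`, `PlaceLink`, `SourceRecord` and `records_for_place` are hypothetical names, and the `example.org` URIs are placeholders for real authority records.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Certainty(Enum):
    """Hypothetical qualifier for how firmly a record is tied to a place."""
    CERTAIN = "certain"
    PROBABLE = "probable"
    POSSIBLE = "possible"


@dataclass(frozen=True)
class PlaceLink:
    authority_uri: str  # "hard" link to a place authority (placeholder URIs below)
    label: str
    certainty: Certainty = Certainty.CERTAIN


@dataclass
class SourceRecord:
    title: str
    places: List[PlaceLink]  # formalized statements: required
    place_note: str = ""     # human-readable context: optional

    def __post_init__(self) -> None:
        if not self.places:
            raise ValueError("at least one formal place link is required")


def records_for_place(records: List[SourceRecord], authority_uri: str) -> List[SourceRecord]:
    """Reverse lookup: every record linked to the given place authority."""
    return [
        record
        for record in records
        if any(link.authority_uri == authority_uri for link in record.places)
    ]


BRUGES = "https://example.org/places/bruges"  # placeholder URI
GHENT = "https://example.org/places/ghent"    # placeholder URI

record = SourceRecord(
    title="Example source",
    places=[
        PlaceLink(BRUGES, "Bruges", Certainty.PROBABLE),
        PlaceLink(GHENT, "Ghent", Certainty.POSSIBLE),
    ],
    place_note="probably Bruges but maybe Ghent",
)
```

Here the record turns up in a reverse lookup for either Bruges or Ghent, the qualifier records how doubtful each link is, and the original wording survives as a note for human readers.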

I would say this is particularly important for SIMSSA DB, where the data will be used for computational analysis. You don't want to get to the end of the project and find that you have no way of doing geographic or temporal analysis because your users have entered places and dates in 20 different ways!

Don't forget that any website that has the ability to publish something online also has the possibility of being an authority for something! If you see a gap in knowledge -- such as a formalized list of genres, in their various incarnations and meanings -- then you can share your authorities by publishing URLs and URIs for them so that others can use them as URIs in their own projects.

So you should think of your project as publishing its own authorities, so that people can, for example, visit a "Places" or a "Genres" page and find all the reverse links within your own database. From there it's not a huge stretch to think of people using your URIs in their own database to do similar groupings.

codaich commented 1 year ago

Thanks for the input, Andrew. The EDTF format in particular is something we should for sure look into, thanks.

Whether or not to allow contributors the option of free text entries is a good example of the tension between what developers would ideally like and what the specific target users would ideally like, I think. Although forbidding free text certainly does have advantages for search and linking of resources, as you say, the consultations we had with early music scholars earlier on made it clear that having free text as an option for certain fields was essential to them, and there was also concern that requiring too much structuring or obligatory formatting on submissions would be too onerous for some expert volunteer contributors of data and metadata. The best compromise seemed to be to encourage controlled vocabularies and URIs as much as possible, and to make them as accessible as possible for contributors, but not to eliminate the option of free text entirely.

This could change if we end up hiring our own curators to enter and annotate information (or restructure / reformat it), however, as opposed to making the SIMSSA DB entirely an independently user-contributed repository that expert scholars voluntarily contribute to; such repurposing of the SIMSSA DB is still an option under consideration, but that would also require at least medium-term financing that may not be available.

fujinaga commented 1 year ago

We can keep both: the free text and Linked Data. I don't know if you remember, but about 10 years ago I had this Human History Project, where I wanted to extract all named entities and relationships from Wikipedia and other sources automatically. Well, we're getting very close (e.g., ChatGPT), so we can have musicologists enter the text and semi-automatically convert it to Linked Data!