HeardLibrary / vandycite


apply text analysis and cleaning code to generate Wikidata metadata #3

Closed baskaufs closed 2 years ago

baskaufs commented 3 years ago

Titles (goal is for every work to have a title; really a cleaning exercise): The first step is to identify what the problems are and whether they are bad enough that we should script a solution or just edit manually. Correct the language tags, and figure out languages for titles that lack one. Identify parts of titles that are actually translations and remove them for use as labels in other languages. Compare titles with labels to see whether there is a pattern of mistakes where titles are in en but got missed. See whether there is a pattern for titles identified as en that are actually fr or some other language. How should we handle titles that have a non-en title with a translation in parentheses? Can we use that to build labels in other languages?
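As a starting point for the parenthetical-translation question, here is a minimal sketch that separates a trailing parenthetical from the rest of a title. The pattern and the function name `split_title` are hypothetical and not taken from any existing script in this repo. It assumes at most one trailing parenthetical per title and makes no attempt to decide whether the parenthetical really is a translation (it could just as well be something like "recto"), so the output would still need human review.

```python
import re

# Hypothetical pattern: a main title followed by one trailing parenthetical.
TRAILING_PARENS = re.compile(r"^(?P<main>.+?)\s*\((?P<paren>[^()]+)\)\s*$")


def split_title(title):
    """Return (main_title, parenthetical), or (title, None) if there is no trailing parenthetical."""
    cleaned = title.strip()
    match = TRAILING_PARENS.match(cleaned)
    if match is None:
        return cleaned, None
    return match.group("main"), match.group("paren").strip()


if __name__ == "__main__":
    for example in ["Le Pont Neuf (The New Bridge)", "Untitled"]:
        print(split_title(example))
```

Language detection on each half would be a separate step; this only isolates the candidate strings.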

Depicts (goal is to see how many things we can figure out automatically): Named entity recognition from labels: identify whether each named entity is the thing depicted or the artist. Match to Wikidata items and add a value for depicts. Perhaps use some kind of fuzzy matching with artist labels. Part-of-speech identification: what about entities like "horse" that aren't named? Is there a way to detect nouns? Can we get "oak chest" rather than just "chest"?
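One way the fuzzy matching against artist labels might look, using only the standard library's `difflib`. This is a sketch, not a tested approach for this dataset: the function name `best_artist_match` and the 0.85 threshold are assumptions, and the threshold would need tuning against real labels. The idea is that a name extracted from a work's label that closely matches a known artist label is more likely to be the artist than a depicted subject.

```python
from difflib import SequenceMatcher


def best_artist_match(candidate, artist_labels, threshold=0.85):
    """Return (label, score) for the closest artist label, or None if nothing reaches the threshold."""

    def score(label):
        # Case-insensitive similarity ratio between 0.0 and 1.0.
        return SequenceMatcher(None, candidate.casefold(), label.casefold()).ratio()

    best = max(artist_labels, key=score, default=None)
    if best is None:
        return None
    best_score = score(best)
    if best_score < threshold:
        return None
    return best, best_score


if __name__ == "__main__":
    artists = ["Albrecht Dürer", "Rembrandt"]
    print(best_artist_match("Albrecht Durer", artists))
    print(best_artist_match("horse", artists))
```

A plain character ratio tolerates missing diacritics and small typos but not reordered names ("Dürer, Albrecht"), so name normalization would probably be needed first.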

Instance Of (goal is to apply more specific items): Correct the InstanceOf values for the pieces and make them more specific. In particular, "work of art" needs to be refined, and "image" should be photograph, print, painting, etc. Look at the class hierarchy for the classes used. Do the most specific ones (ones with a single instance) have parent classes that include the broader classes with many instances (like "painting")? Does "print" (the largest category) actually have narrower subclasses that would be more specific?
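The class-hierarchy questions could be explored with a SPARQL query over the subclass of property (P279). Below is a sketch of a helper that builds such a query for the Wikidata Query Service; the function name `subclass_query` is hypothetical, and the QID in the usage example (Q3305213, which I believe is "painting") should be verified before relying on it. The helper only builds the query text and does not send it anywhere.

```python
def subclass_query(class_qid):
    """Build a SPARQL query listing all direct and indirect subclasses of a Wikidata class."""
    if not (class_qid.startswith("Q") and class_qid[1:].isdigit()):
        raise ValueError(f"not a Wikidata item ID: {class_qid!r}")
    # wdt:P279+ follows one or more "subclass of" links up to the given class.
    return f"""
SELECT DISTINCT ?subclass ?subclassLabel WHERE {{
  ?subclass wdt:P279+ wd:{class_qid} .
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
}}
"""


if __name__ == "__main__":
    # Assumed QID for "painting"; check it on Wikidata before use.
    print(subclass_query("Q3305213"))
```

Running the same query for the item used for "print" would show whether narrower subclasses exist; reversing the direction (`wd:QID wdt:P279+ ?parent`) would answer whether a rarely used class sits under one of the broad classes.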

Country/country of origin: Some have been changed by bots. Which ones? Is this distinction useful? Do we need both?

Materials: materials from descriptions, medium, substrate. Techniques mixed with materials ("Etching on silk"). There is a Material Used spreadsheet, but most works don't have values. Not sure if it's possible to automate the more specific medium and substrate unless there are data in the original source with more info that I didn't use.
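For strings that mix technique with material, such as "Etching on silk", a first pass might simply split on the word "on" and check the first part against a list of known techniques. The sketch below assumes the "X on Y" pattern holds, and the `KNOWN_TECHNIQUES` set is a made-up placeholder that would need to come from the actual data; anything not in that set is treated as a material.

```python
import re

# Hypothetical, incomplete list; a real one would be built from the values in the data.
KNOWN_TECHNIQUES = {"etching", "engraving", "lithograph", "woodcut"}


def split_medium(medium):
    """Split an 'X on Y' medium string into technique, material, and substrate."""
    parts = re.split(r"\s+on\s+", medium.strip(), maxsplit=1, flags=re.IGNORECASE)
    first = parts[0].strip().lower()
    substrate = parts[1].strip().lower() if len(parts) == 2 else None
    is_technique = first in KNOWN_TECHNIQUES
    return {
        "technique": first if is_technique else None,
        "material": None if is_technique else first,
        "substrate": substrate,
    }


if __name__ == "__main__":
    for example in ["Etching on silk", "Oil on canvas", "Bronze"]:
        print(example, "->", split_medium(example))
```

Each extracted string would still have to be matched to a Wikidata item before it could be written as a statement.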

Automated translation/generation: labels for artists: Pull from ULAN? Other structured data sources? Automate translation of labels for Asian works into Mandarin with QC by Fellows.

baskaufs commented 3 years ago

The text that comes out at the bottom of this workflow is what we can use for disambiguation when assigning values to these properties:

baskaufs commented 2 years ago

Divided into sub-issues.