WordPress / openverse

Openverse is a search engine for openly-licensed media. This monorepo includes all application code.
https://openverse.org
MIT License

Handle duplication of records between auckland_museum and wikimedia #3659

Open stacimc opened 5 months ago

stacimc commented 5 months ago

Problem

As noted by @sarayourfriend in this comment, many records from the Auckland Museum's collection are already in Openverse due to their inclusion in Wikimedia Commons. If we run both DAGs and do nothing to address this, these records will be duplicated in Openverse.

Description

Suggestion taken directly from Sara's comment:

Either we'd need to suppress the entries from Wikimedia Commons or (probably my preference) improve our ingestion of Wikimedia Commons so that we can identify sources like this within it. Glancing at the Wikimedia Commons provider script, I don't think we currently save the "collections" metadata present in the file summary on the Wikimedia Commons page.

I think this is a big opportunity to expand the list of high quality sources without introducing duplicates, while also improving our Wikimedia Commons data ingestion, cleanup, and overall handling. For this institution in particular, there is a great page describing how the metadata is structured: https://commons.wikimedia.org/wiki/Commons:Batch_uploading/AucklandMuseumCCBY

The same information would also be relevant for the National Gallery of Art (https://github.com/WordPress/openverse/issues/3167) (see this Wikimedia result, which is in Openverse with similarly poorly handled metadata and is in an NGA collection in Wikimedia Commons's data). I imagine there are a handful of other such institutions that we could add, just by improving the Wikimedia Commons script and our handling of their data.

And actually, when digging through Wikimedia Commons and Wikidata pages researching this comment, I found this amazing spreadsheet that would help us identify these exact kinds of institutions, for Wikimedia Commons, Europeana, Flickr, and even TROVE (https://github.com/WordPress/openverse/issues/2653): https://docs.google.com/spreadsheets/d/1WPS-KJptUJ-o8SXtg00llcxq0IKJu8eO6Ege_GrLaNc/edit#gid=1216556120

Additional context

The auckland_museum DAG is currently blocked on other issues (see DAG Status page), but this issue should not necessarily prevent us from turning the DAG on.

However, we should not add the provider as a source in the API until this has been resolved.

sarayourfriend commented 5 months ago

Just adding a note that it would be great to do this in a way that lets us also identify other big sources within Wikimedia Commons, like the NGA (#3167) or others in the spreadsheet I linked in the comment Staci quoted: https://docs.google.com/spreadsheets/d/1WPS-KJptUJ-o8SXtg00llcxq0IKJu8eO6Ege_GrLaNc/edit#gid=1216556120

Specifically, getting the "Collection" metadata from Wikimedia Commons (probably into the meta_data blob?) would allow deduplicating Auckland Museum Tamaki Paenga Hira (because we can identify the Wikimedia Commons records to suppress in favour of the first party ones) and also make additional source querying possible in the future.

This would be in contrast to not storing that metadata and just reading it to exclude these records.
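To make the idea concrete, here is a minimal sketch of how stored collection metadata could drive the suppression. It is only an illustration under stated assumptions: the `collections` key inside `meta_data`, the `FIRST_PARTY_COLLECTIONS` mapping, the `"Auckland Museum"` collection string, and the `split_duplicates` helper are all hypothetical and do not exist in the current provider script.

```python
# Hypothetical mapping from a Wikimedia Commons collection name to the
# first-party Openverse provider that ingests the same records.
# The collection string is a placeholder, not the real Commons value.
FIRST_PARTY_COLLECTIONS = {
    "Auckland Museum": "auckland_museum",
}


def split_duplicates(records, first_party=FIRST_PARTY_COLLECTIONS):
    """Split Wikimedia Commons records into (kept, suppressed).

    Assumes each record is a dict whose ``meta_data`` blob may hold a
    ``collections`` list (an assumed key, saved at ingestion time).
    A record is suppressed when any of its collections is also
    ingested directly from a first-party provider.
    """
    kept, suppressed = [], []
    for record in records:
        # meta_data may be missing or None, so fall back to an empty dict.
        meta_data = record.get("meta_data") or {}
        collections = meta_data.get("collections") or []
        if any(name in first_party for name in collections):
            suppressed.append(record)
        else:
            kept.append(record)
    return kept, suppressed


example_records = [
    {"foreign_identifier": "a", "meta_data": {"collections": ["Auckland Museum"]}},
    {"foreign_identifier": "b", "meta_data": {"collections": []}},
    {"foreign_identifier": "c", "meta_data": None},
]
kept, suppressed = split_duplicates(example_records)
```

Because the collection names stay in `meta_data` rather than being read once and discarded, the same field could later back source-level queries for other institutions, such as the NGA.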

stacimc commented 2 weeks ago

We may also need to identify duplicated records uploaded from Flickr (https://www.flickr.org/introducing-flickypedia/).