Open stacimc opened 5 months ago
Just adding a note that it would be great to do this in a way that lets us also identify other big sources within Wikimedia Commons, like the NGA (#3167) or others in the spreadsheet I linked in the comment Staci quoted: https://docs.google.com/spreadsheets/d/1WPS-KJptUJ-o8SXtg00llcxq0IKJu8eO6Ege_GrLaNc/edit#gid=1216556120
Specifically, getting the "Collection" metadata from Wikimedia Commons (probably into the meta_data blob?) would allow deduplicating Auckland Museum Tamaki Paenga Hira (because we can identify the Wikimedia Commons records to suppress in favour of the first-party ones) and would also make additional source querying possible in the future.
This would be in contrast to not storing that metadata and just reading it to exclude these records.
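As a rough illustration of the "store it, then suppress" approach described above, here is a minimal sketch. It assumes the Wikimedia Commons "Collection" value ends up under a hypothetical `collection` key inside `meta_data`, and that records carry a `source` field. The mapping, the key names, and the `should_suppress` helper are all invented for illustration and are not existing Openverse code.

```python
# Hypothetical mapping from a Wikimedia Commons "Collection" value to the
# first-party source that should win when both are present.
FIRST_PARTY_COLLECTIONS = {
    "Auckland Museum Tamaki Paenga Hira": "auckland_museum",
}


def should_suppress(record, enabled_sources):
    """Return True if a Wikimedia Commons record duplicates an enabled first-party source.

    `record` is assumed to be a dict with a `source` string and an optional
    `meta_data` dict; `enabled_sources` is the set of sources currently served.
    """
    if record.get("source") != "wikimedia":
        return False
    # `collection` is an assumed key name for the stored "Collection" metadata.
    collection = (record.get("meta_data") or {}).get("collection")
    first_party = FIRST_PARTY_COLLECTIONS.get(collection)
    return first_party is not None and first_party in enabled_sources


commons_record = {
    "source": "wikimedia",
    "meta_data": {"collection": "Auckland Museum Tamaki Paenga Hira"},
}
print(should_suppress(commons_record, {"wikimedia", "auckland_museum"}))
print(should_suppress(commons_record, {"wikimedia"}))
```

Because the collection is stored rather than only read at ingestion time, the same field could later back queries for other large sources within Wikimedia Commons, such as the NGA, by adding entries to the mapping.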
We may also need to identify duplicated records uploaded from Flickr (https://www.flickr.org/introducing-flickypedia/).
Problem
As noted by @sarayourfriend in this comment, many records from the Auckland Museum's collection are already in Openverse due to their inclusion in Wikimedia Commons. If we run both DAGs and do nothing to address this, these records will be duplicated in Openverse.
Description
Suggestion taken directly from Sara's comment:
Additional context
The auckland_museum DAG is currently blocked on other issues (see DAG Status page), but this issue should not necessarily prevent us from turning the DAG on. However, we should not add the provider as a source in the API until this has been resolved.