HeardLibrary / vandycite


Write a workflow document listing the steps to be taken during the process of creating Wikidata items for Commons artworks. #43

Closed baskaufs closed 2 years ago

baskaufs commented 2 years ago

This document needs to include whatever was learned from the trials. Note the conclusion of Issue #17 about titles.

baskaufs commented 2 years ago
  1. There should be no works left that have ACT IDs pointing to items that aren't artwork items. But check this.
  2. The next stage of the project is to create artwork items for Commons images that are listed as sources in the ACT records, excluding any Commons images that are already linked from Wikidata items. The file add_to_wikidata.csv should contain these artworks, but it needs to be rechecked to remove any artworks that were recently linked to new or existing Wikidata items.
  3. The remaining items should be double-checked to confirm that none of them have Wikidata links on their Commons pages, so that we don't accidentally create duplicate items. If any are found, the linked item pages should be examined to see whether they are artworks (likely) or non-artworks. If artworks, manually make the ACT ID link. If non-artworks, manually create an artwork item page and link it.
  4. There is an edge case where a Commons image has no link to a Wikidata artwork but is a derivative (a crop or black-and-white version) of another Commons image that does. In that case we would erroneously create an artwork item for the derivative (essentially a duplicate artwork item), when that image should instead be set aside until we decide how to handle derivative cases. This happened with an image of some shoes cropped from a larger, famous wedding painting; it was picked up and flagged by the Sum of All Paintings people.
  5. Once we have a clean list of Commons works to create Wikidata items for, we need to run the script to pull the data from Commons and the ACT database as we did before and run the same cleanup process. This will be a much larger list. This will involve the same checklist in https://github.com/HeardLibrary/vandycite/issues/9 .
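The filtering in steps 2–3 could be sketched roughly as below. This is only an illustration: the `commons_file` field name is an assumption, not the actual column header in add_to_wikidata.csv.

```python
def remove_already_linked(candidates, linked_commons_files):
    """Return candidate rows whose Commons file is not already linked from Wikidata.

    candidates: list of dicts (rows from the candidate CSV)
    linked_commons_files: iterable of Commons file names already linked
    """
    linked = set(linked_commons_files)
    # keep only rows whose Commons file has no existing Wikidata link
    return [row for row in candidates if row["commons_file"] not in linked]
```

The survivors would then still need the manual Commons-page check described in step 3 before upload.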
baskaufs commented 2 years ago

@baskaufs will work through these steps and generate a spreadsheet to use for quality control.

baskaufs commented 2 years ago

In the broadest terms:

The spreadsheet clean_ids.csv should be the starting point for determining what needs to be uploaded. It probably needs to be double-checked for duplicates.

This can be cross-checked against the works_already_in_wikidata.csv file, which I think is up to date, although that needs to be confirmed with a SPARQL query. These works can be removed from the list above to determine the candidates for addition.

Those IDs can then be used to extract the appropriate data from act_all_202109241353_repaired.csv, the clean data source from the ACT data dump. The extracted data would be combined with data downloaded from Commons using the script [?] to create the combined table for proofreading.
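The extract-and-combine step above might look something like this minimal sketch, using in-memory rows in place of the real CSV files; the `act_id` key name is an assumption about the file headers, not the actual column name:

```python
def extract_and_combine(act_rows, commons_rows, key="act_id"):
    """Keep ACT rows that have a Commons candidate and join the Commons columns.

    act_rows: list of dicts from the repaired ACT dump
    commons_rows: list of dicts downloaded from Commons
    """
    commons_by_id = {row[key]: row for row in commons_rows}
    # inner join on the ACT ID: only rows present in both sources survive
    return [
        {**act_row, **commons_by_id[act_row[key]]}
        for act_row in act_rows
        if act_row[key] in commons_by_id
    ]
```

In practice the two inputs would be read with `csv.DictReader` (or pandas) and the result written back out as the combined table for proofreading.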

I've summarized the files and how they were used in the first part of Phase 2 on the create_items folder README.

baskaufs commented 2 years ago

I used this script to start with the clean_ids.csv file and search for duplicates. I found a few that were missed in the cleaned_output.csv file and moved them to the `duplicates_of_existing_commons_ids.csv` file.
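The duplicate search could be sketched as follows; `commons_id` is an assumed field name, not necessarily the header used in clean_ids.csv:

```python
from collections import Counter

def find_duplicates(rows, key="commons_id"):
    """Return every row whose Commons ID appears more than once."""
    counts = Counter(row[key] for row in rows)
    # a row is a duplicate if its ID occurs in two or more rows
    return [row for row in rows if counts[row[key]] > 1]
```

Rows returned here would be moved out of the candidate file into the duplicates file before upload.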

So cleaned_output.csv really should be the starting point, since it has duplicates removed. When it was created, all of its URLs were checked to make sure that they dereferenced. This should be done again in case any were taken down since the last check.
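Re-checking that the URLs still dereference could be done roughly as below, using only the standard library. This is a sketch: a real run against Wikimedia servers would need throttling between requests and probably retries for transient failures.

```python
from urllib.error import HTTPError, URLError
from urllib.request import Request, urlopen

def dead_urls(urls, timeout=10):
    """Return the URLs that no longer dereference (HTTP error or unreachable)."""
    dead = []
    for url in urls:
        try:
            # HEAD request: check the URL without downloading the file itself
            with urlopen(Request(url, method="HEAD"), timeout=timeout):
                pass
        except (HTTPError, URLError, ValueError):
            dead.append(url)
    return dead
```

Any URLs returned here would be removed from (or flagged in) the candidate list before item creation.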

The script also does a SPARQL query to get the items with ACT ID statements. This should be used in preference to works_already_in_wikidata.csv since that file seems to be missing a few works.
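That SPARQL check might look something like the sketch below. Note the assumptions: `P9092` is a placeholder for the project's actual ACT ID property, and the user-agent string is invented for illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

ENDPOINT = "https://query.wikidata.org/sparql"
ACT_PROPERTY = "P9092"  # placeholder: substitute the actual ACT ID property number

def build_query(prop):
    """SPARQL for every Wikidata item carrying an ACT ID statement."""
    return "SELECT ?item ?actId WHERE { ?item wdt:%s ?actId . }" % prop

def items_with_act_ids(prop=ACT_PROPERTY):
    """Query the Wikidata Query Service; return {act_id: item_uri}."""
    params = urlencode({"query": build_query(prop), "format": "json"})
    # Wikimedia asks for a descriptive User-Agent on API traffic
    request = Request(ENDPOINT + "?" + params,
                      headers={"User-Agent": "vandycite-act-check/0.1"})
    with urlopen(request) as response:
        bindings = json.load(response)["results"]["bindings"]
    return {b["actId"]["value"]: b["item"]["value"] for b in bindings}
```

The resulting live set of ACT IDs would replace works_already_in_wikidata.csv as the authoritative list of works already in Wikidata.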

baskaufs commented 2 years ago

This issue has either been completed or has parts transferred to https://github.com/HeardLibrary/vandycite/issues/53