HeardLibrary / vandycite


Sort out licensing for works in ACT #94

Open baskaufs opened 1 year ago

baskaufs commented 1 year ago

Charlotte has a dump of the copyright field, with the ACT ID, copyright statement (i.e. image source info) and copyright permission (CC, or other statement). What needs to happen is:

baskaufs commented 1 year ago

Did some preliminary work on checking commons URLs.

  1. Cleaned the output from Jodie, now in processed_lists/clean_metadata_2022-09-29.csv
  2. Used the script processed_lists/clean_raw_export_data.ipynb to grab all Wikidata items with ACT IDs and Commons images. The result is in processed_lists/act_items_by_query.csv
  3. There were 2578 ACT works in the dump that had a Commons URL as their copyright value and could be matched to the ACT Wikidata items from the query. Another 940 works from the dump had Commons URLs but couldn't be matched to ACT Wikidata items. Some of these are probably among the 183 Commons images that were black and white, or crops of artworks, that we didn't create items for. Even if all 183 are among them, that still leaves at least 757 (940 - 183) ACT works from the dump that aren't associated with Wikidata items for some reason and may need items created. There were also 348 ACT items from the query that didn't match any ACT works in the dump, but that's OK; they probably just aren't listed with the Commons URL as their copyright source.
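For reference, the matching in step 3 comes down to set operations on ACT IDs. The following is only a rough sketch of that idea, not the code in the notebook: the column names `act_id` and `copyright`, the URL prefix test, and the tiny inline CSVs are all assumptions made up for illustration.

```python
import csv
import io

# Assumed prefix for recognising a Commons URL in the copyright field.
COMMONS_PREFIX = "https://commons.wikimedia.org/"


def read_ids(csv_text, id_column, filter_column=None, prefix=None):
    """Return the set of non-empty IDs in id_column.

    If filter_column and prefix are given, keep only rows whose
    filter_column value starts with prefix.
    """
    ids = set()
    for row in csv.DictReader(io.StringIO(csv_text)):
        act_id = (row.get(id_column) or "").strip()
        if not act_id:
            continue
        if filter_column is not None:
            value = (row.get(filter_column) or "").strip()
            if not value.startswith(prefix):
                continue
        ids.add(act_id)
    return ids


def partition_by_match(dump_ids, query_ids):
    """Split IDs into matched, dump-only, and query-only sets."""
    dump_ids, query_ids = set(dump_ids), set(query_ids)
    return {
        "matched": dump_ids & query_ids,
        "dump_only": dump_ids - query_ids,
        "query_only": query_ids - dump_ids,
    }


# Toy data standing in for the cleaned dump and the query result.
dump_csv = (
    "act_id,copyright\n"
    "101,https://commons.wikimedia.org/wiki/File:A.jpg\n"
    "102,https://commons.wikimedia.org/wiki/File:B.jpg\n"
    "103,Some other rights statement\n"
)
query_csv = "act_id,qid\n101,Q1\n104,Q4\n"

dump_ids = read_ids(dump_csv, "act_id", "copyright", COMMONS_PREFIX)
query_ids = read_ids(query_csv, "act_id")
for name, ids in partition_by_match(dump_ids, query_ids).items():
    print(name, len(ids), sorted(ids))
```

On the real files, the three counts would correspond to the 2578 matched, 940 dump-only, and 348 query-only figures above.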

The next step here is to run the script that looks for tiny Wikidata links on the Commons pages and see how many of those pages don't link to any Wikidata item. Then the task would be to add ACT links to the Wikidata items for the pages that do have links, and potentially create Wikidata items for the pages that don't.
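A rough sketch of what that check might look like, assuming the HTML of each Commons page has already been downloaded. This is not the existing script; the function names, the regular expression, and the sample HTML are hypothetical, and the real pages may expose the link differently.

```python
import re

# Matches absolute or protocol-relative links to a Wikidata item page.
WIKIDATA_LINK = re.compile(r"(?:https?:)?//(?:www\.)?wikidata\.org/wiki/(Q[1-9]\d*)")


def find_wikidata_qids(html):
    """Return the distinct Q IDs linked from the page, in order of appearance."""
    seen = []
    for qid in WIKIDATA_LINK.findall(html):
        if qid not in seen:
            seen.append(qid)
    return seen


def split_by_link(pages):
    """Split {page name: html} into pages with and without a Wikidata link."""
    linked, unlinked = {}, []
    for name, html in pages.items():
        qids = find_wikidata_qids(html)
        if qids:
            linked[name] = qids
        else:
            unlinked.append(name)
    return linked, unlinked


# Made-up page fragments for illustration only.
pages = {
    "File:A.jpg": '<a href="https://www.wikidata.org/wiki/Q12418">item</a>',
    "File:B.jpg": "<p>No structured data link here.</p>",
}
linked, unlinked = split_by_link(pages)
print("linked:", linked)
print("unlinked:", unlinked)
```

Pages in `linked` would be candidates for adding an ACT link to the existing item, and pages in `unlinked` would be candidates for new Wikidata items.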