HeardLibrary / vandycite


Create workflow to create new ACT Wikidata items and upload to Commons #103

Closed: baskaufs closed this issue 9 months ago

baskaufs commented 1 year ago

Need to modify parts of previous workflows to be usable with new ACT items. This issue will supersede https://github.com/HeardLibrary/vandycite/issues/91

baskaufs commented 1 year ago

Tasks:

Issues previously recognized with CommonsTool were:

baskaufs commented 1 year ago

Previous scripts:

create_act_items.ipynb https://github.com/HeardLibrary/vandycite/blob/master/act/create_items/create_act_items.ipynb Need to verify exactly what it handles. Notes say it's for works already in Commons whose Wikidata items aren't artwork items, and that it eventually needs to handle Commons works without Wikidata items.

act.ipynb https://github.com/HeardLibrary/vandycite/blob/master/act/act.ipynb was used for cleaning up and matching with existing Wikidata items.

commons_data.ipynb https://github.com/HeardLibrary/linked-data/blob/master/commonsbot/commons_data.ipynb was used to scrape data from existing Commons items.
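
For reference, a minimal sketch of the kind of Commons API query such a scraper makes. This is not the notebook's actual code; the file title and requested properties are only examples.

```python
# Sketch of scraping metadata for an existing Commons file via the MediaWiki
# API. The actual notebook may request different properties.
import requests

API_URL = "https://commons.wikimedia.org/w/api.php"

def commons_imageinfo(file_title):
    """Return size, URL, and extended metadata for one Commons file page."""
    params = {
        "action": "query",
        "format": "json",
        "titles": file_title,            # e.g. "File:Example.jpg"
        "prop": "imageinfo",
        "iiprop": "size|url|extmetadata",
    }
    response = requests.get(API_URL, params=params)
    response.raise_for_status()
    pages = response.json()["query"]["pages"]
    page = next(iter(pages.values()))    # keyed by page ID; take the one page
    return page.get("imageinfo", [{}])[0]

info = commons_imageinfo("File:Example.jpg")
print(info.get("width"), info.get("height"), info.get("url"))
```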

commonstool.py https://github.com/HeardLibrary/linked-data/blob/master/commonsbot/commonstool.py is the general-purpose tool for uploading to Commons. The local copy in the directory is not substantively different, so the main copy could be used instead.
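
The core Commons upload that a tool like this ultimately performs is a MediaWiki action=upload POST. A hedged sketch (it assumes a logged-in requests.Session and a CSRF token already in hand; this is not commonstool.py's actual structure):

```python
# Hedged sketch of a Commons file upload via the MediaWiki action=upload API.
# Assumes "session" is an already-logged-in requests.Session and "csrf_token"
# was fetched with action=query&meta=tokens.
import requests

API_URL = "https://commons.wikimedia.org/w/api.php"

def upload_file(session, csrf_token, local_path, commons_filename, wikitext):
    """Upload one local file and create its initial wikitext page."""
    with open(local_path, "rb") as file_handle:
        response = session.post(
            API_URL,
            data={
                "action": "upload",
                "format": "json",
                "filename": commons_filename,
                "text": wikitext,         # e.g. an {{Artwork}} template
                "token": csrf_token,
                "ignorewarnings": "0",
            },
            files={"file": (commons_filename, file_handle)},
        )
    response.raise_for_status()
    return response.json()
```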

The README for the processed_lists directory has a summary of the files and the scripts that generated them.

baskaufs commented 1 year ago

Files:

act_data_fix.csv https://github.com/HeardLibrary/vandycite/blob/master/act/processed_lists/act_data_fix.csv Database dump format. This was previously used as input for creating Wikidata items, but for works that were already in Commons.

act_all_202209291736.csv https://github.com/HeardLibrary/vandycite/blob/master/act/processed_lists/act_all_202209291736.csv is the most recent complete ACT dump.

richardson.csv https://github.com/HeardLibrary/vandycite/blob/master/act/richardson_upload/richardson.csv is a subset of the most recent dump that only includes works by Ann Richardson and Jim Womack.

filenames.csv https://github.com/HeardLibrary/vandycite/blob/master/act/richardson_upload/filenames.csv links the ACT IDs to the fullsize file download URLs.
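
A sketch of how the two files can be combined to fetch the fullsize images. The column names "act_id" and "download_url" are assumptions, not confirmed headers; check the actual CSVs before running.

```python
# Sketch of joining filenames.csv to the works table and downloading the
# fullsize files. Column names here are assumptions.
import os
import pandas as pd
import requests

works = pd.read_csv("richardson.csv")
files = pd.read_csv("filenames.csv")
merged = works.merge(files, on="act_id", how="left")

os.makedirs("images", exist_ok=True)
for row in merged.itertuples():
    response = requests.get(row.download_url)
    response.raise_for_status()
    with open(f"images/{row.act_id}.jpg", "wb") as image_file:
        image_file.write(response.content)
```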

Other items in the new_items directory were used in the test uploads, including the config.json file and the csv-metadata.json files that go with the previous CSV.

artwork_metadata.csv https://github.com/HeardLibrary/vandycite/blob/master/act/create_items/new_items/artwork_metadata.csv Input file for CommonsTool containing the Q ID, local identifier, dimension (2D or 3D), and copyright status of each artwork.

images.csv https://github.com/HeardLibrary/vandycite/blob/master/act/create_items/new_items/images.csv Input file for CommonsTool containing image metadata: file size, creation date, pixel dimensions, and a foreign key to the accession. Most of this is extracted from EXIF data in preprocessing; see the sketch below.
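
A sketch of that EXIF preprocessing step using Pillow. The returned keys are illustrative, not necessarily the exact images.csv column names.

```python
# Sketch of pulling the images.csv fields out of the file itself with Pillow.
# EXIF tag 36867 is DateTimeOriginal (in the Exif sub-IFD); 306 is DateTime.
import os
from PIL import Image

def image_metadata(path):
    with Image.open(path) as img:
        width, height = img.size
        exif = img.getexif()
        exif_ifd = exif.get_ifd(0x8769)  # Exif sub-IFD
        create_date = exif_ifd.get(36867) or exif.get(306)
    return {
        "kilobytes": round(os.path.getsize(path) / 1024),
        "width": width,
        "height": height,
        "create_date": create_date,
    }
```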

commons_images.csv https://github.com/HeardLibrary/vandycite/blob/master/act/create_items/new_items/commons_images.csv is the record of Commons uploads. It is output only, except that it is read in order to skip over images that are already done, as sketched below.
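
The skip-over pattern is simple; a sketch (the shared "image_name" column is an assumption):

```python
# Sketch of the skip-over behavior: anything already recorded in
# commons_images.csv is dropped from the upload queue.
import pandas as pd

done = set(pd.read_csv("commons_images.csv")["image_name"])
images = pd.read_csv("images.csv")
to_upload = images[~images["image_name"].isin(done)]
```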

new_act_artworks.csv https://github.com/HeardLibrary/vandycite/blob/master/act/create_items/new_items/new_act_artworks.csv has the necessary column headers for the Wikidata upload of the image and manifest claims after the Commons upload, but it was also used in the initial Wikidata item creation.

baskaufs commented 12 months ago

Screened out:

Added dates that Charlotte looked up

Designated images as 3D or 2D and transferred values to the spreadsheet from the database dump (richardson.csv)

baskaufs commented 10 months ago

Prep for Wikidata upload on 2023-10-06:

Charlotte added information to the titles in the richardson_duplicated_titles_edited.xlsx file; these new titles should be used. That spreadsheet also has the part_of values that we need in order to link the works to the cathedral or other building they are part of. After making the edits below, the sheet was saved as richardson_duplicate_titles_prep.xlsx.

  1. Deleted rows with red ID numbers. Those are images that are basically redundant with other images. Saved as redundant_items.csv (and Excel)
  2. Exported the Excel file as a CSV and saved it as richardson_duplicate_titles_prep.csv (and Excel)
  3. Modified the config.json and richardson_upload.ipynb files to add the "part_of" column. Added the part_of column to the full dataset (richardson.csv)
  4. Scripted the removal of the redundant items and the replacement of the original titles with the new wrangled titles (a sketch of this step follows the list)
  5. NOTE: creators that are not anonymous must be added manually!
  6. Manually added part_of values for places that were identified in the duplicates file but didn't have a value in the original table.
  7. Manual proofreading
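
A sketch of the step-4 script. The column names "act_id", "title", and "part_of" are assumptions based on the file descriptions above; the actual notebook code may differ.

```python
# Sketch of step 4: drop the redundant items from the full dataset and swap
# in the wrangled titles (plus part_of) from the edited sheet.
import pandas as pd

works = pd.read_csv("richardson.csv")
edited = pd.read_csv("richardson_duplicate_titles_prep.csv")
redundant = pd.read_csv("redundant_items.csv")

# Remove the rows whose IDs were flagged (red) as redundant.
works = works[~works["act_id"].isin(redundant["act_id"])]

# Overwrite titles and pick up part_of values from the edited sheet.
works = works.set_index("act_id")
works.update(edited.set_index("act_id")[["title", "part_of"]])
works.reset_index().to_csv("richardson.csv", index=False)
```
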
baskaufs commented 10 months ago

On 2023-10-06, uploaded 905 artwork items, recorded in act_artworks.csv.

baskaufs commented 10 months ago

Added code to richardson_upload.ipynb to generate artwork_metadata.csv, the general artwork metadata input file for commonstool.py, and also images.csv, the general image metadata input file.
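
Roughly, the new cell does something like the following; the source column names are assumptions about richardson.csv's headers, not confirmed.

```python
# Sketch of deriving the two commonstool.py input files from the works table.
import pandas as pd

works = pd.read_csv("richardson.csv")

# General artwork metadata: Q ID, local ID, 2D/3D dimension, copyright status.
works[["qid", "act_id", "dimension", "copyright_status"]].to_csv(
    "artwork_metadata.csv", index=False)

# General image metadata: file size, creation date, pixel dimensions, and the
# foreign key back to the accession (ACT ID).
works[["image_name", "kilobytes", "create_date", "width", "height",
       "act_id"]].to_csv("images.csv", index=False)
```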

Also merged and transferred data from the two test uploads that were duplicated when I created all of the Wikidata items.

baskaufs commented 9 months ago

Modified commonstool.py and its associated commonstool_config.yaml file to allow for multiple photographers of 3D images and to add the necessary qualifiers for "file available on the internet" sources.
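
A hedged sketch of the multiple-photographer handling. The actual commonstool_config.yaml schema isn't shown here; this assumes the change was from a single photographer value to a list, with the key name "photographers" and the claim-building step both hypothetical.

```python
# Hedged sketch: accept either a single photographer or a list in the YAML
# config, then emit one photographer claim per person.
import yaml

with open("commonstool_config.yaml") as config_file:
    config = yaml.safe_load(config_file)

photographers = config.get("photographers", [])
if isinstance(photographers, str):   # tolerate the old single-value form
    photographers = [photographers]

for photographer_qid in photographers:
    print("would add a photographer claim for", photographer_qid)
```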

baskaufs commented 9 months ago

Upload completed.