Closed: @baskaufs closed this issue 2 years ago
@baskaufs will work through these steps and generate a spreadsheet to use for quality control.
In the broadest terms:
The spreadsheet `clean_ids.csv` should be the starting point for determining what needs to be uploaded. It probably needs to be double-checked for duplicates.
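A quick way to double-check for duplicates, assuming the identifier lives in a column named `id` (adjust to the actual header in `clean_ids.csv`):

```python
import csv
from collections import Counter

def find_duplicate_ids(csv_path, id_column="id"):
    """Return the IDs that occur more than once in the given column."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        counts = Counter(row[id_column] for row in csv.DictReader(f))
    return sorted(i for i, n in counts.items() if n > 1)
```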
This can be cross-checked against the `works_already_in_wikidata.csv` file, which I think is up to date but needs to be confirmed with a SPARQL query. These works can be removed from the list above to determine the candidates for addition.
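The cross-check amounts to a set difference; a minimal sketch, assuming the two ID lists have already been read out of the respective CSVs:

```python
def candidates_for_addition(clean_ids, existing_ids):
    """IDs from the cleaned list not yet attached to a Wikidata item."""
    return sorted(set(clean_ids) - set(existing_ids))
```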
Those IDs can then be used to extract the appropriate data from `act_all_202109241353_repaired.csv`, the clean data source from the ACT data dump. That extracted data would be combined with data downloaded from Commons using the script [?] to create the combined table for proofreading.
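The extraction and merge could look something like this; the column name `id` and the join key are assumptions to be adjusted to the actual headers:

```python
import csv

def extract_rows(csv_path, wanted_ids, id_column="id"):
    """Rows of the ACT dump whose ID is in the candidate set."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        return [row for row in csv.DictReader(f) if row[id_column] in wanted_ids]

def combine(act_rows, commons_rows, key="id"):
    """Left-join Commons metadata onto the ACT rows by a shared key."""
    commons_by_id = {row[key]: row for row in commons_rows}
    return [{**row, **commons_by_id.get(row[key], {})} for row in act_rows]
```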
I've summarized the files and how they were used in the first part of Phase 2 in the `create_items` folder README.
I used this script to start with the `clean_ids.csv` file and search for duplicates. I found a few that were missed in the `cleaned_output.csv` file, and moved them to the `duplicates_of_existing_commons_ids.csv` file.
So `cleaned_output.csv` really should be the starting point, since it has the duplicates removed. When it was created, all of its URLs were checked to make sure that they dereferenced. This should be done again in case any others were taken down since the last check.
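Re-checking that each URL still dereferences can be sketched with a HEAD request, treating any 2xx status as success:

```python
import urllib.request
import urllib.error

def dereferences(url, timeout=10):
    """Return True if the URL resolves with an HTTP 2xx status."""
    req = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except urllib.error.URLError:
        return False
```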
The script also does a SPARQL query to get the items with ACT ID statements. This should be used in preference to `works_already_in_wikidata.csv`, since that file seems to be missing a few works.
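To rebuild the list directly from the query results, the standard SPARQL 1.1 JSON response can be parsed like this; the variable name `actId` is an assumption about how the query binds the ID:

```python
def act_ids_from_sparql_json(payload):
    """Collect ID strings from a SPARQL 1.1 JSON results payload.

    Assumes the query bound the ACT ID to the variable ?actId.
    """
    return {b["actId"]["value"] for b in payload["results"]["bindings"]}
```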
This issue has either been completed or has parts transferred to https://github.com/HeardLibrary/vandycite/issues/53
This document needs to include whatever was learned from the trials. Note the conclusion of Issue #17 about titles.