baskaufs closed 2 years ago
Check out the list of QC items. Also, there is the list add_to_wikidata.csv that theoretically should have the items that need to be added, but I don't know if it's up to date. So it should be crosschecked.
The clean_ids.csv file contains 3310 filenames and the add_to_wikidata.csv file contains 2317 filenames. The works_already_in_wikidata.csv file has 927 filenames.
3310 - 927 = 2383, which is approximately the number of add_to_wikidata.csv filenames, but there were some duplicates discovered in the previous list of works to be added to Wikidata, so those may persist in the clean_ids.csv list.
It's probably best to deduplicate the clean_ids list and generate the list of ACT works already in Wikidata via SPARQL rather than relying on these lists to be correct. But they might make a good cross-check.
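A minimal sketch of the deduplication and cross-check, assuming each CSV has already been read into a plain list of filename strings. The function names and sample filenames are hypothetical and are not taken from the repository.

```python
def deduplicate(filenames):
    """Return filenames with repeats removed, keeping first-seen order."""
    seen = set()
    unique = []
    for name in filenames:
        if name not in seen:
            seen.add(name)
            unique.append(name)
    return unique


def cross_check(clean_ids, already_in_wikidata, add_list):
    """Compare the expected to-add set against an existing add list."""
    expected = set(clean_ids) - set(already_in_wikidata)
    listed = set(add_list)
    return {
        "missing_from_add_list": sorted(expected - listed),
        "unexpected_in_add_list": sorted(listed - expected),
    }


# Made-up filenames, for illustration only.
clean = deduplicate(["a.jpg", "b.jpg", "a.jpg", "c.jpg"])
report = cross_check(clean, ["b.jpg"], ["a.jpg", "d.jpg"])
print(report)
```

Both keys of the report being empty would indicate that the existing add list agrees with the one derived from the other two lists.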
See https://github.com/HeardLibrary/vandycite/issues/43#issuecomment-1027561900 for notes on disambiguation.
Files deleted from Commons:
I moved all three of these from the cleaned_output.csv file to the images_removed_from_commons.csv file.
[Side note: Anne checked these images in ACT and two were already deleted and the remaining one was deleted. Just fyi. ]
Updated works_already_in_wikidata.csv using SPARQL query results. NOTE: some works have two image filenames if the work is depicted twice. So there are actually fewer than 953 works (number of rows) in Wikidata.
Removed works already in Wikidata and generated new add_to_wikidata.csv file. However, there are 27 rows that say they have Q IDs, so I'm going through manually to figure out why. In some cases, they didn't have English labels (the query incorrectly required that). Some are holdover Q IDs from when the items were linked to a non-artwork item.
After finishing that, regenerated the "already in Wikidata" table and removed those items from the add_to_wikidata.csv file, which now is clean, I think.
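A rough sketch of how rows that still carry a Q ID could be flagged for manual review. The column name `qid`, the sample rows, and the Q ID value are assumptions made for illustration; the real add_to_wikidata.csv may use different headers.

```python
import csv
import io


def split_by_qid(rows, qid_column="qid"):
    """Separate rows that carry a Q ID from rows that do not."""
    with_qid, without_qid = [], []
    for row in rows:
        # Treat missing or whitespace-only values as "no Q ID".
        value = (row.get(qid_column) or "").strip()
        (with_qid if value else without_qid).append(row)
    return with_qid, without_qid


# Made-up data standing in for add_to_wikidata.csv.
sample_csv = "filename,qid\nfoo.jpg,\nbar.jpg,Q123\nbaz.jpg,  \n"
rows = list(csv.DictReader(io.StringIO(sample_csv)))
flagged, clean_rows = split_by_qid(rows)
print([r["filename"] for r in flagged])
```

An empty `flagged` list would correspond to the file being clean in the sense described above.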
Remaining task, taken from what's left of https://github.com/HeardLibrary/vandycite/issues/43#issuecomment-1022279234
The remaining items should probably be double-checked to confirm that none of them have Wikidata links on their Commons pages, so that we don't accidentally make duplicate items. If any are found, the item pages should be examined to see if they are artworks (likely) or non-artworks. If artworks, then manually make the ACT ID link. If non-artworks, then no action is required -- we will be creating the artwork pages as a part of this process. However, we should keep a list of these cases, since the link on the Commons page should be transferred to the new artwork page we create.
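The decision rule above can be written down as a small function. This is only a sketch: the function name, its arguments, and the action strings are hypothetical, and deciding whether a linked item is an artwork is assumed to have been done by a person or an earlier step.

```python
def triage(qid, is_artwork):
    """Return the follow-up action for one Commons page.

    qid: Wikidata Q ID linked from the Commons page, or None if absent.
    is_artwork: whether the linked item describes an artwork.
    """
    if qid is None:
        return "create new artwork item"
    if is_artwork:
        return "manually add ACT ID to existing item"
    # Non-artwork link: still create the artwork item, but remember the
    # case so the Commons link can be moved to the new item later.
    return "create new artwork item; log Commons link for transfer"


# Made-up Q IDs, for illustration only.
print(triage(None, False))
print(triage("Q111", True))
print(triage("Q222", False))
```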
Quite a number of works are coming up in this last check. In at least some cases, the ACT record points to a Commons work that is a different image of the abstract artwork than the one considered primary and linked to by the Wikidata item for the artwork. Example:
Note that both Commons works link back to the same Wikidata item for the abstract artwork, even though only one is used for the image statement.
The list of Commons works that have links to Wikidata items is at https://github.com/HeardLibrary/vandycite/blob/master/act/create_items/wikidata_found.csv
These 403 items probably should have the ACT IDs added to them and this could be done using VanderBot. However, they probably need to all be checked to see why they weren't picked up before and in that case, it may be just as easy to add them manually.
The list of 2310 artwork items that need to be created, https://github.com/HeardLibrary/vandycite/blob/master/act/create_items/add_to_wikidata.csv, includes these works, but rather than removing them manually, it probably should just be recreated by screening as was done before.
Re-ran the screening after Charlotte's manual fixes using the disambiguate_prior_to_phase_2b.ipynb script. Steps were:
clean_ids.csv
Run the cells looking for duplicates. There weren't any unaccounted for (as before). The accounting should be clean_ids.csv - duplicates_of_existing_commons_ids.csv = cleaned_output.csv: 3310 - 72 = 3238, which is two higher than the cleaned_output.csv N=3236. Not sure why, but I guess that's OK.
works_already_in_wikidata.csv) was 969. After updating, there were 1492. Charlotte must have added about 350 -- not sure where the rest came from.
add_to_wikidata.csv file had 2311 works. After running the cell to remove works already in Wikidata again, the new number was 2025, which seems about right based on the number Charlotte added.
works_already_in_wikidata.csv and duplicates_of_existing_commons_ids.csv manually.
disambiguate_prior_to_phase_2b.ipynb. Saved updated version of add_to_wikidata.csv in both folders.
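One way the two-row discrepancy in the accounting could be tracked down is to subtract the lists as multisets rather than comparing only their lengths. The sketch below is an assumption about how that might be done, not the notebook's actual code; the helper name and sample filenames are made up.

```python
from collections import Counter


def unaccounted(clean_ids, duplicates, cleaned_output):
    """Filenames in clean_ids not explained by the other two lists."""
    remainder = Counter(clean_ids)
    remainder.subtract(duplicates)
    remainder.subtract(cleaned_output)
    # A positive count means the name appears more often in clean_ids
    # than in duplicates and cleaned_output combined.
    return sorted(name for name, n in remainder.items() if n > 0)


# Counts reported in the steps above.
discrepancy = 3310 - 72 - 3236

# Made-up filenames, for illustration only.
leftover = unaccounted(
    ["a.jpg", "b.jpg", "c.jpg", "c.jpg", "d.jpg"],
    ["b.jpg"],
    ["a.jpg", "c.jpg"],
)
print(discrepancy, leftover)
```

Because `Counter` keeps per-name counts, a filename repeated within clean_ids.csv shows up in the result, which a plain set difference would hide.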
clean_ids.csv is at https://github.com/HeardLibrary/vandycite/blob/master/act/clean_ids.csv