HeardLibrary / vandycite

Quality control clean_ids.csv and determine which need to be written #53

Closed baskaufs closed 2 years ago

baskaufs commented 2 years ago

clean_ids.csv is at https://github.com/HeardLibrary/vandycite/blob/master/act/clean_ids.csv

baskaufs commented 2 years ago

Check out the list of QC items. Also, there is the list add_to_wikidata.csv that theoretically should have the items that need to be added, but I don't know if it's up to date. So it should be crosschecked.

baskaufs commented 2 years ago

The clean_ids.csv file contains 3310 filenames and the add_to_wikidata.csv file contains 2317 filenames. The works_already_in_wikidata.csv file has 927 filenames.

3310 - 927 = 2383, which is approximately the number of add_to_wikidata.csv filenames, but there were some duplicates discovered in the previous list of works to be added to Wikidata, so those may persist in the clean_ids.csv list.

It's probably best to deduplicate the clean_ids list and generate the list of ACT works already in Wikidata via SPARQL rather than relying on these lists to be correct. But they might make a good cross-check.
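
A minimal sketch of that deduplication and cross-check, assuming each CSV has a `filename` column (the real column headers may differ) and using small in-memory stand-ins rather than the actual files. The function names are hypothetical, not taken from the existing notebooks:

```python
import csv
import io


def read_filenames(csv_text, column="filename"):
    """Return the values of one column, in file order."""
    return [row[column].strip() for row in csv.DictReader(io.StringIO(csv_text))]


def deduplicate(names):
    """Drop repeated names while keeping first-seen order."""
    return list(dict.fromkeys(names))


def cross_check(clean, already, to_add):
    """Compare the to-add list against (clean minus already-in-Wikidata)."""
    expected_add = set(clean) - set(already)
    add_set = set(to_add)
    return {
        "missing_from_add": sorted(expected_add - add_set),
        "unexpected_in_add": sorted(add_set - expected_add),
    }


# Toy stand-ins for clean_ids.csv, works_already_in_wikidata.csv,
# and add_to_wikidata.csv (invented rows, for illustration only).
clean_csv = "filename\nA.jpg\nB.jpg\nB.jpg\nC.jpg\n"
already_csv = "filename\nA.jpg\n"
add_csv = "filename\nB.jpg\n"

clean = deduplicate(read_filenames(clean_csv))
report = cross_check(clean, read_filenames(already_csv), read_filenames(add_csv))
print(len(clean), report)
```

Any filename reported under `missing_from_add` or `unexpected_in_add` would be a candidate for manual review.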

baskaufs commented 2 years ago

See https://github.com/HeardLibrary/vandycite/issues/43#issuecomment-1027561900 for notes on disambiguation.

baskaufs commented 2 years ago

Files deleted from Commons:

I moved all three of these from the cleaned_output.csv file to the images_removed_from_commons.csv file.

[Side note: Anne checked these images in ACT; two had already been deleted, and the remaining one has now been deleted. Just FYI.]

baskaufs commented 2 years ago

Updated works_already_in_wikidata.csv using SPARQL query results. NOTE: some works have two image filenames if the work is depicted twice. So there are actually fewer than 953 works (number of rows) in Wikidata.
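
A rough way to see how far the row count overstates the number of works is to group filenames by Q ID. This sketch assumes columns named `qid` and `filename` (the actual headers may differ) and uses made-up placeholder identifiers:

```python
import csv
import io
from collections import defaultdict


def filenames_by_work(csv_text, qid_col="qid", file_col="filename"):
    """Group image filenames under the Q ID of the work they depict."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        groups[row[qid_col]].append(row[file_col])
    return dict(groups)


# Placeholder rows, not real Wikidata items or Commons files.
sample = "qid,filename\nQ-A,a.jpg\nQ-A,b.jpg\nQ-B,c.jpg\n"

groups = filenames_by_work(sample)
n_rows = sum(len(files) for files in groups.values())
multi_image = {qid: files for qid, files in groups.items() if len(files) > 1}
print(n_rows, len(groups), multi_image)
```

Here `n_rows` is the row count and `len(groups)` is the number of distinct works; `multi_image` lists the works depicted more than once.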

baskaufs commented 2 years ago

Removed works already in Wikidata and generated new add_to_wikidata.csv file. However, there are 27 rows that say they have Q IDs, so I'm going through manually to figure out why. In some cases, they didn't have English labels (the query incorrectly required that). Some are holdover Q IDs from when the items were linked to a non-artwork item.

After finishing that, regenerated the "already in Wikidata" table and removed those items from the add_to_wikidata.csv file, which now is clean, I think.
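
For reference, one way to avoid dropping items that lack an English label is to put the label pattern inside `OPTIONAL`. This is only a sketch of the idea, not the query actually used; `P0000` is a placeholder for the ACT ID property, and the `build_query` helper is hypothetical:

```python
def build_query(id_property):
    """Return a SPARQL query string in which the English label is optional."""
    return f"""
SELECT DISTINCT ?item ?actId ?label WHERE {{
  ?item wdt:{id_property} ?actId .
  OPTIONAL {{
    ?item rdfs:label ?label .
    FILTER(LANG(?label) = "en")
  }}
}}
"""


# "P0000" is a placeholder; substitute the real property ID before use.
query = build_query("P0000")
print(query)
```

With the label pattern outside `OPTIONAL`, any item with no English label would be excluded from the results, which matches the problem described above.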

baskaufs commented 2 years ago

Remaining task, taken from what's left of https://github.com/HeardLibrary/vandycite/issues/43#issuecomment-1022279234

The remaining items should probably be double-checked to confirm that none of them have Wikidata links on their Commons pages, so that we don't accidentally create duplicate items. If any are found, the item pages should be examined to see whether they are artworks (likely) or non-artworks. If artworks, then manually make the ACT ID link. If non-artworks, then no action is required -- we will be creating the artwork pages as a part of this process. However, we should keep a list of these cases, since the link on the Commons page should be transferred to the new artwork page we create.
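
The decision rule above could be written down roughly as follows. The function and the action labels are hypothetical, just a compact restatement of the paragraph:

```python
def triage(has_wikidata_link, is_artwork=None):
    """Map what was found on a Commons page to the follow-up action."""
    if not has_wikidata_link:
        # Nothing linked, so the artwork item still needs to be created.
        return "create_new_item"
    if is_artwork is None:
        # A link exists, but the item page has not been examined yet.
        return "examine_item_page"
    if is_artwork:
        return "add_act_id_manually"
    # Linked item is not an artwork: create the artwork item as planned,
    # and record the case so the Commons link can be moved to it later.
    return "create_new_item_and_log_for_link_transfer"


for case in [(False, None), (True, None), (True, True), (True, False)]:
    print(case, "->", triage(*case))
```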

baskaufs commented 2 years ago

There are actually quite a number of works coming up in this last check. In at least some cases, the ACT record points to a Commons file that is a different image of the abstract artwork than the one considered primary and linked from the Wikidata item for the artwork. Example:

Note that both Commons works link back to the same Wikidata item for the abstract artwork, even though only one is used for the image statement.

baskaufs commented 2 years ago

The list of Commons works that have links to Wikidata items is at https://github.com/HeardLibrary/vandycite/blob/master/act/create_items/wikidata_found.csv

These 403 items probably should have the ACT IDs added to them, and this could be done using VanderBot. However, they probably all need to be checked to see why they weren't picked up before, in which case it may be just as easy to add them manually.

The list of 2310 artwork items that need to be created (https://github.com/HeardLibrary/vandycite/blob/master/act/create_items/add_to_wikidata.csv) includes these works, but rather than removing them manually, it probably should just be recreated by screening, as was done before.

baskaufs commented 2 years ago

Re-ran the screening after Charlotte's manual fixes using the disambiguate_prior_to_phase_2b.ipynb script. Steps were:

  1. Starting with clean_ids.csv, ran the cells that look for duplicates. None were unaccounted for (as before). The accounting should be clean_ids.csv - duplicates_of_existing_commons_ids.csv = cleaned_output.csv: 3310 - 72 = 3238, which is two higher than the cleaned_output.csv count (N=3236). Not sure why, but I guess that's OK.
  2. Skipped the dereferencing test; it takes too long and was run recently.
  3. The previous number of works already in Wikidata (from works_already_in_wikidata.csv) was 969. After updating, there were 1492. Charlotte must have added about 350 -- not sure where the rest came from.
  4. The previous add_to_wikidata.csv file had 2311 works. After running the cell to remove works already in Wikidata again, the new number was 2025, which seems about right based on the number Charlotte added.
  5. Re-ran the check for the little Wikidata flags. Discovered that my script was picking up bad links to artists and fixed that. Ran it again, but there were still about 15 new Commons works showing up with links to Wikidata that hadn't been caught before. In some cases, it's because someone recently added the link to the Commons page. In others, there were a variety of issues that I resolved individually. Updated works_already_in_wikidata.csv and duplicates_of_existing_commons_ids.csv manually.
  6. Re-ran cell to remove works already in Wikidata in disambiguate_prior_to_phase_2b.ipynb. Saved updated version of add_to_wikidata.csv in both folders.
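
The screening and the accounting check in the steps above can be sketched as below. This is an illustration under assumptions, not the code in disambiguate_prior_to_phase_2b.ipynb: the function names are hypothetical and the lists are toy data, with only the three counts from step 1 taken from this thread:

```python
def screen(clean_ids, duplicates, already_in_wikidata):
    """Remove known duplicates, then remove works already in Wikidata."""
    dup = set(duplicates)
    done = set(already_in_wikidata)
    cleaned = [name for name in clean_ids if name not in dup]
    to_add = [name for name in cleaned if name not in done]
    return cleaned, to_add


def accounting_gap(n_clean, n_duplicates, n_cleaned):
    """Difference between the expected and the actual cleaned count."""
    return (n_clean - n_duplicates) - n_cleaned


# Toy filenames, for illustration only.
cleaned, to_add = screen(["a", "b", "c", "d"], ["b"], ["c"])
print(cleaned, to_add)

# Counts reported in step 1: 3310 clean IDs, 72 duplicates, 3236 cleaned rows.
print(accounting_gap(3310, 72, 3236))
```

A nonzero gap, like the one in step 1, could be traced by listing the filenames in clean_ids.csv that appear in neither the duplicates file nor cleaned_output.csv.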