HeardLibrary / vandycite


Perform manual checks on combined CSV, then write with VanderBot #56

Closed baskaufs closed 2 years ago

baskaufs commented 2 years ago

There is an edge case where a Commons image does not have a link to a Wikidata artwork, but is a derivative (crop or black-and-white version) of another Commons image that does. In this case, we would erroneously create an artwork item for the derivative (basically creating a duplicate artwork item), when actually that image should probably just be set aside until we decide how to handle derivative cases. This happened with an image of some shoes that was a crop of a larger, famous wedding painting; the duplicate was picked up and flagged by the Sum of All Paintings people. This is most likely to happen in cases where the artwork is famous, so hopefully these will get picked up via the little Wikidata flag links. However, we probably should manually check any that are by famous artists to see if we can find an actual artwork item for the Commons work.

baskaufs commented 2 years ago

In two emails from Charlotte (on 2022-04-22), she reported that she's finished most of the screening. The three problematic categories are:

  1. Names with low similarity scores (possible conflation of artist and photographer, probably primarily for 3D works).
  2. Works with significant difference in inception dates between ACT and Commons (listed in "issues with inception dates" Word doc she sent).
  3. Rows in the spreadsheet that are still blank in the creator column. I asked in a follow-up email for clarification about what those cells might involve.
baskaufs commented 2 years ago

Created script remove_problematic_rows_before_upload.ipynb to do the steps outlined above. That resulted in about half of the rows getting screened out.

baskaufs commented 2 years ago

Remaining steps before upload:

  1. Do some label cleanup as was done in the ETD script (remove double spaces, change quotes to single quotes)
  2. Spot check some of the labels that say "detail" to see how frequently they are details of works already in Wikidata.
  3. Use the code from the gallery script to add some additional text to the description if two works have identical labels and descriptions.
  4. Manual cleanup to get rid of stuff like square brackets or other weird characters in the label.
  5. The title field is currently blank. It needs to be filled in using the label for labels that are in English.
  6. Fix the "creatorobject has_role" column, which isn't separated from the next one. Rows with "anon" creators need a value there.
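The label cleanup in items 1, 4, and 5 can be sketched roughly as below (a minimal illustration, not the actual notebook code; the function name and the exact set of characters cleaned are assumptions):

```python
import re

def clean_label(label):
    """Tidy a label string: change double quotes to single quotes,
    drop square brackets, and collapse runs of whitespace."""
    label = label.replace('"', "'")          # item 1: quotes to single quotes
    label = label.replace('[', '').replace(']', '')  # item 4: stray brackets
    label = re.sub(r'\s{2,}', ' ', label)    # item 1: remove double spaces
    return label.strip()
```

For item 5, the cleaned label could then be copied into the blank title field for rows whose label language is English.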
baskaufs commented 2 years ago

Did the spot checking for the second item above: almost every "detail" work, plus a lot of the other works, has a link to a Wikidata item. So they need to be checked again for the little flag links from the Commons page to find out why they got missed. Then all of those details need to be pulled out and put with the others that we pulled out before.

baskaufs commented 2 years ago

I've now finished most of the tasks in the list above using this script. Here are the files I generated during the screening process:

The remaining 911 images are in the works_to_write.csv file. I've cleaned up the labels and generated the title columns and theoretically they could be close to ready to upload. However, while doing spot checks, it seems like about half of the paintings and some other artwork categories are already in Wikidata -- we just haven't detected them. So I don't currently feel good about potentially creating that many duplicates.

Also, a lot of the works have "detail" in their name, and many times they are details of artworks that are already in Wikidata, or details of other entire works that we'd be writing. So those may need to be pulled out as a separate category of things to deal with later (as I think we've already done with black-and-white, etc. variants).

baskaufs commented 2 years ago

Added fuzzy matching for titles to remove_problematic_rows_before_upload.ipynb. The result is possible_matches.csv, where the matches are ranked by score. Many of the high-ranked works are definite matches; some of the low-ranked ones are also matches but rank low because the ACT name varies from what's in Wikidata (for example, because it's a translation from a non-English language).
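The ranking idea can be sketched with the standard library's difflib (the notebook may use a different matching library; the titles below are invented for illustration):

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Case-insensitive similarity ratio in [0, 1] between two titles."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Rank candidate Wikidata titles against an ACT title, best match first.
act_title = 'The Wedding at Cana'
candidates = ['Wedding at Cana', 'The Last Supper', 'Wedding Feast at Cana']
ranked = sorted(candidates, key=lambda t: similarity(act_title, t),
                reverse=True)
```

Low-scoring rows still need eyeballing, since translated titles score poorly even when they are genuine matches.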

In some cases the reason we didn't discover the match was because the Wikidata item simply isn't linked (no image property, no backlink from Commons). In other cases, the Commons image ACT uses is a different version from the one linked through the image property in Wikidata and there is no back-link from the Commons image used by ACT.
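One way to test the first failure mode (item exists but has no image property pointing at the file) is a SPARQL check against the Wikidata Query Service. This is a sketch only; the exact URI form that WDQS uses for commons-media values is an assumption here, and the notebook's actual check may work differently:

```python
def p18_query(commons_filename):
    """Build a SPARQL query for Wikidata items whose image (P18) is the
    given Commons file. An empty result suggests the file is unlinked
    from the Wikidata side (or linked through a different property)."""
    # WDQS represents commons-media values as Special:FilePath URIs
    # with spaces percent-encoded (assumption).
    name = commons_filename.replace('_', ' ').replace(' ', '%20')
    return ('SELECT ?item WHERE { ?item wdt:P18 '
            '<http://commons.wikimedia.org/wiki/Special:FilePath/'
            + name + '> . }')
```

Even when this query finds nothing, the artwork item may still exist, which is the second failure mode: ACT uses a different Commons version of the image than the one linked via P18.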

baskaufs commented 2 years ago

Added section to script to remove works that are potential matches and that contain "Detail" or "detail" into file works_that_are_details.csv (34 images). That leaves 661 works to write in works_to_write.csv.
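That split could look something like this (a sketch with plain dicts; the real notebook works on the CSV, and the `label` key name is an assumption):

```python
def split_details(rows, label_key='label'):
    """Partition rows into (details, to_write) depending on whether the
    label contains the word 'detail' in any case."""
    details = [r for r in rows if 'detail' in r[label_key].lower()]
    to_write = [r for r in rows if 'detail' not in r[label_key].lower()]
    return details, to_write
```

The `details` list would be written to works_that_are_details.csv and the remainder kept in works_to_write.csv.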

baskaufs commented 2 years ago

Renamed works_to_write.csv to abstract_artworks.csv and uploaded 656 works from it. A few works got moved to the duplicates file.

baskaufs commented 2 years ago

Added a section to the script to screen and fix descriptions for works with unidentified artists. It added anonymous values where appropriate. (Screenshot attached: picture_app_screenshot.)

Screened and uploaded 507 items using VanderBot.

baskaufs commented 2 years ago

Decision grid for description screening. (Attached image: IMG_3904.)

baskaufs commented 2 years ago

Went through Anne's spreadsheet with the date cleanup and corrected dates where necessary in the date_problems.csv file. Fixed any missing artists, changed "anon" to _:, then screened manually for works that were details of others and put them here.

Did the fuzzy matching check and also some manual checks to discover more works that were already in Wikidata and put them here. Did some final label and description cleanup, then wrote 101 new artwork items. Upload records are in abstract_artworks_from_fixed_dates.csv