baskaufs closed this issue 2 years ago
In two emails from Charlotte (on 2022-04-22), she reported that she's finished most of the screening. The three problematic categories are:
Created script remove_problematic_rows_before_upload.ipynb to do the steps outlined above. That resulted in about half of the rows getting screened out.
Remaining steps before upload:
Did the spot checking for the second item above. Almost every one of them, plus a lot of the other works, has a link to a Wikidata item. So they need to be checked again for the little flag links from the Commons page to find out why they got missed. Then all of those details need to be pulled out and put with the others that we pulled out before.
I've now finished most of the tasks in the list above using this script. Here are the files I spawned during the screening process:
creator
missing_creators.csv
The remaining 911 images are in the works_to_write.csv file. I've cleaned up the labels and generated the title columns and theoretically they could be close to ready to upload. However, while doing spot checks, it seems like about half of the paintings and some other artwork categories are already in Wikidata -- we just haven't detected them. So I don't currently feel good about potentially creating that many duplicates.
Also, a lot of the works have "detail" in their name, and many times they are details of artworks that are already in Wikidata, or are details of other entire works that we'd be writing. So those may need to be pulled out as a separate category of things to deal with later (as I think we've already done with black and white, etc. variants).
Added fuzzy matching for titles to remove_problematic_rows_before_upload.ipynb. The result is possible_matches.csv, where the matches are ranked by score. Many of the high-ranked works are definite matches. Some of the low-ranked ones are also matches, but they rank low because the ACT name varies from what's in Wikidata (for example because it's a translation from a non-English language).
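For reference, here is a minimal sketch of this kind of title ranking. It is not the code in the notebook: it swaps in `difflib.SequenceMatcher` from the Python standard library as the scorer, and the function names, the 0-100 scale, the threshold, the placeholder Q IDs and the sample titles are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a 0-100 similarity score, ignoring case and outer whitespace."""
    ratio = SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()
    return round(ratio * 100)


def rank_possible_matches(act_titles, wikidata_labels, threshold=60):
    """Pair ACT titles with Wikidata labels that score at or above threshold.

    wikidata_labels maps an item ID to its label (hypothetical structure).
    The result is sorted by score, highest first.
    """
    matches = []
    for title in act_titles:
        for qid, label in wikidata_labels.items():
            score = similarity(title, label)
            if score >= threshold:
                matches.append(
                    {"act_title": title, "qid": qid, "label": label, "score": score}
                )
    return sorted(matches, key=lambda row: row["score"], reverse=True)


# Made-up sample data; the IDs are placeholders, not real Wikidata items.
act_titles = ["The Last Supper", "La Cène"]
wikidata_labels = {"Q_EXAMPLE_1": "The Last Supper", "Q_EXAMPLE_2": "Sunflowers"}
ranked = rank_possible_matches(act_titles, wikidata_labels)
```

A character-based scorer like this one shows the limitation described above: a translated title shares few characters with the English label, so a true match can fall low in the ranking or below the threshold, and it still needs a manual check.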
In some cases the reason we didn't discover the match was because the Wikidata item simply isn't linked (no image property, no backlink from Commons). In other cases, the Commons image ACT uses is a different version from the one linked through the image property in Wikidata and there is no back-link from the Commons image used by ACT.
Added section to script to remove works that are potential matches and that contain "Detail" or "detail" into file works_that_are_details.csv (34 images). That leaves 661 works to write in works_to_write.csv.
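A rough sketch of the screening step, under assumptions: the note can be read more than one way, and this version treats the two screens as separate buckets (works whose label contains "Detail" or "detail", and other works flagged as potential matches), with everything else left to write. The column names `act_id` and `label`, the function name and the sample rows are hypothetical and not taken from the notebook.

```python
def screen_works(works, possible_match_ids):
    """Split works into details, possible matches, and works left to write.

    works: list of dicts with hypothetical keys "act_id" and "label".
    possible_match_ids: set of act_id values flagged by the fuzzy matching.
    """
    details, possible, to_write = [], [], []
    for work in works:
        label = work["label"]
        # Match only the two spellings named in the note.
        if "Detail" in label or "detail" in label:
            details.append(work)
        elif work["act_id"] in possible_match_ids:
            possible.append(work)
        else:
            to_write.append(work)
    return details, possible, to_write


# Made-up rows for illustration only.
works = [
    {"act_id": "1", "label": "Wedding Portrait"},
    {"act_id": "2", "label": "Wedding Portrait, detail"},
    {"act_id": "3", "label": "Detail of an altarpiece"},
    {"act_id": "4", "label": "Abstract composition"},
]
details, possible, to_write = screen_works(works, possible_match_ids={"1"})
```

Each bucket could then be written to its own CSV (for example with `csv.DictWriter`), so that nothing is discarded and the set-aside works can be revisited later.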
Changed name of works_to_write.csv and uploaded 656 works from abstract_artworks.csv. A few works got moved to the duplicates file.
Created script addition to screen and fix descriptions for works with unidentified artists. It added anonymous values where appropriate.
Screened and uploaded 507 items using VanderBot.
Decision grid for description screening:
Went through Anne's spreadsheet with the date cleanup and corrected dates if necessary in the date_problems.csv file. Fixed any missing artists, changed "anon" to "_:", then screened manually for works that were details of others and put them here.
Did the fuzzy matching check and also some manual checks to discover more works that were already in Wikidata and put them here. Did some final label and description cleanup, then wrote 101 new artwork items. Upload records are in abstract_artworks_from_fixed_dates.csv
There is an edge case where a Commons image does not have a link to a Wikidata artwork, but is a derivative (crop or black and white version) of another Commons image that does. In this case, we would erroneously create an artwork item for the derivative (basically creating a duplicate artwork item), when actually that image should probably just be set aside until we decide how to handle derivative cases. This happened with an image of some shoes that was a crop of a larger, famous wedding painting; it was picked up and flagged by the Sum of All Paintings people. This is most likely to happen in cases where the artwork is famous, so hopefully these will get picked up via the little Wikidata flag links. However, we probably should manually check any that are by famous artists to see if we can find an actual artwork item for the Commons work.