cc-archive / cccatalog

[PROJECT TRANSFERRED] Mapping the commons towards an open ledger and cc search.
https://github.com/WordPress/openverse-catalog
MIT License

[Bug] Met Museum Foreign Identifiers are non-deterministic #416

Closed mathemancer closed 4 years ago

mathemancer commented 4 years ago

Bug Description

A significant portion of the Foreign IDs for images from the Met Museum has been 'randomly shuffled' on every collection run since the very beginning, so a given image does not keep the same Foreign ID from one run to the next.

This blocks our deduplication of those images (see #188). The further problem is that it will be quite difficult (possibly impossible) to reassociate the proper Clarifai image tags with the proper images.

Expected behavior

The Foreign ID should never change for a given image.
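
As an illustration of what "never change" could look like in practice, here is a minimal sketch. It assumes (this is not confirmed anywhere in this issue) that each image has a stable URL and belongs to an object with a stable numeric ID. The function names and the ID format are hypothetical, not what the provider script currently does.

```python
import hashlib


def deterministic_foreign_id(object_id: int, image_url: str) -> str:
    """Build an ID that depends only on the object ID and the image URL.

    Hypothetical scheme: the result does not depend on the order in which
    images are returned, nor on when the script is run.
    """
    url_digest = hashlib.sha1(image_url.encode("utf-8")).hexdigest()[:12]
    return f"{object_id}-{url_digest}"


def ids_for_object(object_id: int, image_urls: list[str]) -> dict[str, str]:
    """Map every image URL of one object to its Foreign ID."""
    return {url: deterministic_foreign_id(object_id, url) for url in image_urls}
```

Because the ID is computed from the image itself rather than from its position in a response, collecting the same object twice, even with the images listed in a different order, yields the same Foreign ID for each image.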

Additional context

  1. Modify the script to save the images with a deterministic Foreign ID, and also save metadata about which real-world object the image shows.
  2. Run that script on the entire collection, essentially duplicating all Met Museum images once more (but in a controlled manner) in the image table (the main table with our image metadata).
  3. Create a new table holding every row from the image table that contains Met Museum metadata.
  4. Delete all rows from the image table containing Met Museum metadata.
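
Steps 3 and 4 above could be sketched roughly as follows. This is only an illustration using an in-process SQLite connection so that it runs anywhere; the `provider` column, the backup table name, and the function name are assumptions made for the example, and the real schema and database engine will differ.

```python
import sqlite3


def quarantine_provider_rows(
    conn: sqlite3.Connection, provider: str, backup_table: str
) -> tuple[int, int]:
    """Copy all rows for one provider out of `image`, then delete them.

    Returns (rows_copied, rows_deleted). The column name `provider` is a
    placeholder for however Met Museum rows are actually identified.
    """
    # The table name is interpolated into SQL, so only accept a plain identifier.
    if not backup_table.isidentifier():
        raise ValueError(f"unsafe table name: {backup_table!r}")

    # Create an empty table with the same columns as `image`.
    conn.execute(f"CREATE TABLE {backup_table} AS SELECT * FROM image WHERE 0")

    # The copy and the delete are committed together, or rolled back together
    # if either statement raises.
    with conn:
        copied = conn.execute(
            f"INSERT INTO {backup_table} SELECT * FROM image WHERE provider = ?",
            (provider,),
        ).rowcount
        deleted = conn.execute(
            "DELETE FROM image WHERE provider = ?", (provider,)
        ).rowcount
    return copied, deleted
```

Comparing the two returned counts before going any further is a cheap sanity check that nothing was deleted without first being copied.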

At that point, #188 will be unblocked, but the front end will show no images from the Met Museum until the following steps are complete:

  1. Write a script that cleans up and fixes the Met Museum data (this will be complicated and will require serious effort).
  2. Insert the cleaned metadata for the Met Museum into the image table.