Migrate Science Museum from Common Crawl to Provider API scripts

kgodey commented 4 years ago

In order to migrate a provider from Common Crawl to a Provider API Script, we need to do some work in the database.

We need to associate the new Foreign ID (from the Provider API Script) with the former Foreign ID (from Common Crawl). We expect these to be different. However, making this mapping is essential to make sure that we don't lose any data (e.g., enriched metadata or tags we've generated) in the migration.

The best way to accomplish this would be to:

1. Check that the Science Museum URLs are not in need of deduplication (this will require looking through the data by eye).
2. If (1) succeeds, use the URL of the image to associate the old and new versions of the row.
3. Merge the tags and metadata between the old and new versions of the row using the same logic used by the tsv_to_postgres_loader workflow.
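
As a rough illustration of steps 2 and 3, here is a minimal Python sketch that pairs old and new rows on their image URL and merges their tags and metadata. It is not the tsv_to_postgres_loader logic itself. The merge rules (new metadata values win on conflict, tags are unioned and de-duplicated on name and provider) and the dictionary keys `url`, `foreign_identifier`, `meta_data`, and `tags` are assumptions made for the example.

```python
def merge_metadata(old, new):
    """Shallow merge: keep every old key, let new values win on conflict."""
    return {**(old or {}), **(new or {})}


def merge_tags(old, new):
    """Union of both tag lists, de-duplicated on (name, provider)."""
    seen = set()
    merged = []
    for tag in (old or []) + (new or []):
        key = (tag.get("name"), tag.get("provider"))
        if key not in seen:
            seen.add(key)
            merged.append(tag)
    return merged


def merge_rows_by_url(old_rows, new_rows):
    """Pair rows on their image URL and merge tags and metadata.

    The new row supplies every other column, including the new
    foreign identifier. New rows without an old match pass through.
    """
    old_by_url = {row["url"]: row for row in old_rows}
    merged = []
    for row in new_rows:
        old = old_by_url.get(row["url"])
        if old is None:
            merged.append(dict(row))
            continue
        merged.append({
            **row,
            "meta_data": merge_metadata(old.get("meta_data"), row.get("meta_data")),
            "tags": merge_tags(old.get("tags"), row.get("tags")),
        })
    return merged
```

Matching on the exact URL string only works if step 1 succeeds, that is, if each image has a single stable URL in both the old and new data.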

mathemancer commented 4 years ago

After looking at the data, I found that the Science Museum URLs changed domain at some point, so these rows are in need of deduplication.
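
For reference, a check along these lines can flag the problem. This is a minimal sketch, assuming the path and query string of an image URL stayed the same when the domain changed (not verified here). The function names and the example.org hosts in the usage note are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def domain_independent_key(url):
    """Identify a URL by path and query only, ignoring scheme and host."""
    parts = urlsplit(url)
    return (parts.path, parts.query)


def find_cross_domain_duplicates(urls):
    """Return {key: set_of_hosts} for keys seen under more than one host."""
    hosts_by_key = defaultdict(set)
    for url in urls:
        host = urlsplit(url).netloc.lower()
        hosts_by_key[domain_independent_key(url)].add(host)
    return {key: hosts for key, hosts in hosts_by_key.items() if len(hosts) > 1}
```

Given `https://old.example.org/images/1.jpg` and `https://new.example.org/images/1.jpg`, the function reports the key `("/images/1.jpg", "")` with both hosts, which is the kind of pair that exact URL matching would miss.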

Thus, we'll move the current data out of the image table, and proceed without it (hopefully we can add it back in at some point).

mathemancer commented 4 years ago

This is done. There is a table called science_museum_2020_06_02 in the Upstream DB with the old data (including Clarifai tags) from Science Museum, and all Science Museum data is gone from the image table.

cc-archive/cccatalog#415