Closed by kgodey 4 years ago
After looking at the data, it appears that the Science Museum URLs changed domain at some point, so these rows need deduplication.
We'll therefore move the current data out of the image table and proceed without it (hopefully we can add it back in at some point).
This is done. There is a table called science_museum_2020_06_02 in the Upstream DB with the old data (including Clarifai tags) from Science Museum, and all Science Museum data is gone from the image table.
In order to migrate a provider from Common Crawl to a Provider API Script, we need to do some work in the database.
We need to associate the new Foreign ID (from the Provider API Script) with the former Foreign ID (from Common Crawl). We expect these to be different. However, making this mapping is essential to make sure that we don't lose any data (e.g., enriched metadata or tags we've generated) in the migration.
The best way to accomplish this would be to:
tsv_to_postgres_loader workflow.
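As a rough illustration of the Foreign ID mapping described above, here is a minimal Python sketch. It is not the approach decided on in this issue. It assumes that old (Common Crawl) and new (Provider API Script) rows can be paired by the path of their landing URL, ignoring the domain, since the domain is known to have changed. The row keys `foreign_id` and `url`, the helper names, and the example URLs are all hypothetical.

```python
from urllib.parse import urlsplit


def url_key(url):
    """Return a domain-independent key: the lowercased URL path
    with any trailing slash removed."""
    return urlsplit(url).path.rstrip("/").lower()


def map_foreign_ids(old_rows, new_rows):
    """Pair old foreign IDs with new ones when their URL paths match.

    Returns (mapping, unmatched): mapping is {old_id: new_id}, and
    unmatched lists old IDs with no counterpart in new_rows.
    """
    # Index the new rows by URL key; keep the first ID seen per key.
    new_by_key = {}
    for row in new_rows:
        new_by_key.setdefault(url_key(row["url"]), row["foreign_id"])

    mapping = {}
    unmatched = []
    for row in old_rows:
        new_id = new_by_key.get(url_key(row["url"]))
        if new_id is None:
            unmatched.append(row["foreign_id"])
        else:
            mapping[row["foreign_id"]] = new_id
    return mapping, unmatched


# Hypothetical rows: same path, different domain.
old_rows = [
    {"foreign_id": "cc-1", "url": "https://old.example.org/objects/co123/"},
    {"foreign_id": "cc-2", "url": "https://old.example.org/objects/co999"},
]
new_rows = [
    {"foreign_id": "api-1", "url": "https://new.example.org/objects/co123"},
]
mapping, unmatched = map_foreign_ids(old_rows, new_rows)
```

Old IDs that end up in `unmatched` would need manual review before any enriched metadata or tags are carried over, because dropping them silently is exactly the data loss the migration is meant to avoid.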