cc-archive / cccatalog

[PROJECT TRANSFERRED] Mapping the commons towards an open ledger and cc search.
https://github.com/WordPress/openverse-catalog
MIT License

[Infrastructure] Merge data from Common Crawl Scripts with data from Provider API Scripts in PostgreSQL #468

Closed: mathemancer closed this issue 4 years ago

mathemancer commented 4 years ago

Current Situation

We currently have no good plan or method for migrating from data generated by Common Crawl scripts to data generated by Provider API Scripts. Whenever we migrate a provider's ingestion from one script type to the other, we simply copy all the old data for that provider to a new table, and then delete it from the image table.

Some of the data in the old table is generated, e.g., tags derived from Clarifai (and soon Rekognition).

Suggested Improvement

We need a way to merge the relevant data from the tables holding the old data into the new entries in the image table that correspond to the same images.
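
As a rough illustration of what such a merge could look like, here is a minimal in-memory sketch in Python. It is not the project's implementation. It assumes (hypothetically) that each row is a dict with `provider`, `foreign_identifier`, and `tags` keys, that each tag is a dict with `name` and `provider`, and that "the same image" can be identified by the `(provider, foreign_identifier)` pair. The function names and the `generated_providers` parameter are made up for this example; a real solution would most likely do the equivalent work in PostgreSQL.

```python
from typing import Any, Dict, Iterable, List, Optional, Tuple

Row = Dict[str, Any]
Tag = Dict[str, Any]


def merge_tags(
    new_tags: Optional[List[Tag]],
    old_tags: Optional[List[Tag]],
    generated_providers: Iterable[str] = ("clarifai",),
) -> List[Tag]:
    """Keep every new tag, then append old generated tags not already present.

    Only old tags whose 'provider' is in generated_providers are carried over
    (an assumption about which data counts as "relevant").
    """
    generated = set(generated_providers)
    merged = list(new_tags or [])
    seen = {(t.get("name"), t.get("provider")) for t in merged}
    for tag in old_tags or []:
        key = (tag.get("name"), tag.get("provider"))
        if tag.get("provider") in generated and key not in seen:
            merged.append(tag)
            seen.add(key)
    return merged


def merge_old_into_new(new_rows: List[Row], old_rows: List[Row]) -> List[Row]:
    """Return copies of new_rows with generated tags merged in from old_rows.

    Rows are matched on (provider, foreign_identifier), which is an assumed
    identity key for this sketch.
    """
    old_by_key: Dict[Tuple[str, str], Row] = {
        (r["provider"], r["foreign_identifier"]): r for r in old_rows
    }
    result = []
    for row in new_rows:
        merged_row = dict(row)  # shallow copy so the input is not mutated
        old = old_by_key.get((row["provider"], row["foreign_identifier"]))
        if old is not None:
            merged_row["tags"] = merge_tags(row.get("tags"), old.get("tags"))
        result.append(merged_row)
    return result
```

New rows with no matching old row pass through unchanged, and old rows with no matching new row are ignored here; whether those should be kept somewhere is a separate decision that this sketch does not address.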

Benefit

We can save the data we've paid to generate.

Additional context

This will be a bit complex.