Current Situation
We currently have no good plan or method for migrating from data generated by Common Crawl scripts to data generated by Provider API scripts. Whenever we migrate a provider's ingestion from one script type to the other, we simply copy all the old data for that provider to a new table and then delete it from the image table.
Some of the data in the old table was generated by us rather than ingested, e.g., tags derived from clarifai (and soon Rekognition).
Suggested Improvement
We need a way to merge the relevant data from those tables into the new entries in the image table corresponding to the same images.
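As a rough illustration of what merging could mean for tags, here is a minimal Python sketch. It assumes each row's tags are held as a list of dicts with "name" and "provider" keys; that shape and the merge_tags helper are hypothetical, chosen only for the example, and the real column layout may differ.

```python
def merge_tags(new_tags, old_tags):
    """Return the new row's tags plus any old tags it does not already have.

    Assumption: a tag is a dict with "name" and "provider" keys, and two
    tags are the same tag when both of those values are equal.
    """
    merged = list(new_tags or [])
    seen = {(tag.get("name"), tag.get("provider")) for tag in merged}
    for tag in old_tags or []:
        key = (tag.get("name"), tag.get("provider"))
        if key not in seen:
            seen.add(key)
            merged.append(tag)
    return merged
```

Tags already on the API-sourced row come first and are never overwritten; old tags are only appended when they are missing, so generated tags survive the migration without creating duplicates.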
Benefit
We can save the data we've paid to generate.
Additional context
This will be a bit complex.
We cannot expect the (provider, foreign_identifier) pair for a given image to stay consistent between Common Crawl and API-sourced data. We should therefore attempt to match rows based on the direct image URL when possible.
Simple equality is not sufficient, since the URL scheme (http vs. https) has probably changed at some point over the life of the data.
It's even possible that the Common Crawl-sourced data contains both schemes for the same image.
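One way to compare URLs without tripping over the scheme is to reduce each URL to a scheme-independent key and match on that. The sketch below is only an illustration: the url_match_key and group_by_url_key helpers and the "url" field name are assumptions, not existing code, and it deliberately ignores other differences (such as a "www." prefix or an explicit default port) that may also need handling.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def url_match_key(url):
    """Return a key that is the same for the http and https forms of a URL."""
    parts = urlsplit(url.strip())
    # Host names are case-insensitive, so lowercase them. Paths and query
    # strings can be case-sensitive, so they are kept exactly as given.
    key = parts.netloc.lower() + parts.path
    if parts.query:
        key += "?" + parts.query
    return key


def group_by_url_key(rows):
    """Group rows (dicts with a hypothetical "url" field) by their match key."""
    groups = defaultdict(list)
    for row in rows:
        groups[url_match_key(row["url"])].append(row)
    return dict(groups)
```

Grouping the old rows this way also surfaces the case where both schemes exist for one image: any group holding more than one old row needs its generated data combined before it is merged into the new entry.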