cc-archive / cccatalog

[PROJECT TRANSFERRED] Mapping the commons towards an open ledger and cc search.
https://github.com/WordPress/openverse-catalog
MIT License

Come up with a solution for consuming crawler events #457

Closed aldenstpage closed 3 years ago

aldenstpage commented 4 years ago

We're reading metadata from images at a large scale and writing it to several Kafka topics. We should start incorporating this data into the data layer so that CC Search can use it. The format of the data is documented here. In summary:

This data can be produced continuously by the crawler, so we should prefer building streaming consumers over reading topics in batches.
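To make the streaming preference concrete, here is a minimal sketch of a consumer loop that handles one event at a time as it arrives, rather than pulling a whole topic into memory. It is only an illustration under assumptions: the events are assumed to be JSON objects, the `identifier` field in the usage example is hypothetical, and `decode_events` and `consume` are names made up for this sketch, not part of any existing code.

```python
import json
from typing import Callable, Iterable, Iterator


def decode_events(raw_messages: Iterable[bytes]) -> Iterator[dict]:
    """Lazily decode raw messages, skipping anything that is not a JSON object.

    Assumption: crawler events are JSON objects. The real schema is not
    reproduced here.
    """
    for raw in raw_messages:
        try:
            event = json.loads(raw)
        except ValueError:
            # Covers both invalid UTF-8 and invalid JSON; skip the message
            # so one bad record does not stop the stream.
            continue
        if isinstance(event, dict):
            yield event


def consume(raw_messages: Iterable[bytes], handle: Callable[[dict], None]) -> int:
    """Pass each decoded event to `handle` as it arrives; return the count handled."""
    handled = 0
    for event in decode_events(raw_messages):
        handle(event)
        handled += 1
    return handled


if __name__ == "__main__":
    # Stand-in for a topic: any iterable of bytes works, including an endless one.
    sample = [b'{"identifier": "a"}', b"not json", b'{"identifier": "b"}']
    print(consume(sample, print))
```

Because `consume` only needs an iterable of bytes, the source can be swapped without touching the handling logic. With the kafka-python client, for example, iterating a `KafkaConsumer` yields records whose `value` attribute holds the message bytes, so something like `consume((record.value for record in consumer), handle)` should work, though the topic names and broker settings would have to come from the crawler's actual configuration.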

We know from experience now that dumping this into the meta_data column en masse is not a good option, so this is a good time to start thinking about alternatives.