We're reading metadata from images at scale and publishing it to Kafka topics. We should start incorporating this data into the data layer so we can use it in CC Search. The format of the data is documented here. In summary:
The image_metadata_updates topic carries the dimensions, file size, and compression rate of images.
In some cases we can also extract EXIF metadata, which goes into the same image_metadata_updates topic.
We record 404s in the link_rot topic.
This data can be produced continuously by the crawler, so we should prefer building streaming consumers over reading topics in batches.
We know from experience now that dumping this into the meta_data column en masse is not a good option, so this is a good time to start thinking about alternatives.
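As a starting point for discussion, here is a minimal sketch of what a streaming consumer for the two topics could look like. It is not a proposed implementation: the `identifier` key, the `dead_link` flag, and the in-memory `store` dict are all hypothetical stand-ins, since the real message format is documented separately and the storage target is still an open question.

```python
import json
from collections import namedtuple

# Stand-in for a Kafka record. Only the topic name and the raw value are used.
Record = namedtuple("Record", ["topic", "value"])


def apply_metadata_update(store, payload):
    """Merge image metadata into the entry for this image.

    ASSUMPTION: each message carries an "identifier" key naming the image.
    """
    fields = {k: v for k, v in payload.items() if k != "identifier"}
    store.setdefault(payload["identifier"], {}).update(fields)


def apply_link_rot(store, payload):
    """Flag an image whose URL returned a 404.

    ASSUMPTION: "dead_link" is a hypothetical field name.
    """
    store.setdefault(payload["identifier"], {})["dead_link"] = True


# Route each topic to the function that knows how to apply its messages.
HANDLERS = {
    "image_metadata_updates": apply_metadata_update,
    "link_rot": apply_link_rot,
}


def consume(records, store, handlers=HANDLERS):
    """Apply records one at a time as they arrive; return how many were handled."""
    handled = 0
    for record in records:
        handler = handlers.get(record.topic)
        if handler is None:
            continue  # ignore topics we have no handler for
        handler(store, json.loads(record.value))
        handled += 1
    return handled
```

Because `consume` only needs an iterable of objects with `topic` and `value` attributes, it can be exercised with a plain list of `Record` tuples in tests. In production the iterable would presumably be a Kafka client consumer subscribed to both topics, and `store` would be replaced by whatever we choose instead of the meta_data column.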