Continue the workflow that transforms Common Crawl data into a billion-scale image-text-pair dataset. The dataset currently contains about 1.2 billion pairs.
Basic workflow steps are:
parse the Common Crawl WAT files and retain candidate URL-text pairs
download the images at those URLs, filter out unwanted items, and resize the remaining images to 224x224 with a center crop
send batches of images and metadata to CLIP inference to compute image-text similarity scores, then keep only pairs with a score > 0.3
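The first step can be sketched as follows. This is a minimal illustration, assuming each WAT record has already been parsed into a JSON dict; the key path (`Envelope` → `Payload-Metadata` → `HTTP-Response-Metadata` → `HTML-Metadata` → `Links`) and the `IMG@/src` path marker follow the Common Crawl WAT layout, but should be verified against real files:

```python
def extract_candidates(wat_record: dict) -> list[tuple[str, str]]:
    """Extract (image URL, alt text) candidate pairs from one parsed
    WAT record.  The nested key path below is an assumption based on
    the documented Common Crawl WAT JSON structure."""
    links = (
        wat_record.get("Envelope", {})
        .get("Payload-Metadata", {})
        .get("HTTP-Response-Metadata", {})
        .get("HTML-Metadata", {})
        .get("Links", [])
    )
    pairs = []
    for link in links:
        # Keep only <img> tags that carry both a source URL and alt text.
        if link.get("path") == "IMG@/src" and link.get("url") and link.get("alt"):
            pairs.append((link["url"], link["alt"]))
    return pairs
```

Records without an `HTML-Metadata` section (e.g. non-HTML responses) simply yield an empty list, so the function can be mapped over a whole WAT file without special-casing.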
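For the download-and-resize step, a minimal per-URL sketch using Pillow might look like this. The minimum-size filter threshold (64 px) and the failure handling are illustrative assumptions, not the workflow's actual filter rules:

```python
from io import BytesIO
from urllib.request import urlopen

from PIL import Image

TARGET = 224  # output side length from the workflow above


def center_crop_resize(img: Image.Image, size: int = TARGET) -> Image.Image:
    """Crop the largest centered square, then resize it to size x size."""
    w, h = img.size
    side = min(w, h)
    left = (w - side) // 2
    top = (h - side) // 2
    img = img.crop((left, top, left + side, top + side))
    return img.resize((size, size), Image.BICUBIC)


def fetch_and_prepare(url: str, timeout: float = 10.0) -> Image.Image | None:
    """Download one image and return it as a prepared 224x224 RGB image,
    or None on any failure (network error, decode error, too small)."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            data = resp.read()
        img = Image.open(BytesIO(data)).convert("RGB")
        if min(img.size) < 64:  # example filter: drop tiny images (assumption)
            return None
        return center_crop_resize(img)
    except Exception:
        return None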
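The final CLIP filtering step reduces to a cosine-similarity threshold. Assuming the image and text embeddings have already been produced by a CLIP model as two `(n, d)` arrays, the score > 0.3 cutoff can be applied like this:

```python
import numpy as np

THRESHOLD = 0.3  # CLIP similarity cutoff from the workflow above


def filter_by_similarity(
    image_emb: np.ndarray, text_emb: np.ndarray, threshold: float = THRESHOLD
) -> np.ndarray:
    """Given CLIP image and text embeddings of shape (n, d), return the
    indices of pairs whose cosine similarity exceeds the threshold."""
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    sims = np.sum(image_emb * text_emb, axis=1)  # row-wise cosine similarity
    return np.nonzero(sims > threshold)[0]
```

Keeping the filter as a pure array operation makes it easy to run on whole inference batches at once and to tune the threshold later.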