Continue the workflow that transforms Common Crawl data into a billion-scale image-text-pair dataset. The dataset currently contains about 1.2 billion pairs.
Basic workflow steps are:
parse the Common Crawl WAT files and retain candidate URL-text pairs
download the images at those URLs, filter out unwanted items, and resize the remaining images to 224x224 with a center crop
send batches of images and metadata to CLIP inference to compute image-text similarity scores, then keep only pairs with a score > 0.3
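The first step can be sketched as follows. This is a minimal illustration, assuming each WAT record has already been parsed into a JSON dict; the key path (`Envelope` → `Payload-Metadata` → `HTTP-Response-Metadata` → `HTML-Metadata` → `Links`) and the `IMG@/src` path marker follow the Common Crawl WAT layout, but should be verified against real files:

```python
def extract_candidates(wat_record: dict) -> list[tuple[str, str]]:
    """Extract (image URL, alt text) candidate pairs from one parsed
    WAT record.  The nested key path below is an assumption based on
    the documented Common Crawl WAT JSON structure."""
    links = (
        wat_record.get("Envelope", {})
        .get("Payload-Metadata", {})
        .get("HTTP-Response-Metadata", {})
        .get("HTML-Metadata", {})
        .get("Links", [])
    )
    pairs = []
    for link in links:
        # Keep only <img> tags that carry both a source URL and alt text.
        if link.get("path") == "IMG@/src" and link.get("url") and link.get("alt"):
            pairs.append((link["url"], link["alt"]))
    return pairs
```

Records without an `HTML-Metadata` section (e.g. non-HTML responses) simply yield an empty list, so the function can be mapped over a whole WAT file without special-casing.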
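For the download-and-resize step, a minimal per-URL sketch using Pillow might look like this. The minimum-size filter threshold (64 px) and the failure handling are illustrative assumptions, not the workflow's actual filter rules:

```python
from io import BytesIO
from urllib.request import urlopen

from PIL import Image

TARGET = 224  # output side length from the workflow above


def center_crop_resize(img: Image.Image, size: int = TARGET) -> Image.Image:
    """Crop the largest centered square, then resize it to size x size."""
    w, h = img.size
    side = min(w, h)
    left = (w - side) // 2
    top = (h - side) // 2
    img = img.crop((left, top, left + side, top + side))
    return img.resize((size, size), Image.BICUBIC)


def fetch_and_prepare(url: str, timeout: float = 10.0) -> Image.Image | None:
    """Download one image and return it as a prepared 224x224 RGB image,
    or None on any failure (network error, decode error, too small)."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            data = resp.read()
        img = Image.open(BytesIO(data)).convert("RGB")
        if min(img.size) < 64:  # example filter: drop tiny images (assumption)
            return None
        return center_crop_resize(img)
    except Exception:
        return None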
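The final CLIP filtering step reduces to a cosine-similarity threshold. Assuming the image and text embeddings have already been produced by a CLIP model as two `(n, d)` arrays, the score > 0.3 cutoff can be applied like this:

```python
import numpy as np

THRESHOLD = 0.3  # CLIP similarity cutoff from the workflow above


def filter_by_similarity(
    image_emb: np.ndarray, text_emb: np.ndarray, threshold: float = THRESHOLD
) -> np.ndarray:
    """Given CLIP image and text embeddings of shape (n, d), return the
    indices of pairs whose cosine similarity exceeds the threshold."""
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    sims = np.sum(image_emb * text_emb, axis=1)  # row-wise cosine similarity
    return np.nonzero(sims > threshold)[0]
```

Keeping the filter as a pure array operation makes it easy to run on whole inference batches at once and to tune the threshold later.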