LAION-AI / project-menu

Projects at LAION
MIT License

Acquisition of billion sized image-text-pairs dataset #1

Closed by rvencu 2 years ago

rvencu commented 2 years ago

Continue the workflow that transforms Common Crawl data into a billion-scale dataset of image-text pairs. The dataset currently stands at about 1.2 billion pairs. The basic workflow steps are:

  1. parse the Common Crawl WAT files and retain candidate URL-text pairs
  2. download the images at those URLs, filter out unwanted items, and resize the remaining ones to 224x224 with a center crop
  3. send batches of images and metadata to CLIP inference to compute similarity scores, then keep only pairs with a score > 0.3
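
The three steps above could be sketched roughly as follows. This is a minimal, standard-library-only illustration, not the project's actual code. The function names (`candidate_pairs`, `center_crop_box`, `keep_pair`) are hypothetical, the WAT field layout (`Envelope` / `Payload-Metadata` / `HTTP-Response-Metadata` / `HTML-Metadata` / `Links`, with image links marked `IMG@/src`) is assumed, and the embeddings are assumed to come from a separate CLIP inference step that is not shown. Only the 0.3 threshold and the center crop come from the description above.

```python
import json
import math

CLIP_THRESHOLD = 0.3  # score cut-off quoted in the issue


def candidate_pairs(wat_record_json):
    """Step 1: yield (url, alt_text) pairs from one WAT record's JSON payload."""
    record = json.loads(wat_record_json)
    # Assumed nesting of the WAT metadata; missing keys simply yield no links.
    links = (
        record.get("Envelope", {})
        .get("Payload-Metadata", {})
        .get("HTTP-Response-Metadata", {})
        .get("HTML-Metadata", {})
        .get("Links", [])
    )
    for link in links:
        alt = (link.get("alt") or "").strip()
        # Keep only image links that have both a URL and non-empty alt text.
        if link.get("path") == "IMG@/src" and link.get("url") and alt:
            yield link["url"], alt


def center_crop_box(width, height):
    """Step 2: (left, top, right, bottom) of the largest centered square."""
    side = min(width, height)
    left = (width - side) // 2
    top = (height - side) // 2
    return left, top, left + side, top + side


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def keep_pair(image_embedding, text_embedding, threshold=CLIP_THRESHOLD):
    """Step 3: keep a pair only if its similarity score exceeds the threshold."""
    return cosine_similarity(image_embedding, text_embedding) > threshold
```

In a real pipeline the box from `center_crop_box` would be passed to an image library (for example Pillow's `Image.crop` followed by `Image.resize((224, 224))`), and the two embeddings given to `keep_pair` would be the CLIP image and text vectors for the same pair.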

rom1504 commented 2 years ago

done and released