ICLR2024 Spotlight: curation/training code, metadata, distribution and pre-trained models for MetaCLIP; CVPR 2024: MoDE: CLIP Data Experts via Clustering
Thanks for your interest. We use our internal NSFW filters and dedup system. You may consider open-source solutions such as the ones from DataComp (be aware that they rely on OpenAI CLIP, so the result may not be truly from scratch).
Dear authors,
First of all, thanks for this very interesting paper and code release.
I am working on building a small dataset with your pipeline (from CommonCrawl using queries) and came across the following questions:
Any pointers on what your process looks like would help a lot in reproducing your pipeline.
Thanks,