facebookresearch / MetaCLIP

ICLR2024 Spotlight: curation/training code, metadata, distribution and pre-trained models for MetaCLIP; CVPR 2024: MoDE: CLIP Data Experts via Clustering

Questions about curating from scratch #9

Open simon-ging opened 1 year ago

simon-ging commented 1 year ago

Dear authors,

First of all, thanks for this very interesting paper and code release.

I am working on building a small dataset with your pipeline (from CommonCrawl using queries) and came across the following questions:

  1. How do you do NSFW filtering?
  2. How do you deduplicate?

Any pointers about what your process looks like would help a lot in reproducing your pipeline.

Thanks,

howardhsu commented 1 year ago

Thanks for your interest. We use our internal NSFW filters and dedup system. You may consider open-source solutions such as the ones from DataComp (be aware that they use OpenAI CLIP, so the result may not be strictly "from scratch").
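
For a small-scale reproduction, exact deduplication of (image URL, alt-text) pairs can be sketched as below. This is only an illustration and is not the internal dedup system mentioned above: the helper names `normalize_text`, `pair_key`, and `dedup_pairs` are hypothetical, and the sketch catches only exact repeats after light text normalization, not near-duplicate images.

```python
import hashlib
from typing import Iterable, Iterator, Tuple


def normalize_text(text: str) -> str:
    # Lowercase and collapse whitespace so trivial variants hash identically.
    return " ".join(text.lower().split())


def pair_key(url: str, text: str) -> str:
    # Hypothetical key: hash the image URL together with the normalized alt-text.
    payload = url.strip() + "\t" + normalize_text(text)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def dedup_pairs(pairs: Iterable[Tuple[str, str]]) -> Iterator[Tuple[str, str]]:
    # Keep the first occurrence of each (url, text) key and drop later repeats.
    seen = set()
    for url, text in pairs:
        key = pair_key(url, text)
        if key in seen:
            continue
        seen.add(key)
        yield url, text
```

Because the set of seen keys lives in memory, this only suits small datasets; a larger crawl would need a sharded or on-disk index, and image-level near-duplicate detection would need a separate step.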