facebookresearch / MetaCLIP

ICLR 2024 Spotlight: curation/training code, metadata, distribution and pre-trained models for MetaCLIP; CVPR 2024: MoDE: CLIP Data Experts via Clustering

About DataComp-1B experiments #34

Closed lyakaap closed 9 months ago

lyakaap commented 9 months ago

Really nice work, the MetaCLIP paper and this codebase are very insightful :)

I have a question about "A.1 ADDITIONAL RESULTS" in the paper, regarding the comparison between CommonPool and MetaCLIP and the discussion of CommonPool's bias.

I thought that the poor accuracy of CommonPool-1B might be due to face blurring. This is because URLs given as relative paths (samples starting with "data:image") account for only a few percent of the total (please correct me if I'm wrong), so they could not affect accuracy much.

What do you think about this?

howardhsu commented 9 months ago

Thanks for bringing up these details.

Overall, human supervision matters (such as how much money was spent building a website).

(All data are face-blurred by the same technique, and we expect only a 0.2-0.3 point drop on ImageNet from it, so we exclude this confounding factor.)

By relative URLs, we mean that ./1.jpg in an <img> tag, together with the HTML page's target URI http://www.abc.com, should point to http://www.abc.com/1.jpg. Roughly 40% of URLs are of this type, and unfortunately the LAION/DataComp parser had a startswith("http") filter at that time (it has since been fixed). This reduces not just the scale of the pool but also drops <img> tags from big websites that frequently use this type of URL (to save DNS load, which is good design practice and suggests better supervision of content), and it implicitly increases the chance of keeping <img> tags from "malicious" websites that reference other websites' images without hosting the actual images. We don't know exactly how much this hurts performance yet, since our main goal is to mitigate filter biases, not to study filters.
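A minimal sketch of the difference (hypothetical URLs, not the actual LAION/DataComp parser code):

```python
from urllib.parse import urljoin

# Hypothetical page URI and <img> src values illustrating the filter described above.
page_uri = "http://www.abc.com"
img_srcs = [
    "http://cdn.example.com/a.jpg",  # absolute: passes a startswith("http") filter
    "./1.jpg",                       # relative: silently dropped by that filter
    "/images/2.jpg",                 # root-relative: also dropped
]

# Old behavior: only absolute http(s) URLs survive.
kept = [u for u in img_srcs if u.startswith("http")]
print(kept)  # ['http://cdn.example.com/a.jpg']

# Resolving against the page URI recovers the relative cases as well.
print([urljoin(page_uri, u) for u in img_srcs])
# ['http://cdn.example.com/a.jpg', 'http://www.abc.com/1.jpg', 'http://www.abc.com/images/2.jpg']
```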

data:image URLs are embedded images, which are a minor case (as you mentioned) and less likely to contribute to the performance difference.
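For illustration, a toy data:image URI and how its payload decodes (the example bytes here are just the PNG magic header, not a full image):

```python
import base64

# A toy embedded image: the bytes live inside the src attribute itself,
# so there is no URL to resolve and no HTTP fetch involved.
src = "data:image/png;base64,iVBORw0KGgo="
header, payload = src.split(",", 1)
assert header == "data:image/png;base64"
img_bytes = base64.b64decode(payload)
print(img_bytes)  # b'\x89PNG\r\n\x1a\n' -- the PNG magic bytes in this toy case
```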

In summary, again, all the subtle details, including CLIP curation and our algorithm, point to better human supervision: not just whether the alt text matches wiki/WordNet-quality spelling, but also whether images are taken with more dedicated human effort, whether texts are written with more time or carefully chosen terms, and even whether URLs are designed to save DNS traffic...

lyakaap commented 9 months ago

> Roughly 40% of URLs are of this type, and unfortunately the LAION/DataComp parser had a startswith("http") filter at that time (it has since been fixed).

I didn't know the curation code was implemented that way at the time...! Looking at the current implementation of cc2dataset, it seems all relative paths are now resolved to absolute paths, so the difference between the metaclip parser and the cc2dataset parser appears only for URLs starting with "data:image", which account for only a small fraction (ref: https://github.com/rom1504/cc2dataset/blob/28f4883fd12f62bbd2ae8931442bb0988eb728a3/cc2dataset/main.py#L108).
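Putting both points together, the fixed behavior is roughly this (my own approximation, not the actual cc2dataset code):

```python
from urllib.parse import urljoin

def resolve_src(page_uri: str, src: str):
    """Approximate post-fix parser behavior: resolve everything except data URIs."""
    if src.startswith("data:"):
        return None  # embedded image: nothing to resolve, handled separately (or skipped)
    return urljoin(page_uri, src)  # absolute URLs pass through, relative ones resolve

assert resolve_src("http://www.abc.com", "./1.jpg") == "http://www.abc.com/1.jpg"
assert resolve_src("http://www.abc.com", "data:image/png;base64,iVBORw0KGgo=") is None
```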

Thank you very much for the detailed explanation. Everything is clear to me now.