rom1504 opened this issue 2 years ago
Pretty extensive doc. I think it would be good to identify the general pipelines and think about how each method fits into them. There are two big ways.
A second interesting axis would be quality: what datasets is it evaluated on, and is it targeted at low-level feature dedup ("exact" dedup) or semantic features ("near" dedup)? See the sketch below for the distinction.
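For concreteness, a minimal sketch of the low-level end - grouping files by a hash of their raw bytes. The paths and helper here are illustrative only, not tied to any particular tool; a perceptual hash (e.g. pHash) would be the slightly fuzzier low-level variant that also catches re-encodes:

```python
# "Exact" dedup sketch: bucket files by a hash of their raw bytes.
import hashlib
from pathlib import Path

def exact_duplicate_groups(paths):
    buckets = {}
    for p in paths:
        digest = hashlib.md5(Path(p).read_bytes()).hexdigest()
        buckets.setdefault(digest, []).append(p)
    # Keep only buckets with more than one file, i.e. byte-identical copies.
    return [group for group in buckets.values() if len(group) > 1]
```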
Good points about the two general pipelines. What do you mean by "total order"?
I agree with your point about quality and the trade-off between low-level (classic hashes) and semantic (embeddings or hashed embeddings) approaches. This depends on our overall goal for dedup - I was thinking that we care more about "exact" matches, because similar examples in the training set are still distinct, and training examples that are "near" duplicates of evaluation-set examples can actually be the most useful and relevant ones. A brute-force sketch of the train/eval check I have in mind is below.
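To make that concrete - flag eval examples whose nearest training neighbor is above some similarity threshold. The `train_emb` / `eval_emb` arrays and the 0.95 threshold are assumptions (whatever embedding model we pick), and this is brute force, not a real pipeline:

```python
# Sketch: find eval-set examples with a "near" duplicate in the training set.
# Brute-force O(n_train * n_eval); a LAION-scale run would need an ANN index.
import numpy as np

def leaked_eval_indices(train_emb, eval_emb, threshold=0.95):
    # L2-normalize so the dot product is cosine similarity.
    T = train_emb / np.linalg.norm(train_emb, axis=1, keepdims=True)
    E = eval_emb / np.linalg.norm(eval_emb, axis=1, keepdims=True)
    sims = E @ T.T              # (n_eval, n_train) cosine similarities
    best = sims.max(axis=1)     # closest training example per eval example
    return np.nonzero(best > threshold)[0]  # indices of suspect eval examples
```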
However, we haven't defined "near" well at all... I can come up with a set of concrete, illustrative examples from the datasets. There are two types of datasets - synthetic (automatically augmented images: resize, crop, color adjustment, etc.) and real (manually found very-near-duplicate images - e.g. a "burst" shot on an iPhone produces distinct but super similar images). A sketch of the synthetic side follows.
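For the synthetic side, something like this is what I mean - derive perturbed copies of a seed image that byte hashing will miss but near-dedup should catch (the specific transforms and factors are just illustrative):

```python
# Sketch: generate "near" duplicates of one image via standard augmentations.
from PIL import Image, ImageEnhance

def synthetic_near_duplicates(path):
    img = Image.open(path).convert("RGB")
    w, h = img.size
    return {
        "resized": img.resize((w // 2, h // 2)),
        "cropped": img.crop((w // 10, h // 10, w - w // 10, h - h // 10)),
        "desaturated": ImageEnhance.Color(img).enhance(0.5),      # weaker colors
        "brightened": ImageEnhance.Brightness(img).enhance(1.3),  # slightly brighter
    }
```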
LAION-400M fastdup observations https://docs.google.com/document/d/1XlYbMxZH7aoOf9RJI00TRuebfVXOxmygpHVGB5fD0is/edit?usp=sharing by @RyanMarten
https://docs.google.com/document/d/1kYLhFbICftToahC9HEAiNgqscuOqoFvslKGIG3cyjbw/edit?usp=drivesdk from rmarten