LAION-AI / laion-dedup


Docs #3

Open rom1504 opened 2 years ago

rom1504 commented 2 years ago

https://docs.google.com/document/d/1kYLhFbICftToahC9HEAiNgqscuOqoFvslKGIG3cyjbw/edit?usp=drivesdk from rmarten

rom1504 commented 2 years ago

Pretty extensive doc. I think it would be good to identify the general pipelines and think about how each method can work. There are 2 big ways:

  1. Sorting based: that works when you have a total order on the image representation. Think hashes such that h(e1) = h(e2) <=> e1 = e2 and h(x) is a sequence of bytes.
  2. Pair comparison: think embeddings that have a meaningful sim(e1, e2) function. Those can be used for clustering or knn search.

Whether it's 1 or 2 will impact the efficiency of the dedup.
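To make the two pipelines concrete, here is a minimal sketch in Python. It is only an illustration under assumptions, not what any tool in the linked doc does: SHA-256 over raw bytes stands in for the hash h (equal hashes imply equal inputs only up to collisions, which are negligible in practice), cosine similarity stands in for sim(e1, e2), and the function names and the 0.95 threshold are made up for the example.

```python
import hashlib
import math
from itertools import combinations


def dedup_by_sorting(items):
    """Sorting-based dedup over (name, raw_bytes) pairs.

    Byte strings have a total order, so after sorting by digest all
    exact duplicates are adjacent and one linear pass removes them.
    Cost is dominated by the sort: O(n log n).
    """
    keyed = sorted((hashlib.sha256(data).digest(), name) for name, data in items)
    kept = []
    previous = None
    for digest, name in keyed:
        if digest != previous:  # first item of a run of equal digests
            kept.append(name)
        previous = digest
    return kept


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def duplicate_pairs(embeddings, threshold=0.95):
    """Pair-comparison dedup over a {name: vector} mapping.

    There is no useful ordering on embeddings, so this brute-force
    version compares every pair: O(n^2). A knn index or clustering
    would replace this loop at scale.
    """
    return [
        (n1, n2)
        for (n1, e1), (n2, e2) in combinations(embeddings.items(), 2)
        if cosine(e1, e2) >= threshold
    ]


files = [("a", b"cat"), ("b", b"dog"), ("c", b"cat")]
unique_names = dedup_by_sorting(files)  # "c" is dropped as a copy of "a"

vectors = {"x": [1.0, 0.0], "y": [0.99, 0.05], "z": [0.0, 1.0]}
near_pairs = duplicate_pairs(vectors)  # only "x" and "y" are close
```

The efficiency gap is visible in the structure: the first function needs one sort and one pass, while the second needs some form of neighbor search to avoid comparing all pairs.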

A second interesting axis would be quality: what datasets is it evaluated on, and is it targeted at low-level feature dedup ("exact" dedup) or semantic features ("near" dedup)?

RyanMarten commented 2 years ago

Good points about the two general pipelines. What do you mean by "total order"?

I agree with your point about quality and the consideration between low level (classic hashes) and semantic (embeddings or hashed embeddings). This depends on our overall goal for dedup. I was thinking that we care more about "exact" matches, because similar examples in the training set are still distinct and may actually be the most useful, relevant examples for "near" duplicates in the evaluation sets.

However, we haven't defined "near" well at all... I can come up with a set of concrete and illustrative examples from the datasets. There are two types of datasets: synthetic (automatically augmented images with resize, crop, color adjustment, etc.) and real (manually found very near duplicate images, e.g. a "burst" shot on an iPhone takes distinct but super similar images).
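As one possible concrete example of the synthetic case, the sketch below applies a brightness shift to a tiny grayscale image and compares an exact hash with a simplified average hash. Everything here is an assumption for illustration: the 4x4 pixel grid, the helper names, and the choice of average hashing as the "classic" perceptual hash are not taken from the datasets or the doc.

```python
import hashlib


def exact_hash(pixels):
    """SHA-256 of the raw pixel bytes: changes if any single value changes."""
    return hashlib.sha256(bytes(v for row in pixels for v in row)).hexdigest()


def average_hash(pixels):
    """Simplified average hash: one bit per pixel, set if above the mean."""
    flat = [v for row in pixels for v in row]
    mean = sum(flat) / len(flat)
    return tuple(v > mean for v in flat)


def brighten(pixels, delta):
    """Synthetic augmentation: add a constant to every pixel, clipped at 255."""
    return [[min(255, v + delta) for v in row] for row in pixels]


# Hypothetical 4x4 grayscale image with values in 0..255.
original = [
    [10, 200, 30, 220],
    [15, 210, 25, 230],
    [12, 205, 35, 215],
    [18, 190, 28, 225],
]
augmented = brighten(original, 10)

same_exact = exact_hash(original) == exact_hash(augmented)
same_perceptual = average_hash(original) == average_hash(augmented)
print(same_exact, same_perceptual)
```

With this input the exact hashes differ while the average hashes match, which is one way to pin down a "near" duplicate: the pair is distinct at the byte level but identical under a coarser representation.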

rom1504 commented 2 years ago

https://en.m.wikipedia.org/wiki/Total_order

rom1504 commented 2 years ago

LAION-400M fastdup observations: https://docs.google.com/document/d/1XlYbMxZH7aoOf9RJI00TRuebfVXOxmygpHVGB5fD0is/edit?usp=sharing by @RyanMarten