BAAI-DCAI / Bunny

A family of lightweight multimodal models.
Apache License 2.0

Have you tried dbscan? #20

Closed carry-xz closed 5 months ago

carry-xz commented 5 months ago

Great job! Have you tried DBSCAN? Which do you think works better here, k-means or DBSCAN? I think this could be encapsulated into a general-purpose data processing program.

Isaachhh commented 5 months ago

We haven't tried DBSCAN.

To prune LAION-2B, we perform semantic de-duplication based on the cosine similarity of image embeddings. We think the idea behind our method is similar to DBSCAN to some degree. We use k-means to reduce computing time, on the assumption that two embeddings belonging to different clusters are unlikely to be semantic duplicates, so pairwise comparisons only need to be made within each cluster.
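As a rough illustration of the idea described above (cluster first, then compare embeddings only within a cluster), here is a minimal pure-Python sketch. It is not the code used for Bunny: the function names, the greedy keep-the-first-item policy, the tiny spherical k-means, and the `threshold=0.95` value are all assumptions made for the example.

```python
import math
import random


def normalize(v):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def kmeans_assign(embs, k, iters=10, seed=0):
    """Tiny spherical k-means; returns one cluster index per unit vector."""
    rng = random.Random(seed)
    centroids = rng.sample(embs, k)
    labels = [0] * len(embs)
    for _ in range(iters):
        # Assign each embedding to the centroid with the highest cosine similarity.
        labels = [max(range(k), key=lambda c: dot(e, centroids[c])) for e in embs]
        for c in range(k):
            members = [e for e, lab in zip(embs, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                mean = [sum(col) / len(members) for col in zip(*members)]
                centroids[c] = normalize(mean)
    return labels


def dedup(embeddings, k=2, threshold=0.95):
    """Return sorted indices of kept items.

    Items are compared only with already-kept items of the same cluster
    (hypothetical greedy policy: the first item seen is the one kept).
    """
    embs = [normalize(e) for e in embeddings]
    labels = kmeans_assign(embs, k)
    kept = []
    for c in range(k):
        kept_in_cluster = []
        for i in (i for i, lab in enumerate(labels) if lab == c):
            if all(dot(embs[i], embs[j]) < threshold for j in kept_in_cluster):
                kept_in_cluster.append(i)
        kept.extend(kept_in_cluster)
    return sorted(kept)


# Toy data: items 0/1 point almost the same way, as do items 2/3.
toy = [[1.0, 0.0], [0.999, 0.02], [0.0, 1.0], [0.02, 0.999]]
print(dedup(toy, k=2, threshold=0.95))
```

The trade-off is the one stated above: duplicates that k-means happens to place in different clusters are never compared, so they survive. In exchange, the number of pairwise comparisons drops from all pairs in the dataset to pairs within each cluster.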

Due to the huge scale of LAION-2B, we store it as 232320 tar files, which makes it hard to encapsulate the code into a plug-and-play program. We do provide our code, though.

For more information and the code, please refer to here.

Isaachhh commented 5 months ago

Closing the issue for now since there has been no further discussion. Feel free to reopen it if you have any other questions.