Closed carry-xz closed 5 months ago
We haven't tried DBSCAN.
To prune LAION-2B, we apply semantic de-duplication based on the cosine similarity of image embeddings. We think the idea behind our method is similar to DBSCAN to some degree. We use k-means to reduce computation time, on the assumption that two embeddings belonging to different clusters are unlikely to be semantic duplicates.
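A minimal sketch of the idea described above, written in plain Python so it is self-contained: cluster the normalized embeddings with k-means, then compare cosine similarity only between embeddings in the same cluster. The function names (`normalize`, `kmeans`, `dedup`), the number of clusters, and the `0.95` threshold are illustrative assumptions, not the code or values used for LAION-2B.

```python
import math
import random

def normalize(v):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n > 0 else list(v)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def kmeans(points, k, iters=10, seed=0):
    """Tiny k-means on unit vectors (cosine assignment); returns one label per point."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign each point to the centroid with the highest cosine similarity.
        labels = [max(range(k), key=lambda c: dot(p, centroids[c])) for p in points]
        # Recompute each centroid as the normalized mean of its members.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                mean = [sum(col) / len(members) for col in zip(*members)]
                centroids[c] = normalize(mean)
    return labels

def dedup(embeddings, k=2, threshold=0.95):
    """Return indices to keep; duplicates are only searched for within a cluster."""
    points = [normalize(e) for e in embeddings]
    labels = kmeans(points, k)
    keep = []
    for c in range(k):
        kept_in_cluster = []
        for i, label in enumerate(labels):
            if label != c:
                continue
            # Keep i only if it is not too similar to anything already kept.
            if all(dot(points[i], points[j]) < threshold for j in kept_in_cluster):
                kept_in_cluster.append(i)
        keep.extend(kept_in_cluster)
    return sorted(keep)

# Toy data: items 0/1 are near-duplicates, and so are items 2/3.
embeddings = [[1.0, 0.0], [0.999, 0.01], [0.0, 1.0], [0.01, 0.999]]
kept = dedup(embeddings, k=2, threshold=0.95)
print(kept)  # [0, 2]
```

Restricting comparisons to one cluster at a time is what saves computation: the pairwise cost drops from all pairs in the dataset to all pairs within each cluster, at the risk of missing duplicates that fall on opposite sides of a cluster boundary.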
Due to the huge scale of LAION-2B, we store it as 232320 tars, so it is hard to encapsulate the code into a plug-and-play program. We supply our code instead.
For more information and the code, please refer to here.
We will close the issue for now if there is no further discussion. Feel free to reopen it if you have any other questions.
Great job! Have you tried DBSCAN? Which do you think works better, k-means or DBSCAN? I think this could be encapsulated into a general-purpose data processing program.