Question about input data dimensionality

FalsoMoralista commented 3 months ago

I have been reading through the paper and found it very insightful in many ways. Thank you for making this available.

I apologize if I've missed something, and please correct me if I'm wrong. From what I understand, the first level of hierarchical k-means clustering is performed on 1024-d features, producing e.g. 10M clusters. Are the raw features fed directly into k-means, or is some dimensionality reduction applied before clustering? I couldn't find much information about that.

Thanks in advance.

huyvvo commented 2 months ago

Hello, we use the raw features in our pipeline without applying dimensionality reduction. We tried it but found that it leads to unsatisfactory results, probably due to reduced feature quality. Running on high-dimensional raw features of course makes the first level computationally heavy, so for large datasets we perform the first-level k-means in two steps. For example, on our 743M dataset, we first divide all data points into 100k clusters, then run k-means again, separately and in parallel, within each of those clusters to divide it into 100 smaller ones. This results in 10M clusters in total.
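
For readers who want to see the shape of the two-step scheme described above, here is a minimal NumPy sketch. It is not the repository's implementation: the function names `kmeans` and `two_step_kmeans`, the plain Lloyd iterations, the random initialization, and the toy sizes are all assumptions made for illustration. The fine step runs sequentially here, whereas the comment above describes running it in parallel.

```python
import numpy as np


def kmeans(x, k, n_iter=10, seed=0):
    """Plain Lloyd's k-means (illustrative helper). Returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    k = min(k, len(x))  # cannot have more clusters than points
    centroids = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every centroid.
        dist = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        labels = dist.argmin(axis=1)
        for j in range(k):
            members = x[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    # Final assignment against the updated centroids.
    dist = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return centroids, dist.argmin(axis=1)


def two_step_kmeans(x, k_coarse, k_fine, seed=0):
    """Coarse k-means, then an independent k-means inside each coarse cluster."""
    _, coarse_labels = kmeans(x, k_coarse, seed=seed)
    labels = np.empty(len(x), dtype=np.int64)
    all_centroids = []
    offset = 0  # running index so fine cluster ids are globally unique
    for c in range(k_coarse):
        idx = np.flatnonzero(coarse_labels == c)
        if len(idx) == 0:
            continue
        # Each coarse cluster is independent, so this loop could be parallelized.
        sub_centroids, sub_labels = kmeans(x[idx], k_fine, seed=seed + 1 + c)
        labels[idx] = offset + sub_labels
        all_centroids.append(sub_centroids)
        offset += len(sub_centroids)
    return np.concatenate(all_centroids), labels


# Toy stand-in for the raw features (the thread discusses 1024-d features).
features = np.random.default_rng(0).normal(size=(2000, 16))
centroids, labels = two_step_kmeans(features, k_coarse=5, k_fine=4)
```

With the numbers quoted above, the same structure would give 100k coarse clusters times 100 fine clusters each, i.e. 10M clusters. At that scale a dense distance matrix like the one in this sketch would not fit in memory, so a real pipeline would need a chunked or distributed k-means.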

FalsoMoralista commented 1 month ago

Thank you!

facebookresearch / ssl-data-curation

Question about input data dimensionality #9