LeslieTrue / CPP

This is the official implementation for Image Clustering via the Principle of Rate Reduction in the Age of Pretrained Models.

Nice work! A few clarifications to reproduce results on ImageNet #5

Open · black0017 opened this issue 5 months ago

black0017 commented 5 months ago

Hello @LeslieTrue, and very nice work! Congrats on the ICLR acceptance!

I am really interested in reproducing your results on ImageNet. To this end, I would like to ask 2 things:

1. Hyperparameters on the paper and script args

After trying to match the code to the paper and supplementary material (Table 7), I end up with the following hyperparameters for ImageNet:

 "hidden_dim": 2048, 
 "z_dim": 1024, 
 "n_clusters": 1000, 
 "epo": 20, 
 "bs": 1024, 
 "lr": 0.0001, 
 "lr_c": 0.0001, 
 "momo": 0.9, 
 "pigam": 0.05, 
 "wd1": 0.0001, 
 "wd2": 0.005, 
 "eps": 0.1,   #  used in MLCLoss
 "pieta": 0.12,  # sinkhon knop for imagenet
 "piiter": 5, 
 "seed": 42, 
 "warmup": 2000,  # is this correct? Is this what you mean by 1-2 epochs on imagenet I guess?

Could you please confirm, and let me know if there is any other hyperparameter I need to specify that may not be in the paper? Saving the exact args you used to get the state-of-the-art results to a .json file would also help; see the sketch below.
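For example, something as simple as the following would do (a minimal sketch, assuming parser is the argparse parser from main_efficient.py):

import json

# hypothetical: dump the exact run configuration for reproducibility
args = parser.parse_args()
with open("imagenet_args.json", "w") as f:
    json.dump(vars(args), f, indent=2)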

2. Evaluation after training

While training the MLPs on top of the CLIP features, there is an intermediate evaluation which, to my understanding, is computed on a mini-batch. Thus, the provided script main_efficient.py does not have any evaluation that reproduces the NMI and ACC reported in the paper.

How do I do that? So far, my best guess is that I need to compute z, logits = model(x) for the whole dataset, store the results, and afterward run:

self_coeff = (logits @ logits.T).abs().unsqueeze(0)  # dense n x n affinity matrix
Pi = sink_layer(self_coeff)[0]                       # Sinkhorn-Knopp normalization
Pi = Pi * Pi.shape[-1]                               # rescale by the number of samples
Pi = Pi[0]                                           # drop the batch dimension
Pi_np = Pi.detach().cpu().numpy()
acc_lst, nmi_lst, _, _, pred_lst = spectral_clustering_metrics(Pi_np, n_clusters, y_np)
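For the first step (computing z and logits over the whole dataset), I imagine something like the following rough sketch, where model and loader are placeholders for the trained MLP head and a DataLoader over the precomputed CLIP features:

import torch

# hypothetical: one pass over the full dataset, stacking the outputs
zs, logit_list, ys = [], [], []
model.eval()
with torch.no_grad():
    for x, y in loader:
        z_b, logit_b = model(x.cuda())
        zs.append(z_b.cpu())
        logit_list.append(logit_b.cpu())
        ys.append(y)
z = torch.cat(zs)
logits = torch.cat(logit_list)
y_np = torch.cat(ys).numpy()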

Is this how you actually evaluated? I guess if the test set has $n$ samples, that means we need to compute the eigenvectors of an $n \times n$ matrix, which, as far as I can recall, is $O(n^3)$.

Your help would be highly appreciated and will help us report your method in other datasets!

Thanks in advance, and have a great day!

Nikolas

black0017 commented 5 months ago

Even with the 50K validation samples, I ran out of RAM (~700 GB) when trying to compute the eigenvectors (for the reasons explained above; the dense 50,000 × 50,000 affinity matrix alone takes ~20 GB in float64, before any eigensolver workspace).

Something seems to be missing from the evaluation. Could you please take a look, @LeslieTrue?

LeslieTrue commented 5 months ago

Thanks for reaching out!

Training parameters:

python main_efficient.py --data_dir imagenet-feature.pt --bs 1024 --desc train_CPP_imagenet --lr 1e-4 --lr_c 1e-4 --pieta 0.12 --epo 20 --hidden_dim 2048 --z_dim 1024 --warmup 2000 --n_clusters 1000

This script should reproduce the results. Since the training is unsupervised, we directly evaluate the model's performance on 15,000 random samples from the ImageNet training set. To adjust the sparsity of the affinity matrix (as the number of samples increases), you may try tuning pieta down at evaluation time, e.g. to 0.09. Note that in issue #2 we clarified that a shared z and logit is beneficial for large-scale training; hence, at evaluation time, the affinity matrix can be written as self_coeff = (z @ z.T).abs().unsqueeze(0) accordingly.
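Concretely, the evaluation looks roughly like this (an illustrative sketch; features, labels, and num_train are placeholder names for the precomputed CLIP features and ground-truth labels):

import torch

# illustrative sketch: evaluate on 15,000 random training samples
idx = torch.randperm(num_train)[:15000]
model.eval()
with torch.no_grad():
    z, _ = model(features[idx].cuda())          # shared z/logit head, see issue #2
self_coeff = (z @ z.T).abs().unsqueeze(0)       # affinity from z instead of logits
Pi = sink_layer(self_coeff)[0]                  # pieta possibly lowered to ~0.09 here
Pi = Pi * Pi.shape[-1]
Pi_np = Pi[0].detach().cpu().numpy()
acc_lst, nmi_lst, _, _, pred_lst = spectral_clustering_metrics(Pi_np, 1000, labels[idx].numpy())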

Time complexity

Unfortunately, yes. The method is not that efficient when computing large n-by-n matrices. A possible workaround for large datasets would be: first, cluster a subset of samples using the affinity matrix; then, assign the remaining samples to the existing clusters, as in the sketch below.
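For instance (my rough sketch, not code from this repo; idx/rest_idx split the dataset, pred_lst comes from spectral clustering on the subset, and every cluster is assumed non-empty), one could average the subset's z features per cluster and assign each remaining sample to the nearest centroid by cosine similarity:

import torch
import torch.nn.functional as F

# sketch: nearest-centroid assignment of the remaining samples in z-space
sub_pred = torch.as_tensor(pred_lst[-1])        # cluster labels of the subset
z_sub = F.normalize(z[idx], dim=1)              # z features of the clustered subset
centroids = torch.stack([z_sub[sub_pred == k].mean(0) for k in range(n_clusters)])
centroids = F.normalize(centroids, dim=1)

z_rest = F.normalize(z[rest_idx], dim=1)        # z features of the remaining samples
rest_pred = (z_rest @ centroids.T).argmax(dim=1)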

Hope this works for you.

Tianzhe

black0017 commented 5 months ago

Hello,

OK, I see. I will try to train and evaluate the model as you suggested and let you know whether I get the results you report. As a disclaimer, I am the author of one of the baselines (TEMI) and have been heavily invested in this topic for the last ~1.5 years. My goal is to advance the field and find new methods that scale; I am not trying to criticize your work (quite the opposite: I really enjoy reading new papers on image clustering). With that in mind:

Since it's unsupervised training, we directly evaluate the model performance on 15,000 random samples from ImageNet training set.

However, this is not how state-of-the-art image clustering methods are evaluated, and it is misleading to report results against previous methods (Table 2) when those methods (should) use the unseen validation images of each dataset.


This comparison is thus unfair, and that needs to be stated clearly to the research community. The limitations of your approach are not highlighted in the paper:

[screenshot: statement from the paper about CPP and large-scale datasets]

As a result, the statement above is incorrect, and I strongly suggest revising the camera-ready version of your manuscript and the arXiv preprint. You make several statements about CPP being a good candidate for large-scale datasets, which was the most interesting point, but this does not hold! I would also strongly suggest clarifying in the appendix how the experimental results were produced, in particular the evaluation protocol.

The authors of SCAN (https://github.com/wvangansbeke/Unsupervised-Classification) have strongly emphasized the importance of using unseen (val/test) images for evaluation. The sensitivity to hyperparameters like pieta is also hidden from the paper (a limitation!). Essentially, how would someone choose this parameter for an arbitrary dataset?

Spectral clustering

Based on the above, my new question is: how were you able to run spectral clustering on ImageNet, i.e., compute the eigenvectors of an $n \times n$ matrix for $n \approx 10^6$ samples?


In any case, I will run the evaluation as you suggested in the next few days and let you know whether I get the same results.

Thanks for getting back to me!

Nikolas