SarahRastegar / SelEx

Official Repository of "SelEx: Self-Expertise in Fine-Grained Generalized Category Discovery" (ECCV 2024)

Some questions about pytorch_kmeans and reproducing results on Scars and Aircraft #7

Open zhenqi-he opened 1 week ago

zhenqi-he commented 1 week ago

Hi, thank you for sharing such an innovative project! I've encountered some challenges while attempting to reproduce the results and would appreciate any guidance. Specifically, I've run into issues related to KMeans, kmeans_pytorch, and hyperparameter settings, and haven’t found clear solutions in the documentation or existing issues.

  1. KMeans & kmeans_pytorch: In this comment, you mentioned that PyTorch KMeans provides faster and more effective clustering for generating pseudo-labels. However, I noticed that kmeans_pytorch (imported as kmeans) is only called in test_kmeans(), which is only used to evaluate performance on the unlabeled train data at the last epoch (the default train_report_interval is 200; the first link below points to the code I mean); it is not called in SemiSupKMeans (imported from methods.clustering.faster_mix_k_means_pytorch.K_Means). Could you clarify the intended use of kmeans_pytorch in this context? (A sketch contrasting the two clustering calls follows this list.) https://github.com/SarahRastegar/SelEx/blob/7b5fdc0659d185de7c5c45b653fec162a8402b6b/methods/contrastive_training/contrastive_training.py#L621 I also noticed that during training the accuracy at each epoch is computed from the results of SemiSupKMeans(), while at the last epoch test_kmeans() is used for the label assignment. Do the results in the paper use the best performance during training, or the performance at the last epoch? https://github.com/SarahRastegar/SelEx/blob/7b5fdc0659d185de7c5c45b653fec162a8402b6b/methods/contrastive_training/contrastive_training.py#L552

  2. Hyperparameter settings and performance variations: I attempted to reproduce the experiments for SCars, Aircraft, and CUB with the hyperparameters in the README (e.g. unsupervised_smoothing = 1 for CUB and 0.5 for SCars and Aircraft), but I only got relatively low performance on SCars (around All: 46) and Aircraft (around All: 55). I noticed you mentioned in https://github.com/SarahRastegar/SelEx/issues/6#issuecomment-2425996202 that there may be some variation when using different GPU architectures. I am using an NVIDIA GeForce RTX 3090 and an NVIDIA L40, and I achieved consistent performance on both. Do you have any suggestions on how I might address this issue? Is it because I am using different GPUs?
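For reference, here is a minimal sketch contrasting the two clustering paths discussed in point 1, with placeholder features and cluster counts. The K_Means constructor arguments and the fit_mix signature are assumed from the GCD codebase this repository builds on, so treat this as illustrative rather than the repository's exact code:

```python
import torch
from kmeans_pytorch import kmeans                                   # unsupervised
from methods.clustering.faster_mix_k_means_pytorch import K_Means   # semi-supervised

# Placeholder DINO features: 400 labeled + 600 unlabeled samples, 100 total classes.
all_feats = torch.randn(1000, 768)
l_feats, u_feats = all_feats[:400], all_feats[400:]
l_targets = torch.randint(0, 50, (400,))  # labeled samples cover only known classes

# Path 1: plain kmeans_pytorch, as called inside test_kmeans().
# No label information is used; every point is assigned freely.
preds, centers = kmeans(X=all_feats, num_clusters=100,
                        distance='euclidean', device=torch.device('cpu'))

# Path 2: semi-supervised K-Means, as used for the per-epoch accuracies.
# Labeled features stay anchored to their ground-truth targets; only the
# unlabeled points are re-assigned (fit_mix signature assumed from GCD).
ssk = K_Means(k=100, tolerance=1e-4, max_iterations=200)
ssk.fit_mix(u_feats, l_feats, l_targets)
all_preds = ssk.labels_  # assignments for labeled + unlabeled points
```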

Your guidance on this issue would be greatly appreciated. Thank you again for your hard work, and I look forward to your response!

SarahRastegar commented 1 week ago

Thank you for your interest in our work and the detailed observations!

  1. Initially, I used kmeans_pytorch instead of SemiSupKMeans() for quite some time, and it indeed provided effective clustering; this is why I mentioned it having a positive ripple effect on performance. Eventually, that led us to switch to the better SemiSupKMeans(). In this context, you’re correct: kmeans_pytorch is now only used for selecting the best checkpoint, so while it still influences performance to some degree, it no longer has the same broad effect on the overall model as before. All our reported results are based on the best checkpoint for the generic datasets and the last checkpoint for the fine-grained ones.

  2. Are you using PyTorch 2.x? I’ve also run on an NVIDIA GeForce RTX 3090 for these two datasets, and your results shouldn't be this low, particularly for SCars. If possible, could you also share the breakdown between known and novel classes? Additionally, you might see improvements with a lower smoothing value, around 0.4. Generally, a 3090 should yield better performance than this, so let’s see if these tweaks help.

zhenqi-he commented 1 week ago

Thank you so much for your detailed and prompt reply! I'm still struggling with the SCars dataset and would greatly appreciate any guidance you could provide.

  1. I noticed that the results computed by KMeans (whether kmeans_pytorch or sklearn KMeans) are relatively low compared with the results computed by SemiSupKMeans. For fine-grained datasets, do you use the last checkpoint with KMeans or with SemiSupKMeans to compute the results? E.g., for Aircraft: last checkpoint with SemiSupKMeans: All 0.5433 | Old 0.6206 | New 0.5046; last checkpoint with KMeans: All 0.4455 | Old 0.4190 | New 0.4588. (For how these All/Old/New numbers are computed, see the sketch after the command below.)

  2. I am currently using PyTorch 2.0.0 and running the command below. Could the PyTorch version affect the final results? For the SCars dataset, I consistently fail to achieve satisfactory results, which has been troubling me for a long time; I would greatly appreciate any insights you could provide to help identify potential issues. The result I obtained with this command is: DINOv1: All 0.4860 | Old 0.6887 | New 0.3978

```
python -m methods.contrastive_training.contrastive_training \
    --dataset_name 'scars' \
    --batch_size 128 \
    --grad_from_block 11 \
    --epochs 200 \
    --base_model vit_dino \
    --num_workers 4 \
    --use_ssb_splits 'True' \
    --sup_con_weight 0.35 \
    --weight_decay 5e-5 \
    --contrast_unlabel_only 'False' \
    --transform 'imagenet' \
    --lr 0.1 \
    --eval_funcs 'v1' 'v2' \
    --unsupervised_smoothing 0.5
```
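For context, the eval_funcs 'v1' and 'v2' in this command report Hungarian-matched clustering accuracy, the standard GCD metric behind the All/Old/New numbers above. A condensed, self-contained sketch of the core computation (illustrative, not the repository's exact implementation):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def cluster_acc(y_true, y_pred):
    """Accuracy after optimally matching cluster IDs to ground-truth classes."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    D = max(y_pred.max(), y_true.max()) + 1
    w = np.zeros((D, D), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        w[p, t] += 1                                 # rows: clusters, cols: classes
    rows, cols = linear_sum_assignment(w.max() - w)  # Hungarian matching
    return w[rows, cols].sum() / y_pred.size
```

Roughly speaking, v2 solves this matching once over all unlabeled instances and then splits the matched accuracy into Old and New classes, while v1 evaluates the known and novel subsets separately, which is why the two variants can report slightly different numbers.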

Great thanks for your help in advance!

zhenqi-he commented 1 week ago

I just found that the default value of grad_from_block in the Python script is 10, while I directly used contrastive_train.sh, which sets it to 11. This difference might be the root of the issue. I plan to rerun the experiment, this time freezing only the first 10 blocks. https://github.com/SarahRastegar/SelEx/blob/7b5fdc0659d185de7c5c45b653fec162a8402b6b/contrastive_train.sh#L18
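For anyone comparing the two settings, this is roughly the pattern GCD-style training scripts use for grad_from_block (a sketch with assumed names, not this repository's exact code): the whole DINO backbone is frozen first, and only ViT blocks with index >= grad_from_block are unfrozen. With ViT-B/16, grad_from_block = 10 fine-tunes the last two blocks, while 11 fine-tunes only the final one, which could plausibly explain the gap.

```python
import torch.nn as nn

def set_grad_from_block(model: nn.Module, grad_from_block: int = 10) -> None:
    """Freeze the backbone, then unfreeze ViT blocks >= grad_from_block (sketch)."""
    for p in model.parameters():
        p.requires_grad = False                  # freeze everything first
    for name, p in model.named_parameters():
        if 'block' in name:
            block_num = int(name.split('.')[1])  # e.g. 'blocks.11.attn.qkv.weight'
            if block_num >= grad_from_block:
                p.requires_grad = True           # fine-tune only the last blocks
```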

Thank you again for your detailed response—this is truly an amazing project!

fallpavilion commented 1 week ago

I was also troubled by the same issue. I set grad_from_block to 10 and unsupervised_smoothing to 0.5, but there are still discrepancies with the results in the paper, especially when using DINOv2. I have attached the results I reproduced and hope we can discuss the problem together.

SCars, DINOv1: Train ACC Unlabelled_v1: All 0.5702 | Old 0.7741 | New 0.4716; Train ACC Unlabelled_v2: All 0.5658 | Old 0.7786 | New 0.4630
SCars, DINOv2: Train ACC Unlabelled_v1: All 0.7982 | Old 0.9310 | New 0.7340; Train ACC Unlabelled_v2: All 0.7925 | Old 0.9195 | New 0.7311

My replication results currently show a significant gap only on the SCars dataset, while the other datasets are basically within the margin of error.

zhenqi-he commented 1 week ago

Hi, may I ask where you downloaded the SCars dataset? The SCars dataset I downloaded does not include the .csv files, so I used the dataset implementation written in SimGCD instead.

fallpavilion commented 1 week ago

I am also using the SCars dataset from SimGCD; could this be the key reason why we are unable to reproduce the results?

SarahRastegar commented 6 days ago

Hi! Yes, the defaults are indeed used in the Python script, but I also fixed grad_from_block to 10 in the bash script—thanks for pointing that out.

Let me know if this resolves your issue, and feel free to reach out if you have more questions!

zhenqi-he commented 6 days ago

Hi, great thanks for sharing the dataset and for all the work you’ve put into this project. During testing with the dataset you provided, I encountered some issues that I wanted to bring to your attention:

  1. For the dataset you shared, the CSV files should be renamed to cars_test.csv and cars_train.csv for the code to run properly.
  2. Using both the dataset you shared and the original SCars dataset specified in the project, I was able to successfully reproduce the results reported in the paper. My results using DINOv2: Train ACC Unlabelled_v2: All 0.9103 | Old 0.9280 | New 0.9018. The results using DINOv2 with SimGCD's implementation of the SCars dataset are: Train ACC Unlabelled_v2: All 0.8204 | Old 0.9450 | New 0.7454.
  3. The significant discrepancy observed when using the same dataset in different formats is puzzling. To investigate further, I tested the data you provided against the CarsDataset() and get_scars_datasets() implementations in your code. During this testing, I discovered issues such as mix-ups between labeled and unlabeled data and inconsistencies in annotations. For example, using seed 0 on an NVIDIA GeForce RTX 3090, the image 04892.jpg was assigned target ID 5 in the labeled dataset but also appeared in the unlabeled dataset with target ID 2. Upon verification, the correct label for 04892.jpg should indeed be class 2.

I would appreciate it if you could carefully review the SCars data you are using along with the dataset-related code. The following observations may be useful: the metadata you are using appears to be derived from cars_annos.mat by the code linked below, yet there are inconsistencies between the annotations in cars_annos.mat and those in cars_train.csv. For example, 00001.jpg is labeled as class 1 in cars_annos.mat but as class 14 in cars_train.csv. https://github.com/SarahRastegar/SelEx/blob/12d1ac64db8bc1e99eb2a8e78c67c8497778f76b/data/stanford_cars.py#L43
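A quick way to surface mismatches like this is to diff the .mat annotations against the CSV directly. A rough sketch, assuming cars_annos.mat follows the standard Stanford Cars layout (entries with relative_im_path and class fields); the CSV column names here are guesses, so inspect df.columns first:

```python
import pandas as pd
from scipy.io import loadmat

# Standard Stanford Cars metadata: one entry per image with its relative
# path and 1-indexed class label.
annos = loadmat('cars_annos.mat', squeeze_me=True)['annotations']
mat_labels = {str(a['relative_im_path']).split('/')[-1]: int(a['class'])
              for a in annos}

# 'image' and 'class' column names are assumptions; adjust to the real CSV.
df = pd.read_csv('cars_train.csv')
for _, row in df.iterrows():
    fname, csv_label = str(row['image']), int(row['class'])
    if fname in mat_labels and mat_labels[fname] != csv_label:
        print(f'{fname}: cars_annos.mat={mat_labels[fname]} csv={csv_label}')
```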

zhenqi-he commented 5 days ago

Hi, you may download the images from the following link on Kaggle: Stanford Cars Dataset. You can find the annotations (cars_train_annos.mat and cars_test_annos_withlabels.mat) at the following location: Annotations. I used the dataset implementation written in SimGCD.

SarahRastegar commented 4 days ago

Thank you very much for providing the annotations and dataset. It appears that our SCars results, particularly with DINOv2, require updating. I am currently coordinating with ECCV to determine whether an update is feasible, and the revised version will be available on arXiv shortly. Thanks again for your insight and feedback.