zhenqi-he opened this issue 1 week ago
Thank you for your interest in our work and the detailed observations!
Initially, I used `kmeans_pytorch` instead of `SemiSupKMeans()` for quite some time, and it did provide effective clustering; that is why I mentioned it having a positive ripple effect on performance, which eventually led us to switch to the better `SemiSupKMeans()`. In this context, you're correct: `kmeans_pytorch` is now only used for selecting the best checkpoint, so while it still influences performance to some degree, it no longer has the same broad effect on the overall model as before. All our reported results use the best checkpoint for the generic datasets and the last checkpoint for the fine-grained ones.
Are you using PyTorch 2.x? I've also run these two datasets on an NVIDIA GeForce RTX 3090, and your results shouldn't be this low, particularly for SCars. If possible, could you also share the breakdown between known and novel classes? Additionally, you might see improvements with a lower smoothing value, around 0.4. A 3090 should generally yield better performance than this, so let's see if these tweaks help.
Thank you so much for your detailed and prompt reply! I'm still struggling with the SCars dataset and would greatly appreciate any guidance you could provide.
I noticed that the results computed by KMeans (whether `kmeans_pytorch` or sklearn's KMeans) are noticeably lower than the results computed by SemiSupKMeans. For the fine-grained datasets, do you compute the last-checkpoint results with KMeans or with SemiSupKMeans? For example, on Aircraft: last checkpoint with SemiSupKMeans: All 0.5433 | Old 0.6206 | New 0.5046; last checkpoint with KMeans: All 0.4455 | Old 0.4190 | New 0.4588.
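For context, this is the kind of check I am running to compare the two: a minimal sketch (not the SelEx evaluation code) that clusters saved features with sklearn KMeans and scores them with Hungarian-matched clustering accuracy. The feature/label file names here are hypothetical.

```python
# Minimal sketch (not the SelEx evaluation code): cluster saved features with
# sklearn KMeans and score them with Hungarian-matched clustering accuracy,
# so plain KMeans and SemiSupKMeans checkpoints can be compared on the same footing.
import numpy as np
from sklearn.cluster import KMeans
from scipy.optimize import linear_sum_assignment

def cluster_acc(y_true, y_pred):
    """Clustering accuracy via optimal cluster-to-class assignment."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    d = max(y_pred.max(), y_true.max()) + 1
    w = np.zeros((d, d), dtype=np.int64)
    for p, t in zip(y_pred, y_true):
        w[p, t] += 1
    row, col = linear_sum_assignment(w.max() - w)  # maximize matched counts
    return w[row, col].sum() / y_pred.size

# feats: (N, D) features of the unlabelled train split; targets: (N,) class ids
feats = np.load("feats_unlabelled.npy")        # hypothetical file names
targets = np.load("targets_unlabelled.npy")
preds = KMeans(n_clusters=targets.max() + 1, n_init=10, random_state=0).fit_predict(feats)
print(f"KMeans ACC (All): {cluster_acc(targets, preds):.4f}")
```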
I am currently using PyTorch 2.0.0 and running the command below. Will the PyTorch version affect the final results? For the SCars dataset I consistently fail to achieve satisfactory results, which has been troubling me for a long time; I would greatly appreciate any insight into potential issues. The results I obtained with this command are: DINOv1: All 0.4860 | Old 0.6887 | New 0.3978
```bash
python -m methods.contrastive_training.contrastive_training \
    --dataset_name 'scars' \
    --batch_size 128 \
    --grad_from_block 11 \
    --epochs 200 \
    --base_model vit_dino \
    --num_workers 4 \
    --use_ssb_splits 'True' \
    --sup_con_weight 0.35 \
    --weight_decay 5e-5 \
    --contrast_unlabel_only 'False' \
    --transform 'imagenet' \
    --lr 0.1 \
    --eval_funcs 'v1' 'v2' \
    --unsupervised_smoothing 0.5
```
Many thanks in advance for your help!
I just found that the default value of `grad_from_block` in the Python script is 10, while the command I copied from `contrastive_train.sh` sets it to 11. This difference might be the root of the issue. I plan to rerun the experiment, this time freezing only the first 10 blocks.
https://github.com/SarahRastegar/SelEx/blob/7b5fdc0659d185de7c5c45b653fec162a8402b6b/contrastive_train.sh#L18
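For readers wondering why a single block matters: in GCD-style codebases, `grad_from_block` typically controls which ViT blocks are unfrozen for fine-tuning, along the lines of the sketch below (an illustration of the common pattern, not necessarily the exact SelEx code). With a 12-block ViT-B/16, `grad_from_block=10` trains blocks 10 and 11, while 11 trains only the last block.

```python
# Sketch of the common GCD-style partial-finetuning pattern (assumed, not
# copied from SelEx): freeze everything, then unfreeze blocks whose index
# is >= grad_from_block.
def set_trainable_blocks(model, grad_from_block: int):
    for param in model.parameters():
        param.requires_grad = False
    for name, param in model.named_parameters():
        if "block" in name:
            block_num = int(name.split(".")[1])  # e.g. "blocks.11.attn.qkv.weight" -> 11
            if block_num >= grad_from_block:
                param.requires_grad = True
```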
Thank you again for your detailed response—this is truly an amazing project!
I was also troubled by the same issue. I set `grad_from_block` to 10 and `unsupervised_smoothing` to 0.5, but there are still discrepancies with the results in the paper, especially when using DINOv2. I have attached the results I reproduced and hope we can discuss the problem together.

SCars, DINOv1:
Train ACC Unlabelled_v1: All 0.5702 | Old 0.7741 | New 0.4716
Train ACC Unlabelled_v2: All 0.5658 | Old 0.7786 | New 0.4630

SCars, DINOv2:
Train ACC Unlabelled_v1: All 0.7982 | Old 0.9310 | New 0.7340
Train ACC Unlabelled_v2: All 0.7925 | Old 0.9195 | New 0.7311

My replication currently shows a significant gap only on the SCars dataset; the other datasets are basically within the margin of error.
Hi, I would like to ask where you downloaded the SCars dataset. The version I downloaded does not include the .csv files, so I used the dataset implementation written in SimGCD instead.
I am also using the SCars dataset from SimGCD; could this be the key reason why we are unable to reproduce the results?
Hi! Yes, the Python script does use the default of 10, and I have now also fixed `grad_from_block` to 10 in the bash script; thanks for pointing that out.
Let me know if this resolves your issue, and feel free to reach out if you have more questions!
Hi, many thanks for sharing the dataset and for all the work you've put into this project. I've encountered some issues during testing with the dataset you provided that I wanted to bring to your attention:

1. `cars_test.csv` and `cars_train.csv` are needed for the code to run properly.
2. I tested the `CarsDataset()` and `get_scars_datasets()` implementation in your code. During this testing, I discovered issues such as mix-ups between labeled and unlabeled data and inconsistencies in annotations. For example, using seed 0 on an NVIDIA GeForce RTX 3090, the image `04892.jpg` was assigned target ID 5 in the labeled dataset but also appeared in the unlabeled dataset with target ID 2. Upon verification, the correct label for `04892.jpg` should indeed be class 2.

I would appreciate it if you could carefully review the SCars data you are using along with the dataset-related code. The following observations may be useful: the metadata you are using appears to be derived from `cars_annos.mat` by the code shown below, yet there are inconsistencies between the annotations in `cars_annos.mat` and those in `cars_train.csv`. For example, `00001.jpg` is labeled as class 1 in `cars_annos.mat` but is marked as class 14 in `cars_train.csv`.
https://github.com/SarahRastegar/SelEx/blob/12d1ac64db8bc1e99eb2a8e78c67c8497778f76b/data/stanford_cars.py#L43
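To make the mismatch easy to verify, here is a hypothetical consistency check that compares the class of each image in `cars_annos.mat` with the class recorded in `cars_train.csv`. The field and column names are assumptions based on the standard Stanford Cars devkit and may need adjusting for your files.

```python
# Hypothetical consistency check: compare per-image class labels between
# cars_annos.mat and cars_train.csv. Field/column names are assumptions
# based on the standard Stanford Cars devkit and may need adjusting.
import os
import pandas as pd
from scipy.io import loadmat

annos = loadmat("cars_annos.mat", squeeze_me=True)["annotations"]
mat_labels = {os.path.basename(str(a["relative_im_path"])): int(a["class"])
              for a in annos}

csv = pd.read_csv("cars_train.csv")  # assumed columns: 'image', 'class'
mismatches = [(row["image"], mat_labels.get(row["image"]), row["class"])
              for _, row in csv.iterrows()
              if mat_labels.get(row["image"]) not in (None, row["class"])]
print(f"{len(mismatches)} mismatching images, e.g. {mismatches[:5]}")
```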
Hi, you may download the images from the following link on Kaggle: Stanford Cars Dataset. You may find the annotations (`cars_train_annos.mat` and `cars_test_annos_withlabels.mat`) at the following location: Annotations. I used the dataset implementation written in SimGCD.
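For anyone rebuilding the missing CSVs from those annotation files, the train/test `.mat` files can be read roughly as follows. This is a sketch assuming the standard devkit field names; verify them against your downloaded copies.

```python
# Sketch: convert cars_train_annos.mat / cars_test_annos_withlabels.mat into a
# DataFrame and write it out as a CSV. Field names follow the standard
# Stanford Cars devkit layout (an assumption; adjust if your copy differs).
import pandas as pd
from scipy.io import loadmat

def annos_to_df(path):
    annos = loadmat(path, squeeze_me=True)["annotations"]
    return pd.DataFrame({
        "fname": [str(a["fname"]) for a in annos],
        "class": [int(a["class"]) for a in annos],
    })

train_df = annos_to_df("cars_train_annos.mat")
train_df.to_csv("cars_train.csv", index=False)
```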
Thank you very much for providing the annotations and dataset. It appears that our SCars results, particularly with DINOv2, require updating. I am currently coordinating with ECCV to determine whether an update is feasible, and the revised version will be available on arXiv shortly. Thanks again for your insight and feedback.
Hi, thank you for sharing such an innovative project! I've encountered some challenges while attempting to reproduce the results and would appreciate any guidance. Specifically, I've run into issues related to KMeans, kmeans_pytorch, and hyperparameter settings, and haven’t found clear solutions in the documentation or existing issues.
KMeans & pytorch_kmeans: In this comment, you mentioned that PyTorch KMeans provides faster and more effective clustering for generating pseudolabels. However, I noticed that the
kmeans_pytorch
(imported as kmeans) is only called in the functiontest_kmeans()
, which is only used to test the performance on unlabeled train data at the last epoch and it is not called in SemiSupKMeans (imported from methods.clustering.faster_mix_k_means_pytorch.K_Means)(As the default train_report_interval is set to 200, you may see the codes I am talking about through the below link). Could you clarify the intended use of pytorch_kmeans within this context?https://github.com/SarahRastegar/SelEx/blob/7b5fdc0659d185de7c5c45b653fec162a8402b6b/methods/contrastive_training/contrastive_training.py#L621 I also noticed that during the training, for each epoch, the accuracy will be computed by the results ofSemiSupKMeans()
, and at the last epoch, you are usingtest_kmeans()
for the label assignment, I would like to ask whether you use the best performance during training or use the performance at the last epoch as the final performance in the paper. https://github.com/SarahRastegar/SelEx/blob/7b5fdc0659d185de7c5c45b653fec162a8402b6b/methods/contrastive_training/contrastive_training.py#L552Hyperparameter Settings and Performance Variations I attempted to reproduce the experiments for SCars, Aircrafts and CUB with the hyperparameters in README (e.g. unsupervised_smoothing = 1 for CUB, and 0.5 for SCars and Aircrats), while, I only got relatively low performance for Scars (around All: 46) and Aircrats (around All: 55). I noticed that you mentioned in https://github.com/SarahRastegar/SelEx/issues/6#issuecomment-2425996202 that there may be some variations when using different GPU architectures. I am using NVIDIA GeForce RTX 3090 and NVIDIA L40 where I achieved consistent performance on these two GPU architectures. Do you have any suggestions on how I might address this issue? Is it because I am using different GPUs?
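For reference, here is a minimal sketch of how the `kmeans_pytorch` package is typically invoked on extracted features. It illustrates the package's public API; the exact arguments used inside `test_kmeans()` may differ, and the feature tensor here is a placeholder.

```python
# Sketch of the kmeans_pytorch API on extracted features (illustrative only;
# the exact call inside test_kmeans() may differ).
import torch
from kmeans_pytorch import kmeans

feats = torch.randn(5000, 768)   # placeholder for unlabelled-train features
num_clusters = 196               # SCars has 196 classes
cluster_ids, cluster_centers = kmeans(
    X=feats,
    num_clusters=num_clusters,
    distance="euclidean",
    device=torch.device("cuda:0" if torch.cuda.is_available() else "cpu"),
)
```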
Your guidance on this issue would be greatly appreciated. Thank you again for your hard work, and I look forward to your response!