ShirAmir / dino-vit-features

Official implementation for the paper "Deep ViT Features as Dense Visual Descriptors".
https://dino-vit-features.github.io
MIT License

Part co-segmentation comparison on CUB #11


SDNAFIO commented 1 year ago

Hello, is it possible to release the evaluation code for CUB, which reproduces the results presented in the paper?

With the currently available implementation, I'm unfortunately not able to reproduce the results; the NMI and ARI scores I get are much worse.
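
For reference, here is roughly how I compute the two metrics. I assume scikit-learn's clustering measures over per-pixel part labels; the label arrays below are toy stand-ins, and this may well not match your evaluation:

```python
from sklearn.metrics import normalized_mutual_info_score, adjusted_rand_score

# Toy stand-ins for ground-truth part labels and predicted cluster ids,
# gathered over the foreground pixels of all evaluated images.
gt_parts = [0, 0, 1, 1, 2, 2, 2]
pred_parts = [1, 1, 0, 0, 2, 2, 0]

print("NMI:", normalized_mutual_info_score(gt_parts, pred_parts))
print("ARI:", adjusted_rand_score(gt_parts, pred_parts))
```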

Best regards

ShirAmir commented 1 year ago

Hi! Thank you for taking an interest in our paper. We used the same evaluation code and data partitions provided by Choudhury et al. in this link, and replaced their model with our part co-segmentation inference. For the large pairs, you can train the k-means on 1K images instead of the whole validation split, and then apply k-means inference on all the images. Let us know if you have further questions!
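
As a rough sketch, the subsampled fitting could look like this (placeholder descriptor arrays and an illustrative cluster count, not our exact evaluation code):

```python
import numpy as np
from sklearn.cluster import KMeans

# Placeholder descriptors: one (n_patches, d) array per validation image.
per_image_desc = [np.random.rand(100, 64).astype(np.float32) for _ in range(2000)]

# Fit k-means on a 1K-image subsample instead of the whole split.
rng = np.random.default_rng(0)
fit_idx = rng.choice(len(per_image_desc), size=1000, replace=False)
fit_desc = np.concatenate([per_image_desc[i] for i in fit_idx])
kmeans = KMeans(n_clusters=10, n_init=10, random_state=0).fit(fit_desc)

# Apply the fitted k-means to every image in the split.
all_labels = [kmeans.predict(d) for d in per_image_desc]
```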

SDNAFIO commented 1 year ago

Hi, thanks for the response!

In the meantime, I also tried to reproduce the inter-class co-segmentation results for PASCAL VOC, but was not able to reproduce the numbers presented in the paper there either. Unfortunately, none of the related methods listed in Table 2 seem to have published a complete example showing the entire evaluation, which makes it hard to even set up exactly the same data.

In particular:

- Which version of PASCAL VOC was used (2012, ...)?
- Which split of the dataset was used for evaluation (train+val, ...)?
- Was a different split used for training than for evaluation?
- The dataset also contains images without segmentation masks; did you include them in some way when fitting the k-means?
- Some images in the dataset are marked as difficult; were they ignored or included?
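
To illustrate one setup I tried via torchvision (just a sketch of my guess, not necessarily your configuration; the difficult flags live in the detection XML annotations, which this loader does not expose):

```python
from torchvision.datasets import VOCSegmentation

# One candidate setup: VOC 2012, "val" split. By construction this loader
# only yields images that actually have segmentation masks.
dataset = VOCSegmentation(root="data/voc", year="2012",
                          image_set="val", download=True)

img, mask = dataset[0]  # PIL images: RGB input and class-index mask
```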

Additionally, I'm also not sure about the hyperparameters used. Did you use the same ones as in the uploaded Jupyter notebook for co-segmentation, or different ones?

Also, how were the final Jaccard index and mean precision scores computed? The dataset contains a different number of images for each class; did you simply average the scores across classes, or weight them by the number of images?
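
To make the question concrete, these are the two alternatives I have in mind (toy scores and class names, just for illustration):

```python
import numpy as np

# Toy per-image Jaccard scores grouped by class; class sizes differ.
scores_by_class = {
    "aeroplane": [0.70, 0.65],              # 2 images
    "cat":       [0.80, 0.85, 0.75, 0.90],  # 4 images
}

# Option A: mean over classes (each class counts equally).
macro = np.mean([np.mean(v) for v in scores_by_class.values()])

# Option B: mean over all images (classes weighted by their image count).
micro = np.mean(np.concatenate([np.asarray(v) for v in scores_by_class.values()]))

print(f"per-class average: {macro:.3f}")
print(f"image-weighted average: {micro:.3f}")
```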

Thank you once again for your effort. If the evaluation code for the dataset could be published, that would, of course, be best.