Hi, our evaluation setup is detailed in the paper and supplemental material - please check there first and feel free to reopen if you have specific questions.
Hi. You describe the splits of the datasets you evaluated on, but nowhere could I find a discussion of which weights you use to evaluate on the different datasets. In the supplementary material, Sections 1 and 2 cover the evaluation on the different splits, and the later sections discuss ablation experiments and qualitative results. It is still not clear to me which weights you use for each of the validation datasets. Please clarify.
Thanks
See Section 4.1, second sentence.
Thank you. So, are the weights trained on Pitts30k used for testing the Pitts30k test data, the weights from Tokyo247 for testing on the Tokyo247 test data, and similarly for Mapillary? Did you also train NetVLAD on RobotCar and CMU Seasons before testing on their test splits?
You also mention "note that no finetuning was performed on any of the datasets" in the last sentence of Section 4.4. Does this mean that all the results you report in Table 1 are from off-the-shelf models that are not finetuned on any other datasets? And are the "Ours" results finetuned NetVLAD results?
Just to reiterate: "We train the underlying vanilla NetVLAD feature extractor [3] on two datasets: Pittsburgh 30k [80] for urban imagery (Pittsburgh and Tokyo datasets), and Mapillary Street Level Sequences [82] for all other conditions."
We train on two datasets, Pitts30k and MSLS. These models are then tested as described in that sentence: the Pitts30k model on the Pittsburgh (test split) and Tokyo 24/7 (test split) datasets, and the MSLS model on all other datasets.
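To make that mapping explicit, here is a minimal sketch of which trained weights go with which test set. The checkpoint and dataset names below are placeholders for illustration, not the repository's actual identifiers:

```python
# Hypothetical sketch of the train-checkpoint -> test-dataset mapping described above.
# Checkpoint file names are placeholders, not the repository's actual file names.
CHECKPOINT_FOR_TEST_SET = {
    "pitts30k_test": "netvlad_pitts30k.pth",   # trained on Pittsburgh 30k
    "tokyo247": "netvlad_pitts30k.pth",        # urban imagery also uses the Pitts30k model
    "msls_val": "netvlad_msls.pth",            # trained on Mapillary Street Level Sequences
    "robotcar_seasons": "netvlad_msls.pth",    # all other conditions use the MSLS model
    "cmu_seasons": "netvlad_msls.pth",
}

def weights_for(test_set: str) -> str:
    """Return the checkpoint used to evaluate a given test set."""
    return CHECKPOINT_FOR_TEST_SET[test_set]
```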
"No finetuning" means exactly that - we did not use additional data to train beyond the setup described above.
"Ours" are Patch-NetVLAD results.
I ran Patch-NetVLAD trained on Pitts30k on the Mapillary val split. This gives:
NetVLAD: all_recall@1: 0.580, all_recall@5: 0.720, all_recall@10: 0.761, all_recall@20: 0.785
which matches Table 1 under Mapillary (val).
Patch-NetVLAD: all_recall@1: 0.734, all_recall@5: 0.801, all_recall@10: 0.828, all_recall@20: 0.849
which is a little lower than the reported numbers.
When I ran the same test with the Mapillary-trained models:
NetVLAD: all_recall@1: 0.711, all_recall@5: 0.815, all_recall@10: 0.843, all_recall@20: 0.880
Patch-NetVLAD: all_recall@1: 0.808, all_recall@5: 0.865, all_recall@10: 0.884, all_recall@20: 0.904
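For clarity, these recall@N values follow the standard retrieval definition. A minimal sketch with my own variable names, not the repository's exact evaluation code:

```python
import numpy as np

def recall_at_n(predictions, ground_truth, n_values=(1, 5, 10, 20)):
    """Fraction of queries with at least one correct database match in the top N retrievals.

    predictions: (num_queries, k) array of database indices ranked by descriptor similarity.
    ground_truth: per-query arrays of database indices counted as correct matches.
    """
    recalls = {}
    for n in n_values:
        hits = sum(
            np.intersect1d(pred[:n], gt).size > 0
            for pred, gt in zip(predictions, ground_truth)
        )
        recalls[n] = hits / len(predictions)
    return recalls
```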
My doubt is: is it fair to compare NetVLAD results (trained on Pitts30k) with Patch-NetVLAD results (trained on Mapillary) on the same test data? In most scenarios, a model that sees more variety during training performs better than a model that sees fewer kinds of samples, right?
Hello authors.
Thanks for making this work public. I just have a doubt about how exactly you do the testing. Do you train on every training set (Pitts30k, MSLS, RobotCar, CMU Seasons, etc.) and then test on the corresponding test data, or do you train on the Pittsburgh training data alone and test on all the other test datasets (where a few are out of distribution)? Please clarify.
Many thanks!