Hi, our evaluation setup is detailed in the paper and supplemental material - please check there first and feel free to reopen if you have specific questions.
Hi. You describe the splits of the datasets you evaluated on, but nowhere could I find a discussion of which weights you use to evaluate on the different datasets. In the supplementary material, Sections 1 and 2 cover the evaluation on the different splits, and the later sections discuss ablation experiments and qualitative results. It is still not clear to me which weights you use for each of the validation datasets. Please clarify.
Thanks
See Section 4.1, second sentence.
Thank you. So, are the weights trained on Pitts30k used for testing the Pitts30k test data, the weights from Tokyo247 for testing on the Tokyo247 test data, and similarly for Mapillary? Did you also train NetVLAD on RobotCar and CMU Seasons before testing on their test splits?
You also mention "note that no finetuning was performed on any of the datasets" in the last sentence of Section 4.4. Does this mean that all the results you report in Table 1 are from off-the-shelf models that are not finetuned on any other datasets? And are the "Ours" results finetuned NetVLAD results?
Just to reiterate: "We train the underlying vanilla NetVLAD feature extractor [3] on two datasets: Pittsburgh 30k [80] for urban imagery (Pittsburgh and Tokyo datasets), and Mapillary Street Level Sequences [82] for all other conditions."
We train on two datasets, Pitts30k and MSLS. These models are then tested as described in that sentence: the Pitts30k model on the Pittsburgh (test split) and Tokyo 24/7 (test split) datasets, and the MSLS model on all other datasets.
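To make that mapping explicit, here is a minimal sketch of which trained weights go with which test set. The checkpoint and dataset names below are placeholders for illustration, not the repository's actual identifiers:

```python
# Hypothetical sketch of the train-checkpoint -> test-dataset mapping described above.
# Checkpoint file names are placeholders, not the repository's actual file names.
CHECKPOINT_FOR_TEST_SET = {
    "pitts30k_test": "netvlad_pitts30k.pth",   # trained on Pittsburgh 30k
    "tokyo247": "netvlad_pitts30k.pth",        # urban imagery also uses the Pitts30k model
    "msls_val": "netvlad_msls.pth",            # trained on Mapillary Street Level Sequences
    "robotcar_seasons": "netvlad_msls.pth",    # all other conditions use the MSLS model
    "cmu_seasons": "netvlad_msls.pth",
}

def weights_for(test_set: str) -> str:
    """Return the checkpoint used to evaluate a given test set."""
    return CHECKPOINT_FOR_TEST_SET[test_set]
```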
"No finetuning" means exactly that - we did not use additional data to train beyond the setup described above.
"Ours" are Patch-NetVLAD results.
I ran Patch-NetVLAD trained on Pitts30k on the Mapillary val split. This gives:
NetVLAD: all_recall@1: 0.580, all_recall@5: 0.720, all_recall@10: 0.761, all_recall@20: 0.785
which matches Table 1 under Mapillary (val).
Patch-NetVLAD: all_recall@1: 0.734, all_recall@5: 0.801, all_recall@10: 0.828, all_recall@20: 0.849
which is a little lower than the reported numbers.
When I ran the same test with the Mapillary-trained models:
NetVLAD: all_recall@1: 0.711, all_recall@5: 0.815, all_recall@10: 0.843, all_recall@20: 0.880
Patch-NetVLAD: all_recall@1: 0.808, all_recall@5: 0.865, all_recall@10: 0.884, all_recall@20: 0.904
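For clarity, these recall@N values follow the standard retrieval definition. A minimal sketch with my own variable names, not the repository's exact evaluation code:

```python
import numpy as np

def recall_at_n(predictions, ground_truth, n_values=(1, 5, 10, 20)):
    """Fraction of queries with at least one correct database match in the top N retrievals.

    predictions: (num_queries, k) array of database indices ranked by descriptor similarity.
    ground_truth: per-query arrays of database indices counted as correct matches.
    """
    recalls = {}
    for n in n_values:
        hits = sum(
            np.intersect1d(pred[:n], gt).size > 0
            for pred, gt in zip(predictions, ground_truth)
        )
        recalls[n] = hits / len(predictions)
    return recalls
```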
My doubt is: is it fair to compare NetVLAD results (trained on Pitts30k) with Patch-NetVLAD results (trained on Mapillary) on the same test data? In most scenarios, a model that sees more variety during training performs better than a model that sees fewer kinds of samples, right?
Hello authors.
Thanks for making this work public. I just have a doubt about how exactly you do the testing. Do you train on every training set (Pitts30k, MSLS, RobotCar, CMU Seasons, etc.) and then test on the corresponding test data, or do you train on the Pittsburgh training data alone and test on all the other test datasets (where a few are out of distribution)? Please clarify.
Many thanks!