TruongKhang / TopicFM

[AAAI2023] TopicFM: Robust, Efficient, and Interpretable Topic-Assisted Feature Matching
Apache License 2.0

About reproducing performance #16

Closed Master-cai closed 1 year ago

Master-cai commented 1 year ago

Dear author, thanks for your great work! Recently I have been trying to reproduce your results, and I have some problems with it.

For training, I used your default settings (only changing bs=8) to train TopicFM-fast and TopicFM+ on the unprocessed MegaDepth dataset. For testing, I also used the default settings to evaluate on MegaDepth-1500 and ScanNet-1500.

Here are the results: "paper" means the numbers reported in your paper; "pretrained" means testing with the weights you provided; "reproduce" means testing with the weights I trained:

MegaDepth-1500


| name | auc@5 | auc@10 | auc@20 | prec@5e-04 |
|---|---|---|---|---|
| TopicFM-fast (paper) | 56.2 | 71.9 | 82.9 | - |
| TopicFM-fast (pretrained) | 0.5606952 | 0.71539569 | 0.82653555 | 0.93822512 |
| TopicFM-fast (reproduce) | 0.52482495 | 0.68158841 | 0.79560471 | 0.94266812 |
| TopicFM+ (paper) | 58.2 | 72.8 | 83.2 | - |
| TopicFM+ (pretrained) | 0.56792696 | 0.72138904 | 0.83045042 | 0.95216762 |
| TopicFM+ (reproduce) | 0.56648886 | 0.71682282 | 0.82511755 | 0.94402297 |

ScanNet-1500

| name | auc@5 | auc@10 | auc@20 | prec@5e-04 |
|---|---|---|---|---|
| TopicFM-fast (paper) | 19.7 | 36.7 | 52.7 | - |
| TopicFM-fast (pretrained) | 0.18852710073235984 | 0.3612222633840676 | 0.5254263432725385 | 0.7259958315195613 |
| TopicFM-fast (reproduce) | 0.19810854534909592 | 0.3723126894809127 | 0.5360527855679162 | 0.7568452620821726 |
| TopicFM+ (paper) | 20.4 | 38.5 | 54.5 | - |
| TopicFM+ (pretrained) | 0.19944785341139995 | 0.3751533539225028 | 0.5389162160798258 | 0.6811763544889461 |
| TopicFM+ (reproduce) | 0.20168166013885477 | 0.38085154323096326 | 0.5433350728906183 | 0.6783286089835198 |

Here are some questions:

  1. The "pretrained" model is not as accurate as reported(both megaDepth and Scannet, fast and plus). Are there some special parameters and tricks you used when testing to reach the reported performances?
  2. There are still some gaps between "pretrained" and "reproduce" when testing on MegaDepth, even though they are test under the same setting. But the results of "pretrained" and "reproduce" on Scannet are close. I don't know what causes the gaps on MegaDepth.
  3. An unrelated question: when I run the test, I change the batch_size=2 in the scripts (e.g “scripts/reproduce_test/outdoor.sh"), but it seems to be working and the test are still performed one by one(batch_size=1). Is it normal?

Thank you again!

TruongKhang commented 1 year ago

Hi @Master-cai ,

Thank you for your questions. I would like to answer them as follows:

  1. Did you set up your experimental environment as described in the README (CUDA, package versions)? All the parameter settings are provided in the config files. If you already followed the setup in the README, I think the difference is due to the GPU architecture or the randomness of the RANSAC algorithm in OpenCV. In my case, I tested the pre-trained models on Ubuntu 16.04 with an NVIDIA Tesla V100. If you tune the RANSAC thresholds in your environment, you might get better results (a minimal sketch is at the end of this comment). Update: I've tried testing with modern GPUs such as the NVIDIA GeForce RTX series, and the results are similar to yours. For older GPUs such as the NVIDIA Tesla V100 or NVIDIA Titan Xp, the results are similar to the paper's.

  2. This is a good question. For MegaDepth, you can see that TopicFM+ (pretrained) and TopicFM+ (reproduce) are almost identical. For the large gap between TopicFM-fast (pretrained) and TopicFM-fast (reproduce), the reason is that I didn't use the data augmentation step in megadepth.py (see line 93). Using geometric augmentation might reduce the performance on MegaDepth, but it increases the performance on ScanNet (as you can see from TopicFM-fast (reproduce) on ScanNet).

  3. For testing, the batch_size is set to 1 in the evaluation code. This is highly recommended. So your change of batch_size won't make any difference in this case.

I hope this addresses your concerns. Thank you!
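
Regarding the RANSAC thresholds in point 1: below is a minimal sketch of where the threshold enters the OpenCV relative-pose step in LoFTR-style evaluation code (which this repository builds on). The function name, default values, and normalization here are illustrative assumptions rather than the exact code in this repo.

```python
import cv2
import numpy as np

def estimate_pose(kpts0, kpts1, K0, K1, ransac_thresh=0.5, conf=0.99999):
    """Recover relative pose from matched keypoints with RANSAC (sketch)."""
    if len(kpts0) < 5:
        return None
    # Normalize pixel coordinates with the intrinsics so the pixel threshold
    # can be rescaled into normalized image coordinates.
    kpts0 = (kpts0 - K0[[0, 1], [2, 2]][None]) / K0[[0, 1], [0, 1]][None]
    kpts1 = (kpts1 - K1[[0, 1], [2, 2]][None]) / K1[[0, 1], [0, 1]][None]
    norm_thresh = ransac_thresh / np.mean([K0[0, 0], K0[1, 1], K1[0, 0], K1[1, 1]])

    # This is the knob referred to above: loosening or tightening the threshold
    # changes which correspondences RANSAC counts as inliers.
    E, mask = cv2.findEssentialMat(kpts0, kpts1, np.eye(3), method=cv2.RANSAC,
                                   prob=conf, threshold=norm_thresh)
    if E is None:
        return None
    _, R, t, mask = cv2.recoverPose(E, kpts0, kpts1, np.eye(3), mask=mask)
    return R, t, mask.ravel() > 0
```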

Master-cai commented 1 year ago

Thanks for your time and careful answers!

> 1. ...I've tried testing with modern GPUs such as the NVIDIA GeForce RTX series, and the results are similar to yours. For older GPUs such as the NVIDIA Tesla V100 or NVIDIA Titan Xp, the results are similar to the paper's.

I tested on an NVIDIA GeForce RTX 3090 with Ubuntu 20. It's interesting that different GPUs can make such a difference. But when I tested LoFTR, the results I got exactly matched the reported numbers, so I'm wondering why. I will try to tune the RANSAC thresholds.

> 2. For the large gap between TopicFM-fast (pretrained) and TopicFM-fast (reproduce), I didn't use the data augmentation step in megadepth.py (see line 93).

Just to clarify: the TopicFM-fast model you provided was trained without data augmentation, while the plus model was trained with it? Why not keep the fast and plus models the same? It's a little confusing.

> 3. For testing, the batch_size is set to 1 in the evaluation code. This is highly recommended. So your change of batch_size won't make any difference in this case.

I'd like to ask one more question: why is bs=1 recommended? I just tested bs=2 and saw no obvious gap in the results. However, the time cost doesn't decrease, which is unexpected. I guess the reason is that the bottleneck is the metric computation, am I right?

Looking forward to your reply. Thanks!

TruongKhang commented 1 year ago

Hello @Master-cai,

  1. Compared to LoFTR, I upgraded many packages to newer versions, including pytorch-lightning and pytorch. This might be the reason. I'll try to figure this out; any help is welcome! In the paper, I reported the results based on my experimental environment.
  2. I just chose the models that gave the best performance. For TopicFM-fast, the coarse module cannot fit the augmented data well because of its simple merging strategy. For TopicFM+, the coarse module is more robust, which is why I used the geometric rotation for this model to improve the performance on other datasets such as ScanNet, HPatches, etc. My goal in providing two model variants is that you choose TopicFM-fast if you prefer efficiency and TopicFM+ if you prefer accuracy.
  3. I recommend batch_size=1 because the image sizes are generally not the same (a toy illustration follows after this list). If the image size is the same for all images in the dataset, then we can increase the batch_size. And did you change the batch_size in the evaluation code (line 87)? Let me try! Yeah, you're right: the metric computation takes much longer than the matching model.
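
To illustrate point 3, here is a toy example (not code from this repository): PyTorch's default collate function stacks samples with torch.stack, which requires equal shapes, so batching fails when image pairs have different resolutions.

```python
import torch
from torch.utils.data import DataLoader, Dataset

class VariableSizeImages(Dataset):
    """Two dummy 'images' with different resolutions."""
    sizes = [(480, 640), (720, 960)]

    def __len__(self):
        return len(self.sizes)

    def __getitem__(self, idx):
        h, w = self.sizes[idx]
        return torch.zeros(1, h, w)

# batch_size=2 fails: default collate cannot stack tensors of different sizes.
try:
    next(iter(DataLoader(VariableSizeImages(), batch_size=2)))
except RuntimeError as err:
    print("collate failed:", err)

# batch_size=1 works because no stacking across samples is needed.
sample = next(iter(DataLoader(VariableSizeImages(), batch_size=1)))
print(sample.shape)  # torch.Size([1, 1, 480, 640])
```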

Thank you!

Master-cai commented 1 year ago

@TruongKhang

I will try to find out why LoFTR keeps its results consistent across environments, and I will report back in this issue if I find the reason.

> 3. Did you change the batch_size in the evaluation code (line 87)?

Yes, following your guidance, I changed the code and it takes effect:

        self.test_loader_params = {
            'batch_size': args.batch_size,
            'shuffle': False,
            'num_workers': args.num_workers,
            'pin_memory': True
        }

Thanks!

TruongKhang commented 1 year ago

@Master-cai, anyway, thank you for your engaging discussion. I think the lightning framework might still not be stable. I also tried to train with mixed-precision fp16 in the past to reduce the training time and memory, but it ran into some errors that I couldn't solve!

Master-cai commented 1 year ago

@TruongKhang fp16 is very unstable because its range is too small (the maximum value is about 6.5e4). I suggest trying 'bf16', which is much more stable and has a much larger range (I have tried bf16 with your code and it is compatible). However, it needs a GPU with the Ampere architecture (such as the RTX 30 series). I hope it helps you.
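
In case it helps, here is a minimal sketch of where the precision flag would go, assuming a recent pytorch-lightning (roughly >= 1.6) and an Ampere-or-newer GPU; the actual Trainer setup in this repo's training script may differ.

```python
import pytorch_lightning as pl

# bfloat16 mixed precision: same storage size as fp16 but an fp32-like exponent
# range, which avoids the overflow issues mentioned above.
trainer = pl.Trainer(
    accelerator="gpu",
    devices=1,
    precision="bf16",
    max_epochs=30,
)
# trainer.fit(model, datamodule=datamodule)  # model/datamodule as in the repo's training script
```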