cvlab-epfl / disk

Disk code release
Apache License 2.0

questions about disk's features #1

Closed Jemmagu closed 3 years ago

Jemmagu commented 4 years ago

Hi @jatentaki, thank you for your great work! I have some further questions:

  1. Have you tested DISK's results on the Aachen localization benchmark and the ETH benchmark (for 3D reconstruction tasks)? How are the results?
  2. Why did you choose UNet as the feature extraction network? Have you compared it with other networks?
  3. Have you tried feature detection at different layers (i.e., a multi-scale strategy)?
  4. I found that DISK's keypoint count is lower than other SOTA methods', but it still works great. What's the major reason? Is it because of the grid strategy during training?

That's a lot of questions... but I'm really looking forward to your reply. Thanks a lot in advance!

jatentaki commented 4 years ago

Hello.

  1. We have tested on some scenes from the ETH benchmark. The results are reported in our paper (section 4.3, page 8). I am sceptical of this benchmark in general because it does not compare against any ground truth. In fact, I have tried DISK on the Gendarmenmarkt scene, which includes two similar-looking churches, and although I got very competitive numbers (benchmark-wise), the reconstruction was completely wrong because the two churches got merged into a single structure. This particular issue, the collapse of similar structures, is the main problem I found using DISK with COLMAP. I believe it stems partially from our training method: first, most tourist landmarks are old [-> weathered, therefore strongly textured] and asymmetric; second, we only train on pairs of covisible images, which doesn't expose the network to the fact that there may be similar but non-matching structures. Largely, though, I would say this is an issue to be fixed on the SfM algorithm side. COLMAP starts with the initial image and tries to match it with as many others as possible before going on to more distant regions. At the same time, it has no method to recover from initial errors, so two similar-looking buildings will likely be matched together and collapsed, even though the overall structure of image-to-image matches could be explained much better with a different covisibility graph. We have not tested on Aachen.
  2. I have not tested other architectures. UNet is the go-to dense regression network for me, so I just used it and didn't spend time optimizing this aspect.
  3. No.
  4. What do you mean by fewer than SOTA? Which methods do you have in mind, and how do you run DISK (detect.py flags, image resolution, etc.)? Going back to the remark in section 4.3 of our paper, DISK run without limiting feature numbers yields 60k+ detections (which mostly are correctly triangulated into landmarks), more than 4x that of other methods.
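For reference, "without limiting feature numbers" amounts to keeping every local maximum of the score map above some cutoff. A minimal NumPy sketch of that selection step (the window size and threshold here are purely illustrative, not DISK's actual defaults):

```python
import numpy as np

def extract_keypoints(scores: np.ndarray, window: int = 5, threshold: float = 0.0):
    """Keep pixels that are the maximum of their local window and above threshold."""
    h, w = scores.shape
    r = window // 2
    # pad with -inf so border pixels compare correctly against the padding
    padded = np.pad(scores, r, constant_values=-np.inf)
    keypoints = []
    for y in range(h):
        for x in range(w):
            patch = padded[y:y + window, x:x + window]
            if scores[y, x] > threshold and scores[y, x] == patch.max():
                keypoints.append((x, y, scores[y, x]))
    return keypoints

# toy score map with two clear peaks
scores = np.zeros((16, 16), dtype=np.float32)
scores[4, 5] = 1.0
scores[12, 10] = 0.8
kps = extract_keypoints(scores, window=5, threshold=0.1)
print(kps)  # two keypoints, one per peak
```

With no cap on the number of keypoints, the detection count is driven entirely by image resolution and how textured the scene is, which is how you end up with 60k+ detections on large images.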
Jemmagu commented 4 years ago

Hi @jatentaki, regarding the fourth question: I mean that at inference time, for 640*480 images, I can extract about 2.5k~3.5k keypoints without limiting feature numbers (just using detect.py), and the matches are quite good. This is interesting because fewer keypoints can still work well, so I'm curious what the major reason/strategy is. Is it because of the grid strategy during training, or did you not actually intend to reduce the keypoint count at inference time, and DISK just learned a differentiable score map?
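To clarify what I mean by the grid strategy: as I understand the paper, during training the score map is split into fixed-size cells and one keypoint proposal is sampled per cell from a softmax over that cell's scores. A rough NumPy sketch (cell size and shapes are illustrative, not the actual training code):

```python
import numpy as np

def sample_grid_keypoints(scores: np.ndarray, cell: int = 8, rng=None):
    """One keypoint proposal per cell: softmax over the cell's scores, then sample."""
    rng = np.random.default_rng(0) if rng is None else rng
    h, w = scores.shape
    kps = []
    for y0 in range(0, h, cell):
        for x0 in range(0, w, cell):
            patch = scores[y0:y0 + cell, x0:x0 + cell]
            flat = patch.reshape(-1)
            # numerically stable softmax over the cell
            p = np.exp(flat - flat.max())
            p /= p.sum()
            idx = rng.choice(flat.size, p=p)
            dy, dx = divmod(int(idx), patch.shape[1])
            kps.append((x0 + dx, y0 + dy))
    return kps

scores = np.random.default_rng(1).normal(size=(32, 32))
kps = sample_grid_keypoints(scores, cell=8)
print(len(kps))  # 16 proposals: one per 8x8 cell of a 32x32 map
```

If this is what drives the lower keypoint count at inference, or whether inference just thresholds the learned score map, is exactly my question.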

jatentaki commented 3 years ago

For 640*480 I think it's probably mostly the lack of multi-scale extraction. That said, it's difficult to comment without knowing the specifics of the dataset.
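By multi-scale extraction I mean running the detector on an image pyramid and mapping the keypoints back to original-image coordinates. A hand-rolled sketch, not DISK code (the toy detector here just returns the brightest pixel, and a real pipeline would blur before resampling):

```python
import numpy as np

def multiscale_keypoints(image: np.ndarray, detect_fn, scales=(1.0, 0.5, 0.25)):
    """Run a single-scale detector on a downscaled pyramid and map
    keypoints back to original-image coordinates."""
    all_kps = []
    for s in scales:
        h = max(1, int(round(image.shape[0] * s)))
        w = max(1, int(round(image.shape[1] * s)))
        # nearest-neighbour downscale, purely for illustration
        ys = (np.arange(h) / s).astype(int).clip(0, image.shape[0] - 1)
        xs = (np.arange(w) / s).astype(int).clip(0, image.shape[1] - 1)
        small = image[np.ix_(ys, xs)]
        for (x, y) in detect_fn(small):
            all_kps.append((x / s, y / s, s))  # (x, y, detection scale)
    return all_kps

def brightest(img):
    """Toy stand-in for a detector: the single brightest pixel."""
    y, x = np.unravel_index(np.argmax(img), img.shape)
    return [(x, y)]

img = np.zeros((64, 64))
img[40, 24] = 1.0
kps = multiscale_keypoints(img, brightest)
print(kps)  # same point recovered at each scale
```

At small resolutions like 640*480 this kind of pyramid would mostly add coarse-scale detections, which is why I'd expect it to account for part of the gap in keypoint counts.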