Parskatt / DeDoDe

[3DV 2024 Oral] DeDoDe 🎶 Detect, Don't Describe --- Describe, Don't Detect, for Local Feature Matching
MIT License

Could you release a small/tiny/nano version of the detector and descriptor? #30

Open zhongqiu1245 opened 4 months ago

zhongqiu1245 commented 4 months ago

Hello, thank you for your amazing work! I'm really interested in it and want to deploy DeDoDe on mobile devices (laptops, even CPU-only) for some self-driving work. However, DeDoDeDescriptorB and DeDoDeDetectorL are too heavy to run on a mobile device. On my computer (RTX 4060 Mobile, 8 GB) I only get 5.4 fps with 640x480 inputs (tensorrt_fp16). Could you release a small/tiny/nano version of the detector and descriptor? Thank you in advance!

Parskatt commented 4 months ago

Sure. The easiest option, I guess, would be to use VGG11 and reduce the number of layers further. Should be doable, though I'm not sure how much performance will degrade.

zhongqiu1245 commented 4 months ago

I'm aiming for about 30 fps on the RTX 4060 Mobile (8 GB).

Parskatt commented 4 months ago

@zhongqiu1245 could you try out the small detector in the branch that references this issue?

Weights can be found here: https://github.com/Parskatt/DeDoDe/releases/tag/v2

Parskatt commented 4 months ago

It uses a VGG11 backbone, and I reduced the number of layers at each scale from 8 to 4 and cut the dimensionality in half. I think it should be about 3-4x faster than the _L detector. Could you verify?

Parskatt commented 4 months ago

Depending on your application, it might also be possible to increase the framerate by batching. Is this an option for you?

zhongqiu1245 commented 4 months ago

@Parskatt Sorry for the late reply. I will verify this. Thank you!

zhongqiu1245 commented 4 months ago

@Parskatt Thank you for DetectorS! The fps increased a lot, but it is still below 30 fps (15.9 fps with DetectorS + DescriptorB, 640x480, TensorRT fp16).

So I reduced the image size to 320x240, which gives 25 fps. Almost there. Could you release a small version of the descriptor, something like DescriptorS? Maybe that would help DeDoDe break through the 30 fps barrier. Thank you!
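
The two measurements above (15.9 fps at 640x480 and 25 fps at 320x240) can be compared with a little arithmetic. The sketch below is only a rough illustration: it assumes both figures were measured over the same pipeline stages, and the two-point linear model (a fixed cost plus a cost per pixel) is an assumption made here for illustration, not something stated in the thread.

```python
# Compare the two reported operating points (numbers taken from the comment above).
def frame_ms(fps: float) -> float:
    """Convert frames per second to milliseconds per frame."""
    return 1000.0 / fps

full_pixels = 640 * 480
half_pixels = 320 * 240
t_full = frame_ms(15.9)   # DetectorS + DescriptorB at 640x480
t_half = frame_ms(25.0)   # same setup at 320x240

pixel_ratio = full_pixels / half_pixels   # how many times fewer pixels
speedup = t_full / t_half                 # how many times faster per frame

# Assumed model: frame time = fixed_ms + per_pixel_ms * pixels.
# Two data points determine the two unknowns exactly, so this is not a real fit.
per_pixel_ms = (t_full - t_half) / (full_pixels - half_pixels)
fixed_ms = t_half - per_pixel_ms * half_pixels

print(f"pixel ratio: {pixel_ratio:.1f}x, measured speedup: {speedup:.2f}x")
print(f"implied fixed cost per frame: {fixed_ms:.1f} ms")
```

Under these assumptions, a 4x reduction in pixels gives well under a 2x speedup, which suggests that a sizeable part of the frame time does not shrink with resolution. With only two data points this is a hint, not a measurement.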

Parskatt commented 4 months ago

Sure, then I think we can also reduce the descriptor size. Does 128 sound better? Is descriptor dimensionality a concern?

zhongqiu1245 commented 4 months ago

Thank you for your reply! 128 sounds better. Yes, the dimension is an important factor for the network's inference time: the smaller the dimension, the faster the inference. However, if the dimension is too small, matching performance will suffer. I considered dim=64 before, but thought it might be too small. 128 may be better :) Thank you for your generosity!
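
To make the effect of descriptor dimensionality concrete, here is a minimal sketch of how descriptor storage and the cost of a dense similarity matrix scale with the dimension. The keypoint count of 5000 and the starting dimension of 256 are assumptions for illustration only, and the helper names are hypothetical. The sketch covers only storage and matching cost, not the descriptor network's own inference time.

```python
# Rough scaling of descriptor storage and dense matching cost with dimension.
def descriptor_bytes(num_keypoints: int, dim: int, bytes_per_value: int = 2) -> int:
    """Memory for one image's descriptors (default 2 bytes per value, i.e. fp16)."""
    return num_keypoints * dim * bytes_per_value

def similarity_macs(n_a: int, n_b: int, dim: int) -> int:
    """Multiply-accumulates for an n_a x n_b matrix of dim-dimensional dot products."""
    return n_a * n_b * dim

NUM_KEYPOINTS = 5000  # hypothetical keypoint count per image

for dim in (256, 128, 64):  # 256 is an assumed starting point
    kib = descriptor_bytes(NUM_KEYPOINTS, dim) / 1024
    gmacs = similarity_macs(NUM_KEYPOINTS, NUM_KEYPOINTS, dim) / 1e9
    print(f"dim={dim:3d}: {kib:7.1f} KiB per image, {gmacs:.2f} GMACs per image pair")
```

Both quantities are linear in the dimension, so each halving of the dimension halves them. How much accuracy is lost at each size is an empirical question this arithmetic cannot answer.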

zhongqiu1245 commented 4 months ago

Some details:
resolution: (480, 640)
preprocess: 19.606828689575195 ms
detectorS: 16.09945297241211 ms
descriptorB: 29.36267852783203 ms
dualsoftmaxmatcher: 0.6873607635498047 ms
postprocess: 0.14138221740722656 ms
total: 65.89770317077637 ms
fps: 15.207663468720314
detectorS & descriptorB are trt_fp16
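
The per-stage timings above can be turned into a quick breakdown to see where the frame time goes. The sketch below uses only the numbers from the comment; the `fps_without` helper is a hypothetical what-if that treats the named stages as free, which is an optimistic bound and not a prediction.

```python
# Per-stage timings in milliseconds, copied from the comment above.
stage_ms = {
    "preprocess": 19.606828689575195,
    "detectorS": 16.09945297241211,
    "descriptorB": 29.36267852783203,
    "dualsoftmaxmatcher": 0.6873607635498047,
    "postprocess": 0.14138221740722656,
}

def fps_from_ms(ms: float) -> float:
    """Convert milliseconds per frame to frames per second."""
    return 1000.0 / ms

total_ms = sum(stage_ms.values())
shares = {name: ms / total_ms for name, ms in stage_ms.items()}

def fps_without(dropped: set) -> float:
    """Hypothetical what-if: fps if the stages in `dropped` cost nothing."""
    remaining = sum(ms for name, ms in stage_ms.items() if name not in dropped)
    return fps_from_ms(remaining)

# Print stages from most to least expensive.
for name, ms in sorted(stage_ms.items(), key=lambda kv: -kv[1]):
    print(f"{name:>20}: {ms:6.2f} ms ({shares[name]:5.1%})")
print(f"{'total':>20}: {total_ms:6.2f} ms -> {fps_from_ms(total_ms):.2f} fps")
print(f"fps if preprocess were free: {fps_without({'preprocess'}):.2f}")
```

On these numbers, the descriptor accounts for roughly 45% of the frame time, preprocessing for about 30%, and the detector for about 24%, while matching and postprocessing are negligible. The reported fps value appears to correspond to the total without the postprocess stage; dividing 1000 by the full total gives a slightly lower figure.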

Parskatt commented 4 months ago

Okay, so it seems like around 20 fps is at least possible with the current sizes.

Are you able to extract the times for the encoder and decoder parts of the network? Depending on which takes the most time, we might need to change the encoder architecture.
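
One way to collect such per-part timings is a small section timer wrapped around each stage. This is a minimal sketch using only the Python standard library: `SectionTimer`, `fake_encoder` and `fake_decoder` are hypothetical names, and the two fake functions are stand-ins for the real network parts, whose attribute names are not assumed here. For GPU work the timer accepts an optional `sync` callable (for example `torch.cuda.synchronize` in PyTorch), since GPU kernels run asynchronously and would otherwise be under-measured.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

class SectionTimer:
    """Accumulates wall-clock time per named section."""

    def __init__(self, sync=None):
        self.sync = sync  # optional callable that waits for pending GPU work
        self.totals_ms = defaultdict(float)
        self.counts = defaultdict(int)

    @contextmanager
    def section(self, name):
        if self.sync is not None:
            self.sync()
        start = time.perf_counter()
        try:
            yield
        finally:
            if self.sync is not None:
                self.sync()
            self.totals_ms[name] += (time.perf_counter() - start) * 1000.0
            self.counts[name] += 1

    def mean_ms(self, name):
        return self.totals_ms[name] / self.counts[name]

# Stand-ins for the real encoder and decoder calls (hypothetical).
def fake_encoder(frame):
    time.sleep(0.002)
    return frame

def fake_decoder(features):
    time.sleep(0.001)
    return features

timer = SectionTimer()
for _ in range(5):
    with timer.section("encoder"):
        features = fake_encoder("frame")
    with timer.section("decoder"):
        output = fake_decoder(features)

for name in ("encoder", "decoder"):
    print(f"{name}: {timer.mean_ms(name):.2f} ms mean over {timer.counts[name]} runs")
```

In practice the first few iterations should be discarded as warm-up. If the network is exported as a single TensorRT engine, the split may need to be measured in the framework before export or with the runtime's own profiling tools.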

The final thing I guess would be to distill both networks into a single network.

zhongqiu1245 commented 4 months ago

OK, I will try it later.