zhongqiu1245 opened this issue 4 months ago
Sure, the easiest I guess would be using vgg11 and reducing layers further. Should be doable. Not sure how much performance will degrade.
About 30 fps on an RTX 4060 Mobile (8 GB).
@zhongqiu1245 could you try out the small detector in the branch that references this issue?
Weights can be found here: https://github.com/Parskatt/DeDoDe/releases/tag/v2
It uses a VGG11 backbone and I reduced the number of layers at each scale from 8 -> 4 and cut the dimensionality in half. I think it should be about 3-4X faster than the _L detector. Could you verify?
Depending on your application it might also be possible to increase the framerate by batching, is this an option for you?
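The batching suggestion above can be sketched as follows. This is a minimal illustration with numpy, not DeDoDe's actual API: `detect_batch` is a hypothetical stand-in for a real detector forward pass, and the point is only that stacking frames into one `(N, C, H, W)` batch lets a single call amortize per-call overhead (kernel launches, host/device transfers) across frames.

```python
import numpy as np

def detect_batch(batch):
    # Hypothetical stand-in for a detector forward pass: one call
    # processes every frame in the batch, amortizing per-call overhead.
    return batch.mean(axis=(1, 2, 3))  # dummy per-image "score"

# Four 640x480 RGB frames stacked into a single (N, C, H, W) batch
frames = [np.zeros((3, 480, 640), dtype=np.float32) for _ in range(4)]
batch = np.stack(frames)
scores = detect_batch(batch)
print(batch.shape)   # (4, 3, 480, 640)
print(scores.shape)  # (4,)
```

Whether this helps depends on the application: batching adds latency per frame (you wait until N frames are available), so it suits offline or multi-camera pipelines more than a single live stream.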
@Parskatt Sorry for the late reply. I will verify this. Thank you!
@Parskatt Thank you for DetectorS! The fps increased substantially, but it is still below 30 fps (15.9 fps with DetectorS + DescriptorB at 640*480, TensorRT fp16).
So I reduced the input image to 320 * 240, which gives 25 fps, almost there. Could you release a small version of the descriptor, like a DescriptorS? Maybe that would help DeDoDe break the 30 fps barrier. Thank you!
Sure, then I think we can also reduce the descriptor size. Does 128 sound better? Is descriptor dimensionality a concern?
Thank you for your reply! 128 sounds good. Yes, dimensionality is an important factor for inference time: the smaller the dimension, the faster the network. However, if it is too small, matching performance suffers. I had considered dim=64, but that is probably too small; 128 should be better :) Thank you for your generosity!
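One reason descriptor dimension matters beyond the descriptor network itself: the dual-softmax matcher starts from an N x M similarity matrix computed by a matmul whose cost is O(N * M * D), so halving D roughly halves that step (and descriptor memory/bandwidth). A small numpy sketch with made-up keypoint counts:

```python
import numpy as np

N, M = 2000, 2000            # keypoints per image (illustrative)
for D in (256, 128, 64):     # candidate descriptor dimensions
    a = np.random.rand(N, D).astype(np.float32)
    b = np.random.rand(M, D).astype(np.float32)
    # The similarity matrix feeding the dual softmax; its cost
    # scales linearly with D while its output shape does not.
    sim = a @ b.T
    print(D, sim.shape)      # shape stays (2000, 2000) for every D
```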
Some details (DetectorS & DescriptorB run as TensorRT fp16):

- resolution: (480, 640)
- preprocess: 19.61 ms
- detectorS: 16.10 ms
- descriptorB: 29.36 ms
- dualsoftmaxmatcher: 0.69 ms
- postprocess: 0.14 ms
- total: 65.90 ms
- fps: 15.2
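As a sanity check on the numbers above, the stage times do sum to the reported total, and fps is just the reciprocal of the total latency:

```python
# Per-stage wall-clock times reported above, in milliseconds
stage_ms = {
    "preprocess": 19.61,
    "detectorS": 16.10,
    "descriptorB": 29.36,
    "dualsoftmaxmatcher": 0.69,
    "postprocess": 0.14,
}
total_ms = sum(stage_ms.values())
fps = 1000.0 / total_ms
print(f"total: {total_ms:.2f} ms, fps: {fps:.1f}")  # total: 65.90 ms, fps: 15.2
```

Note that DescriptorB alone accounts for almost half the total, which is why a smaller descriptor is the natural next target.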
Okay, so it seems around 20 fps is at least possible with the current sizes.
Are you able to extract the times for the encoder/decoder parts of the network? Depending on which part takes most of the time, we might need to change the encoder architecture.
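Splitting the detector time into encoder vs. decoder could be done with a small stopwatch wrapper like the sketch below. The `encoder`/`decoder` functions here are hypothetical placeholders, not DeDoDe's real module names; with GPU or TensorRT execution you would also need to synchronize the stream before and after each stage, otherwise the wall-clock numbers measure only kernel launch time.

```python
import time

timings = {}

def timed(label, fn, *args):
    # Wrap one stage of the pipeline and record its wall-clock time in ms.
    t0 = time.perf_counter()
    out = fn(*args)
    timings[label] = (time.perf_counter() - t0) * 1000.0
    return out

# Placeholders standing in for the real encoder/decoder calls
def encoder(x):
    time.sleep(0.002)  # pretend work
    return x

def decoder(feats):
    time.sleep(0.001)  # pretend work
    return feats

img = object()
feats = timed("encoder", encoder, img)
out = timed("decoder", decoder, feats)
for label, ms in timings.items():
    print(f"{label}: {ms:.2f} ms")
```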
The final thing I guess would be to distill both networks into a single network.
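The distillation idea above can be sketched as an objective: train a single small student network to reproduce the teacher's descriptors. This is a minimal numpy illustration of one plausible loss (cosine distance between L2-normalized descriptors), with hypothetical shapes; the thread does not specify how the actual distillation would be set up.

```python
import numpy as np

def distill_loss(student_desc, teacher_desc):
    # L2-normalize both sets of descriptors, then penalize
    # 1 - cosine similarity between matched rows.
    s = student_desc / np.linalg.norm(student_desc, axis=1, keepdims=True)
    t = teacher_desc / np.linalg.norm(teacher_desc, axis=1, keepdims=True)
    return float(np.mean(1.0 - np.sum(s * t, axis=1)))

rng = np.random.default_rng(0)
teacher = rng.standard_normal((100, 256)).astype(np.float32)
# A perfect student reproduces the teacher exactly -> loss near 0
perfect = distill_loss(teacher, teacher)
# An untrained student gives a much larger loss
untrained = distill_loss(
    rng.standard_normal((100, 256)).astype(np.float32), teacher)
print(perfect, untrained)
```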
ok, I will try later.
Hello, thank you for your amazing work! I'm really interested in it and want to deploy DeDoDe on mobile devices (laptops, even CPU) for some self-driving work. But I find DeDoDe DetectorL + DescriptorB too heavy for mobile devices: on my machine (RTX 4060 Mobile, 8 GB) I only get 5.4 fps with 640*480 input (TensorRT fp16). Could you release small/tiny/nano versions of the detector and descriptor? Thank you in advance!