facebookresearch / dinov2

PyTorch code and models for the DINOv2 self-supervised learning method.
Apache License 2.0

3D object detection #369

Open hoangsep opened 5 months ago

hoangsep commented 5 months ago

What do you guys think about using multiple cameras with dinov2 for 3D object detection for robotics? Does it make sense?

ccharest93 commented 5 months ago

The model takes one image as input. You can process your multiple images sequentially, but then they wouldn't share any information. There are probably better models out there for that, but it could still be interesting to try.
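
For reference, a minimal sketch of that sequential route, assuming the torch.hub entry points published in this repo (the image file names are placeholders). Note that the per-camera features never interact here:

```python
# Run each camera image through DINOv2 independently (no cross-view information flow).
import torch
from PIL import Image
from torchvision import transforms

model = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
model.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),  # 224 is divisible by the 14-pixel patch size
    transforms.ToTensor(),
    transforms.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
])

camera_paths = ["cam_front.jpg", "cam_left.jpg", "cam_right.jpg"]  # placeholder file names

with torch.no_grad():
    for path in camera_paths:
        img = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
        feats = model.forward_features(img)
        cls_token = feats["x_norm_clstoken"]        # (1, 384) global image descriptor
        patch_tokens = feats["x_norm_patchtokens"]  # (1, 256, 384) one token per 14x14 patch
        print(path, tuple(cls_token.shape), tuple(patch_tokens.shape))
```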

dingkwang commented 5 months ago

That's certainly possible. @hoangsep, we can work on this together.

hoangsep commented 5 months ago

@ccharest93 are you aware of any better model for this task? I am a total noob so I am not sure how this can be done. I wonder how companies like Tesla do 3D object detection.

I am thinking of something like stitching multiple camera images together (maybe side by side) and running them through the network? Or having multiple networks running in parallel, then taking all the outputs (from one of the top layers) and passing them through a second network?
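
Something like this hypothetical sketch of the second option, with a frozen DINOv2 backbone shared across the cameras? The fusion head, its sizes, and the single-box output are purely illustrative assumptions, not anything this repo provides:

```python
import torch
import torch.nn as nn

class MultiCamFusionHead(nn.Module):
    """Fuses per-camera DINOv2 features with self-attention, then regresses one 3D box."""
    def __init__(self, embed_dim: int = 384, num_cameras: int = 3, num_outputs: int = 7):
        super().__init__()
        # Learned embedding so the head knows which camera a token came from.
        self.camera_embed = nn.Parameter(torch.zeros(num_cameras, embed_dim))
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=6, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=2)
        # Toy output: a single (x, y, z, w, h, l, yaw) box per scene.
        self.box_head = nn.Linear(embed_dim, num_outputs)

    def forward(self, per_camera_tokens: torch.Tensor) -> torch.Tensor:
        # per_camera_tokens: (batch, num_cameras, embed_dim), e.g. each view's CLS token.
        x = per_camera_tokens + self.camera_embed.unsqueeze(0)
        x = self.fusion(x)                   # cameras exchange information here
        return self.box_head(x.mean(dim=1))  # pool across cameras and predict the box

# Frozen backbone shared across cameras (dummy inputs just to show the shapes).
backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14").eval()
head = MultiCamFusionHead()
images = torch.randn(2, 3, 3, 224, 224)  # (batch, cameras, channels, H, W)
with torch.no_grad():
    cls = torch.stack(
        [backbone.forward_features(images[:, i])["x_norm_clstoken"] for i in range(3)],
        dim=1,
    )              # (2, 3, 384)
boxes = head(cls)  # (2, 7)
```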

hoangsep commented 5 months ago

@dingkwang I would love to. I am a total noob so I probably won't be able to do much, but I would love to explore this.

ccharest93 commented 5 months ago

I haven't looked at 3D models, but you would probably need something more than stitching. Models are great at learning, but you want to give them as much prior information as possible. Stitching two images together kind of defeats that purpose, since the model would have to learn to unstitch them first (not to mention the poor scaling as the number of images increases; transformer networks don't scale linearly with input size).

I do like the idea of first passing each image through a normal model like DINO and then doing something with the resulting patch embeddings to create information channels between similar patches. As for the exact architecture, that's something you'd have to figure out yourself. A good starting point would be setting up this model in inference mode, passing your image sets through it, and then doing statistical analysis on the resulting patch embeddings.
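
For example, a rough sketch of that starting point (the cosine-similarity comparison at the end is just one illustrative way to inspect the embeddings, not a prescribed recipe):

```python
import torch
import torch.nn.functional as F

model = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14").eval()

# Two dummy views; in practice feed preprocessed images of the same scene from two cameras.
view_a = torch.randn(1, 3, 224, 224)
view_b = torch.randn(1, 3, 224, 224)

with torch.no_grad():
    tokens_a = model.forward_features(view_a)["x_norm_patchtokens"][0]  # (256, 384)
    tokens_b = model.forward_features(view_b)["x_norm_patchtokens"][0]  # (256, 384)

# Basic statistics of the patch embeddings.
print("mean patch norm, view A:", tokens_a.norm(dim=-1).mean().item())
print("mean patch norm, view B:", tokens_b.norm(dim=-1).mean().item())

# Cosine similarity between every patch in view A and every patch in view B;
# high values hint at patches that see the same physical surface from the two cameras.
sim = F.normalize(tokens_a, dim=-1) @ F.normalize(tokens_b, dim=-1).T  # (256, 256)
best_match = sim.argmax(dim=1)  # for each patch in A, its most similar patch in B
print("similarity matrix:", tuple(sim.shape), "first matches:", best_match[:10].tolist())
```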