facebookresearch / vggsfm

VGGSfM: Visual Geometry Grounded Deep Structure From Motion

Support for large datasets #24

Open MatthewDZane opened 4 months ago

MatthewDZane commented 4 months ago

Thanks for the amazing work. We have datasets ranging from 1k to 14k images, and we were wondering whether vggsfm is going to support image counts of that magnitude in the pipeline in the near future. The simplest solution, I am guessing, would be to split the images into overlapping submodels, process each with vggsfm, and then combine the results.
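
For reference, a rough sketch of what I mean by overlapping submodels (the chunk size and overlap here are just placeholder values; the shared frames would later be used to align and merge the per-chunk reconstructions):

```python
from pathlib import Path

def split_into_submodels(image_paths, chunk_size=200, overlap=30):
    """Split an ordered list of image paths into overlapping chunks.

    chunk_size and overlap are placeholder values; the frames shared between
    consecutive chunks would be used to register the reconstructions later.
    """
    submodels = []
    step = chunk_size - overlap
    for start in range(0, len(image_paths), step):
        chunk = image_paths[start:start + chunk_size]
        # Skip a tail chunk that is already fully contained in the previous one.
        if len(chunk) > overlap or start == 0:
            submodels.append(chunk)
        if start + chunk_size >= len(image_paths):
            break
    return submodels

# e.g. images sorted by capture order along the flight path
images = sorted(Path("scene/images").glob("*.jpg"))
for i, chunk in enumerate(split_into_submodels(images)):
    print(f"submodel {i}: {len(chunk)} images")
```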

jytime commented 4 months ago

Hi @MatthewDZane

We will soon support reconstruction for videos with such a large number of frames, leveraging the assumed temporal continuity. For unordered images of that quantity, we may need a bit more time. Is your dataset composed of videos or unordered images?

MatthewDZane commented 4 months ago

Our datasets are composed of ordered images, but not in video format: they were taken by a drone on a flight path, so there is some distance between each photo and the one that follows it. There are also some discontinuities, since we do not take pictures while the drone turns a corner.

jytime commented 4 months ago

Hi, my guess is that it should be possible to solve this, though I cannot guarantee it. I will let you know when the video version is ready.

bhack commented 4 months ago

@jytime I think it could be interesting to verify the robustness at different FPS rates and against other classical video issues like motion blur (MB), defocus, etc.

Also, as we have discussed in https://github.com/facebookresearch/vggsfm/issues/9#issuecomment-2211765097, it may not always be easy to maintain the assumption that all points are static/rigid, especially on long sequences:

- https://github.com/qianduoduolr/DecoMotion
- https://tracks-to-4d.github.io/
- https://chiaki530.github.io/projects/leapvo/
- https://henry123-boy.github.io/SpaTracker/ (see the "Camera pose estimation in dynamic scene" section)

I think that without static/dynamic point clustering or classification (analogous to occlusion/visibility flags), the risk is that videos would require an uncontrolled number of binary masks, one per query frame, which could be very problematic.

bhack commented 4 months ago

Other than this, I will add that some SOTA point trackers still often fail on odometry-like sequences: https://github.com/google-deepmind/tapnet/issues/72

jytime commented 4 months ago

Hi @bhack Yes, I agree with this point. I am testing the effect of using an off-the-shelf video motion segmentation model to filter out the dynamic pixels.
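
Roughly, the idea is something along these lines (a minimal sketch, not the exact pipeline; the mask is assumed to come from whatever segmentation model is used):

```python
import numpy as np

def filter_dynamic_queries(query_points, dynamic_mask):
    """Drop query points that land on dynamic pixels.

    query_points: (N, 2) array of (x, y) pixel coordinates in one frame.
    dynamic_mask: (H, W) boolean array from a motion segmentation model,
                  True where the pixel belongs to a moving object.
    """
    xs = np.clip(query_points[:, 0].round().astype(int), 0, dynamic_mask.shape[1] - 1)
    ys = np.clip(query_points[:, 1].round().astype(int), 0, dynamic_mask.shape[0] - 1)
    keep = ~dynamic_mask[ys, xs]
    return query_points[keep]

# Toy example: queries against a mask with one segmented moving object.
mask = np.zeros((480, 640), dtype=bool)
mask[100:200, 300:400] = True  # hypothetical dynamic region
queries = np.array([[320.0, 150.0], [10.0, 10.0], [350.0, 120.0], [600.0, 400.0]])
static_queries = filter_dynamic_queries(queries, mask)
print(static_queries)
```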

bhack commented 4 months ago

Many motion segmentation models that rely on optical flow networks suffer from the classical video effects described above (defocus/motion blur).

This is also mentioned in the limitations section of the recent ECCV 2024 DecoMotion paper (linked in my previous message). We could probably simulate motion blur/defocus with augmentation over the available datasets, if you have not collected data with these effects. Different FPS rates, on the other hand, could easily be simulated.
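
For example, something along these lines could be a starting point for the augmentation (a rough sketch; the kernel sizes and frame-drop rate are arbitrary, and a disk kernel would model defocus more faithfully than a Gaussian):

```python
import numpy as np
import cv2

def simulate_motion_blur(img, kernel_size=15):
    """Approximate linear motion blur with a horizontal averaging kernel."""
    kernel = np.zeros((kernel_size, kernel_size), dtype=np.float32)
    kernel[kernel_size // 2, :] = 1.0 / kernel_size
    return cv2.filter2D(img, -1, kernel)

def simulate_defocus(img, ksize=9):
    """Approximate defocus blur with a Gaussian kernel."""
    return cv2.GaussianBlur(img, (ksize, ksize), 0)

def simulate_lower_fps(frames, keep_every=3):
    """Drop frames to emulate a lower capture FPS."""
    return frames[::keep_every]
```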