facebookresearch / vggsfm

[CVPR 2024 Highlight] VGGSfM Visual Geometry Grounded Deep Structure From Motion

CUDA OOM Error when running on 31 (700 x 1388) Images #12

Open Nik-V9 opened 1 week ago

Nik-V9 commented 1 week ago

Hey Jianyuan, Nice work! Thanks for sharing the code!

I'm currently trying to benchmark VGGSfM on some of my own data. To get things started, I was using the demo file on a folder with 31 images of size 700 x 1388 (H x W). I was following this issue: https://github.com/facebookresearch/vggsfm/issues/2

I'm using an A100 with 80 GB memory. With the default settings, I run into OOM very early:

[screenshot: CUDA OOM traceback]

When I set cfg.max_ransac_iters to 1024, I still run into OOM but later in the pipeline:

[screenshots: CUDA OOM tracebacks]

Same case with 512:

[screenshots: CUDA OOM tracebacks]

I was wondering if such a high GPU memory cost is expected? What would be the most representative way to benchmark VGGSfM?
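For intuition, here is a back-of-envelope estimate of why dense per-pixel tensors across 31 frames of this resolution add up so quickly. The shapes below are hypothetical (the feature dimension of 128 is an assumption, not VGGSfM's actual internals); the point is only the scaling.

```python
# Back-of-envelope GPU memory estimate. Shapes are illustrative
# placeholders, NOT the real VGGSfM tensor layout.
def tensor_bytes(shape, bytes_per_elem=4):
    """Bytes needed for a dense tensor of the given shape."""
    n = 1
    for s in shape:
        n *= s
    return n * bytes_per_elem

num_frames, h, w = 31, 700, 1388
feat_dim = 128  # assumed feature dimension

# One dense float32 feature map per frame:
fp32 = tensor_bytes((num_frames, feat_dim, h, w))
print(f"dense fp32 features: {fp32 / 1e9:.1f} GB")

# Switching to bf16 (2 bytes/element) halves this:
bf16 = tensor_bytes((num_frames, feat_dim, h, w), bytes_per_elem=2)
print(f"same tensor in bf16: {bf16 / 1e9:.1f} GB")
```

A single such tensor is already in the tens of gigabytes in fp32, so a pipeline holding a few intermediates of this size can exhaust even an 80 GB card.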

jytime commented 1 week ago

Hi Nikhil @Nik-V9 ,

We are aware that the current implementation leads to high GPU memory consumption, and we are preparing a version that will save about half of the memory. It should be released within this week (awaiting approval).

However, this still seems strange given that you have an 80 GB GPU. Would you mind sharing your images with me by email? I can take a look at them.

If you want to try something quickly, here are some flags:

  1. Enable use_bf16 in the config. You can do this because you are using an A100 GPU, and it will reduce GPU memory usage by around 40%.
  2. Reduce query_frame_num to 2 or 1.
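If the demo config accepts Hydra-style command-line overrides (an assumption; check how demo.py actually loads its config), the flags above could be set like this. The flag names come from this thread; `SCENE_DIR` and the override syntax are assumptions:

```shell
# Hypothetical invocation; assumes the demo uses Hydra/OmegaConf
# overrides. SCENE_DIR is a placeholder for your image folder.
python demo.py SCENE_DIR=/path/to/images \
    use_bf16=True \
    query_frame_num=2 \
    max_ransac_iters=1024
```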

I will also update this thread when the memory-friendly version is ready for release.

Nik-V9 commented 1 week ago

Thanks for the quick response! I've emailed the data to you.

Setting the use_bf16 config to True and query_frame_num to 2 seems to work. However, the output is not good. I wonder if these config changes impact the performance?

The data is pretty hard and I don't think it would work out of the box. Let me know your thoughts!

jytime commented 1 week ago

Hey, I got it. I think this is because our demo script was not designed for videos like car driving; moving forward only is a very special case for SfM. Such motions have little rotation over long stretches and are hence quite challenging (also difficult for COLMAP, as far as I know).

But I think there is a way to make our method work on such data. I am happy to write a customised script for this after giving my CVPR talks and presentations. @Nik-V9

Nik-V9 commented 1 week ago

Yup, I agree. However, I guess these kinds of cases are very much representative of how real people capture videos (continuous forward motion coupled with sudden rotations; they don't necessarily do it in the NeRF-y or object-centric way) ;)

I used the following parameters for COLMAP: mapper.init_min_tri_angle = 1 and mapper.init_max_forward_motion = 1.0. Even then, performance on the benchmark is hit or miss.
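For reference, those mapper settings can be passed to COLMAP's incremental mapper on the command line. The `--Mapper.*` option names follow COLMAP's CLI convention; `$DATASET_PATH` is a placeholder:

```shell
# Run COLMAP's incremental mapper with relaxed forward-motion checks.
# $DATASET_PATH is a placeholder for your project directory.
colmap mapper \
    --database_path $DATASET_PATH/database.db \
    --image_path $DATASET_PATH/images \
    --output_path $DATASET_PATH/sparse \
    --Mapper.init_min_tri_angle 1 \
    --Mapper.init_max_forward_motion 1.0
```

Lowering `init_min_tri_angle` lets the mapper initialize from image pairs with small triangulation angles, and raising `init_max_forward_motion` tolerates near-pure forward motion at initialization; both trade robustness for the ability to start on such sequences at all.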

Sounds great; no rush! I'm happy to use the customized version and the other newer version for benchmarking once they are ready :)

jytime commented 4 days ago

Hi @Nik-V9 ,

Here are the results from our memory-efficient version, demonstrated on our Hugging Face demo. This run used 6 query images, 2048 query points, and the first 25 frames.

While our current version effectively handles the forward movement, it struggles with the left turn. As shown in the visualization, the cameras during the left turn are noisy, and the point cloud appears incomplete. This issue likely arises from rapid turning, which reduces image overlap.

Although I can tune the hyperparameters to make our method work on this specific video, it may compromise generalization. A better solution is our upcoming video processing version, which will leverage frame continuity and keep the cameras tracked. I will update here once the video version is released.

[screenshot: reconstruction result, 6 query images and 2048 query points]
jytime commented 4 days ago

Here is the result using 4 query images and 4096 query points:

[screenshot: reconstruction result, 4 query images and 4096 query points]
Nik-V9 commented 4 days ago

Nice, thanks for working on this!

Looking forward to the video version! Would the video version break on non-video-like inputs? For example, two opposing viewpoint trajectories.

jytime commented 3 days ago

It should work, but it will give two separate point clouds.