dcharatan / flowmap

Code for "FlowMap: High-Quality Camera Poses, Intrinsics, and Depth via Gradient Descent" by Cameron Smith*, David Charatan*, Ayush Tewari, and Vincent Sitzmann
https://cameronosmith.github.io/flowmap/
MIT License

OOM issue #13

Closed · pdxmusic closed this issue 2 months ago

pdxmusic commented 2 months ago

First, thanks for your work!

I was wondering how much VRAM this code needs.

I tried running it on a dataset of 300 1920x1080 photos, but even though I'm using a 24 GB RTX 4090, it goes OOM. I also tried reducing the dataset to 200 photos, but the same problem occurred.

dcharatan commented 2 months ago

Using the settings we used for the paper, a 150-image sequence requires about 40 GB of memory. However, it would probably be possible to substantially lower memory usage by doing the following:

  • Freezing the first part of the depth estimator and only fine-tuning later layers
  • Running FlowMap at a lower input resolution

If you're building on top of FlowMap, these things should be fairly easy to implement. If you're only interested in using FlowMap as an off-the-shelf tool, stay tuned for updates, since we'll hopefully be updating the repo with some of these improvements. See #4 for a few more details/advice on how to run FlowMap at a lower resolution.
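
For the lower-resolution route, here's a minimal sketch of pre-downsampling frames before they enter the pipeline; the helper name and the 0.5 scale factor are my own illustration, not code from the repo:

```python
import torch
import torch.nn.functional as F

def downsample_frames(frames: torch.Tensor, scale: float = 0.5) -> torch.Tensor:
    """Downsample a (B, C, H, W) batch of frames to cut peak memory.

    Per-pixel buffers (optical flow, depth, correspondence losses) scale
    roughly linearly with pixel count, so halving each dimension cuts
    those buffers by about 4x.
    """
    return F.interpolate(
        frames, scale_factor=scale, mode="bilinear", align_corners=False
    )
```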

leo-frank commented 2 months ago

Hello, have you tested the performance difference when using this configuration?

Personally, I'm not sure what effect fine-tuning only specific layers of the network would have. Do you have any insights on this?

kk6398 commented 2 months ago

> Hello, have you tested the performance difference when using this configuration?
>
>   • Freezing the first part of the depth estimator and only fine-tuning later layers
>
> Personally, I'm not sure what effect fine-tuning only specific layers of the network would have. Do you have any insights on this?

I tried using `xxx --low_memory` to get the output, then used the output file as the input to 3DGS. However, the performance is much lower than what the paper reports.

dcharatan commented 2 months ago

@leo-frank, I haven't tested this configuration yet. The thought is that earlier layers of the network mainly do feature extraction, while the later layers do depth-related computation. Since feature extraction should be universal, it might make sense to freeze these layers to save ~50% of the computation while hopefully not impacting the network's expressiveness too much.
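
To make that idea concrete, here's a minimal PyTorch sketch of freezing the early layers of a pretrained depth network; `load_pretrained_depth_model` and the `encoder`/`decoder` split are hypothetical stand-ins, not FlowMap's actual module layout:

```python
import torch

# Hypothetical depth estimator split into an encoder (feature
# extraction) and a decoder (depth prediction).
model = load_pretrained_depth_model()  # hypothetical loader

# Freeze the feature-extraction layers so they receive no gradients.
for param in model.encoder.parameters():
    param.requires_grad = False
model.encoder.eval()  # also fix batch-norm statistics, if any

# Hand only the trainable (decoder) parameters to the optimizer.
optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4
)

def forward(images: torch.Tensor) -> torch.Tensor:
    # Running the frozen encoder under no_grad avoids storing its
    # activations for the backward pass, which is where most of the
    # memory (not just compute) savings come from.
    with torch.no_grad():
        features = model.encoder(images)
    return model.decoder(features)
```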

@kk6398 There are a few things to note: