Hello,
I would be very interested in expanding the context to include more than two images, but it is still a little unclear how best to do this, particularly if we want to perform cross-attention between all of the images. We base our work on MASt3R which only supports directly predicting point clouds from two images at a time, and expanding it to include more than two images is non-trivial. See this issue for more info: https://github.com/btsmart/splatt3r/issues/7. We do plan to explore this more in the future though.
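To make the "cross-attention between all of the images" idea concrete, here is a rough, hypothetical PyTorch sketch (not Splatt3R or MASt3R code) of one way each view's tokens could attend to the tokens of every other view rather than to a single paired view; note that the attention cost grows quickly with the number of views:

```python
# Hypothetical multi-view cross-attention layer: each view's tokens attend to the
# concatenated tokens of all other views. Illustrative only; not the Splatt3R design.
import torch
import torch.nn as nn


class MultiViewCrossAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_views, num_tokens, dim)
        b, v, n, d = tokens.shape
        out = []
        for i in range(v):
            query = tokens[:, i]  # tokens of view i: (batch, num_tokens, dim)
            # Keys/values come from every other view, flattened into one sequence.
            others = torch.cat([tokens[:, j] for j in range(v) if j != i], dim=1)
            attended, _ = self.attn(query, others, others)
            out.append(self.norm(query + attended))  # residual connection
        return torch.stack(out, dim=1)  # (batch, num_views, num_tokens, dim)


# Quick shape check: 2 scenes, 4 views, 196 patch tokens, 768-dim features.
x = torch.randn(2, 4, 196, 768)
print(MultiViewCrossAttention(dim=768)(x).shape)  # torch.Size([2, 4, 196, 768])
```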
The checkpoint we have shared and the results we have reported come from training for ~10 hours on 4 A6000 GPUs (48 GB each); however, I have been able to run training and testing on a single 12 GB RTX 2080 Ti. Because we use MASt3R as a backbone, our model converges quite quickly, so I suspect you could get meaningful results in under 24 hours on a smaller GPU.
Hope this helps!
Thanks for the answer! So do you think that, if we can extend MASt3R to handle multiple images, the same technique would let us extend Splatt3R as well?
Potentially. Our architecture is only a minor modification of theirs, so most changes to MASt3R should also work with our method. For what it's worth, MASt3R does handle more than two images, but it does so by evaluating each pair of images with the model and then performing global alignment to merge all of the pairwise predictions, which I have not yet implemented for Splatt3R for the reasons given in https://github.com/btsmart/splatt3r/issues/7.
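For reference, a minimal sketch of that pairwise-prediction-plus-global-alignment workflow, written against the public DUSt3R API that MASt3R builds on (the MASt3R repo has its own sparse global alignment, so the exact entry points there differ; the image paths and hyperparameters below are placeholders):

```python
# Sketch using the DUSt3R API (which MASt3R extends): predict pointmaps for every
# image pair, then optimize a global alignment that merges the pairwise predictions.
import torch
from dust3r.inference import inference
from dust3r.model import AsymmetricCroCo3DStereo
from dust3r.utils.image import load_images
from dust3r.image_pairs import make_pairs
from dust3r.cloud_opt import global_aligner, GlobalAlignerMode

device = "cuda" if torch.cuda.is_available() else "cpu"
model = AsymmetricCroCo3DStereo.from_pretrained(
    "naver/DUSt3R_ViTLarge_BaseDecoder_512_dpt"
).to(device)

# Placeholder paths: any number of images of the same scene.
images = load_images(["view1.png", "view2.png", "view3.png"], size=512)

# Build every (symmetrized) pair of views and run the two-view model on each pair.
pairs = make_pairs(images, scene_graph="complete", prefilter=None, symmetrize=True)
output = inference(pairs, model, device, batch_size=1)

# Global alignment: jointly optimize per-view poses and depths so the pairwise
# pointmaps agree in a single world frame.
scene = global_aligner(output, device=device, mode=GlobalAlignerMode.PointCloudOptimizer)
scene.compute_global_alignment(init="mst", niter=300, schedule="cosine", lr=0.01)

poses = scene.get_im_poses()   # per-view camera poses
pts3d = scene.get_pts3d()      # merged 3D points, one pointmap per view
```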
You might also be interested in pixelSplat (https://arxiv.org/pdf/2312.12337). They also use a feed-forward model to predict Gaussians (from posed images), and they experiment with using three views as input.
Hi!
First of all, congrats on the awesome paper!! I was wondering whether the training could be changed to allow a dynamic context, i.e. instead of fixing the input to 2 images, allow anywhere from 1 to N images. Also, could you provide information about the compute used to train the models? I would like to explore this dynamic-context idea building on this work.