Closed: mmaaz60 closed this issue 1 year ago
Hey @mmaaz60,
So the GPU memory requirement mostly depends on two factors. First, the voxel resolution: even with a sparse voxel space and surface meshes, the memory requirement grows quadratically with resolution. Second, the model's parameter count. I haven't checked their codebase, but skimming the implementation section, they used a significantly smaller backbone, which is probably why they can use a larger batch size. It is definitely possible to use the same backbone with this project too: just pick a model of your preference from here and either modify the last layer to match the CLIP output dimensions, or project the CLIP features directly.
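As a rough sketch of the second option (projecting CLIP features into the backbone's output space), here is what a single linear projection looks like; the dimensions (512 for CLIP, 96 for the backbone) and all variable names are illustrative, not taken from either codebase:

```python
import numpy as np

# Hypothetical dimensions: 512-d CLIP embeddings, 96-d backbone output.
clip_dim, backbone_dim = 512, 96

# A single learned linear map would replace this random initialization.
W = np.random.randn(clip_dim, backbone_dim) * 0.01

clip_features = np.random.randn(10, clip_dim)  # 10 example feature vectors
projected = clip_features @ W                  # shape (10, 96)
```

In practice this projection layer would be trained jointly with the backbone so its outputs live in the same space as the CLIP targets.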
I ran most of my experiments on A6000s (48GB) and didn't have any problems with a batch size of 2. The thing is, it's hard to pick a batch size with shuffled datasets, as scene sizes vary a lot, and sampling two large scenes together can be significantly more demanding than two small or average ones.
But if you run into problems, I would suggest the following:

- Use the `--train_limit_numpoints` flag, which truncates a batch if the cumulative voxel count of its scenes exceeds your limit. This number should be set according to your system and model parameter count, but you can easily test at what number you get an OOM in the forward/backward pass.
- Wrap `model_step` in a try-catch block for CUDA OOM and skip the training step in those cases. A similar idea was implemented in the Detectron2 framework.

Let me know if this helps, and all pull requests are welcome :)
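The batch-truncation idea behind `--train_limit_numpoints` can be sketched roughly like this; the function and variable names here are illustrative, not the repo's actual API:

```python
def truncate_batch(scenes, limit_numpoints):
    """Keep adding scenes to the batch while the cumulative point count
    stays under the limit; drop the rest of the batch otherwise."""
    kept, total = [], 0
    for scene in scenes:
        n = len(scene)  # number of points/voxels in this scene
        # Always keep at least one scene, even if it alone exceeds the limit.
        if kept and total + n > limit_numpoints:
            break
        kept.append(scene)
        total += n
    return kept
```

The limit is best found empirically: raise it until the forward/backward pass hits OOM on your hardware, then back off.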
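And the second suggestion, skipping a step on CUDA OOM, is a small wrapper around the training step; `safe_step` and `step_fn` are stand-in names, not functions from this codebase:

```python
def safe_step(step_fn, batch):
    """Run one training step; skip the batch on CUDA OOM instead of crashing."""
    try:
        return step_fn(batch)
    except RuntimeError as e:
        if "out of memory" in str(e):
            # In a real PyTorch loop you would also zero the gradients
            # (optimizer.zero_grad(set_to_none=True)) and release cached
            # blocks with torch.cuda.empty_cache() before continuing.
            return None  # signal that this batch was skipped
        raise  # re-raise anything that is not an OOM
```

Matching on the exception message is how PyTorch OOMs are commonly detected, since they surface as a `RuntimeError` containing "out of memory".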
Cheers, David
Hi @RozDavid,
Thank you for the great work. I was wondering why this codebase requires so much GPU memory? For example, sometimes a 40GB GPU isn't enough for a batch size of 2. Also, if I scale the number of GPUs (for example, to 4), the memory utilization of the first GPU increases a lot and quickly gives an OOM error. I just came across this paper (https://arxiv.org/pdf/2211.15654.pdf), which seems to be a more complex architecture than this one, yet they claim to use a batch size of 8 for ScanNet on a single 40GB GPU.
Do you have any suggestions for improving the memory utilization of the codebase, other than cropping the voxels? Thanks