Closed: mmaaz60 closed this issue 1 year ago
Hey @mmaaz60,
So the GPU memory requirement mostly depends on two factors. First, the voxel resolution: even with a sparse voxel space and surface meshes, the memory requirement grows quadratically with resolution. Second, the model's parameter count. I haven't checked their codebase, but skimming the implementation section, they used a significantly smaller backbone, which is probably why they can use a larger batch size. It is definitely possible to use the same backbone with this project too: just pick a model of your preference from here and either modify the last layer to match the CLIP output dimensions, or project the CLIP features directly.
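As a rough sketch of the second option (projecting CLIP features into the backbone's output space), here is what a single linear projection looks like; the dimensions (512 for CLIP, 96 for the backbone) and all variable names are illustrative, not taken from either codebase:

```python
import numpy as np

# Hypothetical dimensions: 512-d CLIP embeddings, 96-d backbone output.
clip_dim, backbone_dim = 512, 96

# A single learned linear map would replace this random initialization.
W = np.random.randn(clip_dim, backbone_dim) * 0.01

clip_features = np.random.randn(10, clip_dim)  # 10 example feature vectors
projected = clip_features @ W                  # shape (10, 96)
```

In practice this projection layer would be trained jointly with the backbone so its outputs live in the same space as the CLIP targets.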
I ran most of my experiments on A6000s (48GB) and didn't have any problems with a batch size of 2. The thing is, it's hard to pick a batch size with shuffled datasets, as scene sizes vary a lot, and sampling two large scenes together can be significantly more demanding than two small or average ones.
But if you run into problems, I would suggest the following:

- Use the `--train_limit_numpoints` flag, which truncates a batch if the cumulative voxel count of its scenes exceeds your limit. This number should be set according to your system and model parameter count, but you can easily test at what number you get an OOM in the forward/backward pass.
- Wrap `model_step` in a try-catch block for CUDA OOM and skip the training step in those cases. A similar idea was implemented in the Detectron2 framework.

Let me know if this helps, and all pull requests are welcome :)
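The batch-truncation idea behind `--train_limit_numpoints` can be sketched roughly like this; the function and variable names here are illustrative, not the repo's actual API:

```python
def truncate_batch(scenes, limit_numpoints):
    """Keep adding scenes to the batch while the cumulative point count
    stays under the limit; drop the rest of the batch otherwise."""
    kept, total = [], 0
    for scene in scenes:
        n = len(scene)  # number of points/voxels in this scene
        # Always keep at least one scene, even if it alone exceeds the limit.
        if kept and total + n > limit_numpoints:
            break
        kept.append(scene)
        total += n
    return kept
```

The limit is best found empirically: raise it until the forward/backward pass hits OOM on your hardware, then back off.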
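And the second suggestion, skipping a step on CUDA OOM, is a small wrapper around the training step; `safe_step` and `step_fn` are stand-in names, not functions from this codebase:

```python
def safe_step(step_fn, batch):
    """Run one training step; skip the batch on CUDA OOM instead of crashing."""
    try:
        return step_fn(batch)
    except RuntimeError as e:
        if "out of memory" in str(e):
            # In a real PyTorch loop you would also zero the gradients
            # (optimizer.zero_grad(set_to_none=True)) and release cached
            # blocks with torch.cuda.empty_cache() before continuing.
            return None  # signal that this batch was skipped
        raise  # re-raise anything that is not an OOM
```

Matching on the exception message is how PyTorch OOMs are commonly detected, since they surface as a `RuntimeError` containing "out of memory".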
Cheers, David
Hi @RozDavid,
Thank you for the great work. I was wondering why this codebase requires so much GPU memory? For example, sometimes a 40GB GPU isn't enough for a batch size of 2. Also, if I scale the number of GPUs (for example, to 4), the memory utilization of the first GPU increases a lot and quickly gives an OOM error. I just came across this paper (https://arxiv.org/pdf/2211.15654.pdf), which seems to be a more complex architecture than this one, yet they claim to use a batch size of 8 for ScanNet on a single 40GB GPU.
Do you have any suggestions for improving the memory utilization of the codebase, other than cropping the voxels? Thanks