RozDavid / LanguageGroundedSemseg

Implementation for ECCV 2022 paper Language-Grounded Indoor 3D Semantic Segmentation in the Wild

RuntimeError: CUDA out of memory after many epochs #15

Closed MurphyJUAN closed 1 year ago

MurphyJUAN commented 1 year ago

Hi, I am running distributed training for semantic segmentation on 2 RTX A6000 GPUs (48685 MiB each). With a batch size of 4, an OOM error suddenly pops up in the middle of epoch 2. As far as I know, if the batch size were simply too large, wouldn't even the first epoch fail to run? How can I solve this? Could there be a memory leak? Thanks!
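(For anyone debugging a similar situation, a minimal sketch of how one might check for a GPU memory leak across epochs, assuming a standard PyTorch training loop; `train_one_epoch` is a placeholder for your own epoch function, not part of this repo.)

```python
import torch

def run_with_memory_logging(num_epochs, train_one_epoch, device=0):
    """Log peak CUDA memory per epoch: a steadily rising peak suggests a leak,
    while a one-off spike points at an unusually large batch or scene."""
    for epoch in range(num_epochs):
        torch.cuda.reset_peak_memory_stats(device)
        train_one_epoch(epoch)
        peak_gb = torch.cuda.max_memory_allocated(device) / 1024 ** 3
        print(f"epoch {epoch}: peak CUDA memory {peak_gb:.2f} GiB")
```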

RozDavid commented 1 year ago

Hey,

I don't think there is a memory leak; I was using the same setup for most of my experiments with no errors. Do you have anything else running on your cluster? Also, are you sure this is a CUDA OOM error and not a CPU OOM error? With too many dataloader workers, the system can kill the process automatically.
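(A rough sketch of how one could tell the two apart while training runs; using `psutil` here is an assumption, it is not a dependency of this repo.)

```python
import psutil
import torch

def log_memory(device=0):
    """Print host RAM and CUDA memory side by side. If host RAM climbs toward
    100% while CUDA stays flat, the kill likely comes from the CPU OOM killer,
    often triggered by too many dataloader workers."""
    ram = psutil.virtual_memory()
    alloc_gb = torch.cuda.memory_allocated(device) / 1024 ** 3
    reserved_gb = torch.cuda.memory_reserved(device) / 1024 ** 3
    print(f"host RAM: {ram.percent:.1f}% used | "
          f"CUDA allocated: {alloc_gb:.2f} GiB, reserved: {reserved_gb:.2f} GiB")
```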

If you are confident that this is a CUDA memory error and nothing else is running, you could set the limit_max_number_of_points flag to a smaller hard limit and truncate scenes that are too large to be sampled together.
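(For illustration only, a minimal sketch of the kind of truncation that flag performs: randomly subsample any scene above a hard point cap before batching. The function name, array layout, and default cap are assumptions, not the repo's actual implementation.)

```python
import numpy as np

def truncate_scene(coords, feats, labels, max_points=800_000):
    """Randomly subsample a scene exceeding the point cap so that very
    large scenes cannot blow up GPU memory when batched together."""
    n = coords.shape[0]
    if n <= max_points:
        return coords, feats, labels
    keep = np.random.choice(n, size=max_points, replace=False)
    return coords[keep], feats[keep], labels[keep]
```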