Closed: MurphyJUAN closed this issue 1 year ago
Hey,
I don't think there is a memory leak; I was using the same setup for most of my experiments with no errors. Do you have anything else running on your cluster? Also, are you sure this is a CUDA OOM and not a CPU (host) OOM? With too high a number of dataloader workers, the system can also kill the process automatically.
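To tell the two apart: a CUDA OOM surfaces as a Python exception with a traceback, while a host OOM usually kills the process outright with no traceback at all (check `dmesg` for oom-kill entries). A minimal sketch, assuming PyTorch; the training-step shape here is illustrative:

```python
import torch

def train_step(model, batch, optimizer):
    try:
        loss = model(batch)  # illustrative: assumes the model returns a loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    except torch.cuda.OutOfMemoryError:  # plain RuntimeError on PyTorch < 1.13
        # Dump a per-device allocation report before re-raising, so the log
        # shows how much CUDA memory was actually held when the step failed.
        print(torch.cuda.memory_summary())
        raise
```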
If you are confident, though, that this is a CUDA memory error and you don't have anything else running, you could set the `limit_max_number_of_points` flag with a smaller hard limit so that overly large scenes are truncated before being sampled together.
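As a rough illustration of what such a hard limit amounts to (the names below are mine, not the repository's actual implementation), truncation just subsamples any scene above the cap before batching:

```python
import torch

def truncate_scene(points: torch.Tensor, max_points: int) -> torch.Tensor:
    """Randomly subsample an (N, C) scene that exceeds the hard limit."""
    n = points.shape[0]
    if n <= max_points:
        return points
    keep = torch.randperm(n)[:max_points]  # uniform random subset of rows
    return points[keep]
```

A lower cap bounds the worst case, which is what matters for OOM: peak memory is driven by the largest scenes sampled together, not by the average.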
Hi, I am doing distributed training for semantic segmentation on 2 RTX A6000 GPUs (48685 MiB each). With the batch size set to 4, an OOM error suddenly pops up in the middle of epoch 2. But as far as I know, if the batch size were too large, shouldn't even the first epoch fail to run? How can I solve this issue? Is there a possibility of a memory leak? Thanks!
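For reference, a minimal sketch (assuming PyTorch) of the per-step logging I could add to check whether allocated memory actually grows over time:

```python
import torch

def log_cuda_memory(step: int, device: int = 0, every: int = 100) -> None:
    """Print allocated/reserved CUDA memory every `every` steps."""
    if step % every != 0:
        return
    allocated = torch.cuda.memory_allocated(device) / 2**20  # bytes -> MiB
    reserved = torch.cuda.memory_reserved(device) / 2**20
    print(f"step {step}: allocated {allocated:.0f} MiB, reserved {reserved:.0f} MiB")
```

If the allocated number climbs steadily across epochs, something is accumulating (e.g. losses kept with their graphs attached); if it stays flat until one step suddenly spikes, the OOM is data-dependent.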