SilenKZYoung / CuboidAbstractionViaSeg

This repo is a PyTorch implementation of the paper "Unsupervised Learning for Cuboid Shape Abstraction via Joint Segmentation from Point Clouds".

Batch Size - Out of Memory error? -> with newer PyTorch version #9

Closed: GregorKobsik closed this issue 10 months ago

GregorKobsik commented 10 months ago

Hi @SilenKZYoung ,

I am currently trying to evaluate your model. Unfortunately, a batch size of 32 definitely does not fit into 11 GB of VRAM, and neither does 16. I could only run the training with a batch size of 8.

I used an RTX 2080 Ti. Could you please tell me how you fit your model on a GTX 1080 Ti?

I will try to get my hands on a GTX 1080 Ti and try it again.

ERROR MESSAGE:

Traceback (most recent call last):
  File "/clusterstorage/gkobsik/learning-relationships/./CuboidAbstractionViaSeg/E_train.py", line 344, in <module>
    main(args)
  File "/clusterstorage/gkobsik/learning-relationships/./CuboidAbstractionViaSeg/E_train.py", line 123, in main
    loss, loss_dict = loss_func(points, normals, outdict, None, hypara)
  File "/clusterstorage/gkobsik/anaconda3/envs/learning_relationships/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/clusterstorage/gkobsik/learning-relationships/CuboidAbstractionViaSeg/losses.py", line 62, in forward
    randn_dis = (torch.randn((batch_size,num_points)) * self.std).cuda().detach()
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
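
For reference, the line the traceback points at allocates the noise tensor on the CPU and then copies it to the GPU with .cuda(). Whether or not it is the real culprit (the message itself warns that the stack trace may be reported at the wrong call), the allocation can be done directly on the device. A minimal sketch, with batch_size, num_points and std stubbed in as placeholders for the values used in losses.py:

import torch

# Hypothetical stand-ins for the quantities visible in the traceback.
batch_size, num_points, std = 8, 2048, 0.1
points = torch.zeros(batch_size, num_points, 3, device="cuda")

# Allocate the noise directly on the same device as `points` instead of
# creating it on the CPU and moving it over with .cuda().
randn_dis = (torch.randn((batch_size, num_points), device=points.device) * std).detach()
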
GregorKobsik commented 10 months ago

Same issue on an RTX 3090 with 24 GB of VRAM. It only works with a batch size of 8.

GregorKobsik commented 10 months ago

I downgraded the environment to an older version (i.e. PyTorch 1.5.1 and CUDA 10.2).

Running with a batch size of 32 results in an error:

RuntimeError: CUDA out of memory. Tried to allocate 2.00 GiB (GPU 0; 10.75 GiB total capacity; 6.71 GiB already allocated; 1.95 GiB free; 8.02 GiB reserved in total by PyTorch)
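
To see where the memory actually goes between the two PyTorch versions, it helps to log the allocator statistics around the forward and backward pass: allocated memory reflects live tensors, while reserved memory is what the caching allocator keeps (the "reserved in total by PyTorch" figure in the OOM message); a large gap between the two points at fragmentation rather than at genuinely larger activations. A rough sketch, where the commented loop lines are placeholders and not the repo's actual code:

import torch

def log_cuda_memory(tag):
    # allocated = memory held by live tensors; reserved = memory held by the
    # caching allocator (reported as "reserved in total" in the OOM message).
    alloc = torch.cuda.memory_allocated() / 2**20
    reserved = torch.cuda.memory_reserved() / 2**20
    print(f"[{tag}] allocated: {alloc:.0f} MiB, reserved: {reserved:.0f} MiB")

# Hypothetical placement inside the training loop of E_train.py:
#   log_cuda_memory("before forward")
#   loss, loss_dict = loss_func(points, normals, outdict, None, hypara)
#   log_cuda_memory("after forward")
#   loss.backward()
#   log_cuda_memory("after backward")
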

GregorKobsik commented 10 months ago

P.S.

I managed to run the code on an RTX 2080 Ti with a batch size of 31, so I suppose the difference can be attributed to some inconsistency related to the GPU architecture.

It is still strange that a newer PyTorch version needs so much more memory that I have to reduce the batch size to as low as 8 to be able to run the code.
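
As a workaround until the memory behaviour is understood, gradient accumulation reproduces the effective batch size of 32 with micro-batches of 8 that do fit. A minimal, self-contained sketch; the linear model, dummy loss and dummy loader are stand-ins for the repo's actual network, loss_func and data loader:

import torch

# Stand-ins so the sketch runs on its own; swap in the repo's objects.
model = torch.nn.Linear(3, 3).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loader = [torch.randn(8, 2048, 3).cuda() for _ in range(8)]  # micro-batches of 8

accum_steps = 4  # 4 micro-batches of 8 -> effective batch size 32
optimizer.zero_grad()
for step, points in enumerate(loader):
    pred = model(points)
    loss = pred.pow(2).mean()        # placeholder loss; the repo's loss_func goes here
    (loss / accum_steps).backward()  # scale so accumulated gradients match one batch of 32
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
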