Out of memory, can not train on 11G machine

TRAILab / CaDDN

Categorical Depth Distribution Network for Monocular 3D Object Detection (CVPR 2021 Oral)

Apache License 2.0


Out of memory, can not train on 11G machine #39

Closed DuZzzs closed 3 years ago

DuZzzs commented 3 years ago

Hi, I set batch_size = 2, but I cannot train resnet50 on a 2080 Ti because it runs out of memory. Is there any way to reduce the memory usage of the code? I tried reducing the number of blocks in resnet50, but then loss=nan and I get this error:

WARNING:root:NaN or Inf found in input tensor.

Thank you.

codyreading commented 3 years ago

Hello!

You could also increase the voxel size, which increases the dimensions of each voxel and therefore reduces the total number of voxels/BEV cells in the voxel/BEV grid. The voxel size is set at https://github.com/TRAILab/CaDDN/blob/master/tools/cfgs/kitti_models/CaDDN.yaml#L12.

Note that the number of cells in each dimension of the BEV grid (H, W) needs to be divisible by 4 in order to be processed by the BEV backbone.

EDIT: I realized that this was 8 and not 4
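As a rough illustration of why a larger voxel size helps, here is a minimal sketch. It assumes the point cloud range stays fixed and that all three voxel dimensions are scaled by the same factor; `grid_reduction` is a hypothetical helper, not part of the CaDDN code. It only counts grid cells, so the actual GPU memory saving will differ.

```python
def grid_reduction(old_voxel_size, new_voxel_size):
    """Estimate how many times fewer cells the grids contain after a resize.

    Assumes a fixed point cloud range and the same voxel size along x, y, z.
    Returns (bev_factor, voxel_factor): the BEV grid is 2D, so it shrinks
    with the square of the ratio; the voxel grid is 3D, so it shrinks with
    the cube of the ratio.
    """
    ratio = new_voxel_size / old_voxel_size
    return ratio ** 2, ratio ** 3


if __name__ == "__main__":
    # 0.16 is the voxel size used in the H calculation later in this thread.
    bev_factor, voxel_factor = grid_reduction(0.16, 0.64)
    print(f"BEV cells reduced by about {bev_factor:.0f}x")
    print(f"Voxels reduced by about {voxel_factor:.0f}x")
```

The new voxel size still has to satisfy the divisibility constraint on the BEV grid.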

DuZzzs commented 3 years ago

@codyreading When I set VOXEL_SIZE: [0.64, 0.64, 0.64], error:

x = torch.cat(ups, dim=1)
RuntimeError: invalid argument 0: Sizes of tensors must match except in dimension 1. Got 48 and 47 in dimension 2 at /pytorch/aten/src/THC/generic/THCTensorMath.cu:71

Am I missing something? Thank you.

DuZzzs commented 3 years ago

The number should be divisible by 30.08. This time it runs normally. Thank you very much.

codyreading commented 3 years ago

No problem! Closing this issue then. Also FYI, expect a degradation in performance with an increased voxel size.

DuZzzs commented 3 years ago

Once I understand the code, I will try to port mobilenetv3 from torchvision's deeplabv3. Thank you again.

IssamLaradji commented 3 years ago

what numbers are divisible by 30.08?

codyreading commented 3 years ago

The number of cells in each dimension of the BEV grid (H, W) needs to be divisible by 8 in order to be processed by the BEV backbone, due to the downsample/upsample by 8. Take the height H of the current BEV grid:

H = (Y_max - Y_min) / voxel_size_y = (30.08 - (- 30.08)) / 0.16 = 376
H / 8 = 376 / 8 = 47

As long as that final number is a whole number, this should be able to be processed by the BEV backbone.

For example

H = (Y_max - Y_min) / voxel_size_y = (30 - (- 30)) / 0.5 = 120
H / 8 =  120 / 8 = 15
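The same check can be scripted before launching a training run. This is a minimal sketch, not code from the CaDDN repository: `cells_along_axis` and `is_valid_voxel_size` are hypothetical helpers, and the factor of 8 is taken from the explanation above. `Fraction` is used so that values such as 60.16 / 0.16 are computed exactly instead of picking up floating-point error.

```python
from fractions import Fraction


def cells_along_axis(axis_min, axis_max, voxel_size):
    """Return the exact number of grid cells along one axis as a Fraction."""
    # Converting via str() keeps the decimal value as written (e.g. "30.08").
    extent = Fraction(str(axis_max)) - Fraction(str(axis_min))
    return extent / Fraction(str(voxel_size))


def is_valid_voxel_size(axis_min, axis_max, voxel_size, factor=8):
    """True if the cell count is a whole number divisible by `factor`."""
    cells = cells_along_axis(axis_min, axis_max, voxel_size)
    return cells.denominator == 1 and cells % factor == 0


if __name__ == "__main__":
    # Y range of +/-30.08 as in the calculation above; candidate voxel sizes
    # are the ones mentioned in this thread.
    for size in (0.16, 0.64, 0.94):
        cells = cells_along_axis(-30.08, 30.08, size)
        ok = is_valid_voxel_size(-30.08, 30.08, size)
        print(f"voxel_size={size}: H={float(cells):g}, valid={ok}")
```

With the Y range of -30.08 to 30.08, voxel sizes of 0.16 and 0.94 pass this check, while 0.64 gives H = 94, which is not divisible by 8. The same check should be applied to the other BEV dimension (W) using its own range.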

czy341181 commented 3 years ago

@codyreading @DuZzzs Hi, could you tell me how to change the setting? I changed the voxel size to [1.0, 1.0, 1.0] but I still run out of memory.

DuZzzs commented 3 years ago

@codyreading I set VOXEL_SIZE: [0.94, 0.94, 0.94], but when training reaches epoch 6, the program reports an error.

codyreading commented 3 years ago

@czy341181 How much GPU memory do you have? And what batch size are you using? A voxel size of [1.0, 1.0, 1.0] should not consume too much memory with a lower batch size.

codyreading commented 3 years ago

@DuZzzs What error are you reporting?

czy341181 commented 3 years ago

My GPU has 11178MB. When I set the voxel_size to [0.94, 0.94, 0.94], it runs for a while and then runs out of memory.

DuZzzs commented 3 years ago

@codyreading I didn't record that error; it was probably a cuDNN error. A cuDNN error also appeared when I used the Docker image corresponding to OpenPCDet's spconv1.0. I think my environment is inconsistent with the official documentation. I will look into this cuDNN error later. Thank you very much.

123456789live commented 2 years ago

May I ask what method you ended up using to reduce the memory usage? @czy341181