DerrickXuNu / OpenCOOD

[ICRA 2022] An open-source framework for cooperative detection. Official implementation of OPV2V.
https://mobility-lab.seas.ucla.edu/opv2v/

CUDA out of memory #39

Closed zllxot closed 2 years ago

zllxot commented 2 years ago

Hi! I train the voxelnet intermediate model following the default settings in the config file (e.g., voxelnet_intermediate_fusion.yaml), but after every few epochs the program is interrupted by a "CUDA out of memory" error. (The code is run on a single RTX 3090 Ti GPU.)

(screenshots of the CUDA out-of-memory traceback)
DerrickXuNu commented 2 years ago

Hi, you may want to change the batch size to 1 to train.
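For context, in OpenCOOD-style yaml configs the batch size sits under a training-parameters block; a minimal sketch of the change (exact field names assumed from the repo's config layout, so verify against your own voxelnet_intermediate_fusion.yaml):

```yaml
# voxelnet_intermediate_fusion.yaml (fragment; field names assumed)
train_params:
  batch_size: 1   # drop from the default so one sample per step fits on a single GPU
```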

zllxot commented 2 years ago

> Hi, you may want to change the batch size to 1 to train.

Thank you, I will try it. In addition, I found that some parameters in the config file config.yaml of the pretrained model you provided are slightly different from those in voxelnet_intermediate_fusion.yaml, such as cav_lidar_range. May I ask which config file should I follow to reproduce the results reported in the paper, and is it convenient to provide more details on the training of this model? Thanks!

(screenshot of the config differences)
DerrickXuNu commented 2 years ago

I think both should be fine; the results are easy to reproduce. But following the pretrained model's parameters should be better.

zllxot commented 2 years ago

> I think both should be fine; the results are easy to reproduce. But following the pretrained model's parameters should be better.

OK, thank you very much!

zllxot commented 2 years ago

The default number of epochs in the config file voxelnet_intermediate_fusion.yaml is 30, but I have trained for 60 epochs and the best test result is:

AP@0.5: 0.892, AP@0.7: 0.826 in Default; AP@0.5: 0.854, AP@0.7: 0.750 in Culver City.

The results reported in the paper are:

AP@0.5: 0.906, AP@0.7: 0.864 in Default; AP@0.5: 0.854, AP@0.7: 0.775 in Culver City.

Is this normal? Should I increase the number of epochs and continue training?

DerrickXuNu commented 2 years ago

Which epoch did you pick? The last epoch, or the one with the lowest loss on the validation set?

zllxot commented 2 years ago

> Which epoch did you pick? The last epoch, or the one with the lowest loss on the validation set?

I picked the epoch according to validation loss; below are the validation losses I saved during training:

(plots of the saved validation loss)

I tested epochs 6, 13, 22, 40, 43, 50, 55, 58, and 60; the best checkpoint is epoch 55.
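The selection step above can be sketched as a simple minimum over the saved per-epoch validation losses (the loss values below are hypothetical, not from this run):

```python
# Hypothetical per-epoch validation losses saved during training.
val_loss = {6: 1.92, 13: 1.75, 22: 1.63, 40: 1.52, 55: 1.41, 60: 1.48}

# Pick the checkpoint whose validation loss is lowest.
best_epoch = min(val_loss, key=val_loss.get)
print(best_epoch)  # → 55
```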

DerrickXuNu commented 2 years ago

Your validation loss doesn't look very stable. Can you train more epochs and see what's going on? Previously I didn't train for so many epochs.

zllxot commented 2 years ago

> Your validation loss doesn't look very stable. Can you train more epochs and see what's going on? Previously I didn't train for so many epochs.

OK, I'll try. Thanks.

DerrickXuNu commented 2 years ago

Another question: did you try the pretrained checkpoint that I provided? Can you get the same result?

zllxot commented 2 years ago

> Another question: did you try the pretrained checkpoint that I provided? Can you get the same result?

Yes, I got the same results as reported in the paper by directly testing the pretrained model:

Default:

(screenshot of the Default test results)

Culver City:

(screenshot of the Culver City test results)
zllxot commented 2 years ago

Hi! I tested epoch 98 and got the same results as in the paper. By the way, how do I add compression for fine-tuning? I tried setting the compression flag in config.yaml and running the training, but the following error occurred:

(screenshot of the error)
DerrickXuNu commented 2 years ago

When you load the model, did you set `strict=False`?

zllxot commented 2 years ago

> When you load the model, did you set `strict=False`?

Sorry, I missed that; thanks for the reminder. I would also like to ask how many epochs of fine-tuning are needed?
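The `strict=False` load mentioned above can be sketched with a toy pair of models (the `compressor` layer here is a stand-in for the new compression module, not OpenCOOD's real definition):

```python
import torch.nn as nn

# Toy stand-in for the pretrained backbone, without compression layers.
class Backbone(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(8, 8)

# The same backbone with an extra (hypothetical) compression layer added,
# mimicking turning on the compression flag in config.yaml.
class BackboneWithCompression(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(8, 8)
        self.compressor = nn.Linear(8, 4)  # new, untrained layer

pretrained = Backbone()
model = BackboneWithCompression()

# strict=False tolerates keys that exist in only one of the two state dicts,
# so the new compression layer simply keeps its fresh initialization.
missing, unexpected = model.load_state_dict(pretrained.state_dict(), strict=False)
print(missing)  # parameters present in model but absent from the checkpoint
```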

DerrickXuNu commented 2 years ago

I think 2-3 should be fine.

zllxot commented 2 years ago

> I think 2-3 should be fine.

OK, I'll try it, thanks again!

zllxot commented 2 years ago

After fine-tuning for 2 epochs on top of epoch 98, I found that the test results for epoch 100 were particularly low (AP@0.5: 0.586, AP@0.7: 0.251 on Culver City), so I continued fine-tuning for several more epochs; the validation loss is as follows:

(plot of the validation loss)

I also tested epoch 117 on Culver City; the results were AP@0.5: 0.807 and AP@0.7: 0.621, still significantly lower than the expected results.

DerrickXuNu commented 2 years ago

I feel your training is slower than mine for some reason... In this case, I suggest continuing training. Also, you may not want to set the learning rate too low; that may make this fine-tuning unstable.
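A hedged sketch of what "not too low" might look like when rebuilding the optimizer for fine-tuning (the model, learning rate, and schedule below are illustrative only, not the paper's settings):

```python
import torch
import torch.nn as nn

# Toy model standing in for the pretrained network being fine-tuned.
model = nn.Linear(8, 2)

# Keep the fine-tuning learning rate moderate rather than tiny, and decay
# it in steps so early fine-tuning can still move the weights.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[10, 20], gamma=0.1
)
```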

zllxot commented 2 years ago

Thanks for your suggestion.

zllxot commented 2 years ago

> I feel your training is slower than mine for some reason... In this case, I suggest continuing training. Also, you may not want to set the learning rate too low; that may make this fine-tuning unstable.

I tried changing the learning rate many times but still can't get the same results as in the paper after fine-tuning for 2-3 epochs (even when starting from the pretrained, non-fine-tuned model you provided). Could you please share more details about the fine-tuning? Thank you!

DerrickXuNu commented 2 years ago

Then fine-tune for more epochs. It was pretty easy for me to get the results without any special technique.

zllxot commented 2 years ago

> Then fine-tune for more epochs. It was pretty easy for me to get the results without any special technique.

OK.

zllxot commented 2 years ago

After fine-tuning for about 40 epochs, I got results similar to those in the paper. I guess the learning rate was set too low before. Thank you again for your warm assistance!

DerrickXuNu commented 2 years ago

Glad to help you solve the problem

eyabesbes commented 6 months ago

> Hi, you may want to change the batch size to 1 to train.

Hi, I'm also facing the same problem. I have already changed the batch size to 1, but I'm still getting CUDA out of memory. Do you have any idea how I can solve this? I guess it's probably because a 6 GB GPU is recommended but I only have a 4 GB GPU. I tried to run this on Google Colab, but I'm also facing a problem with the .pyx file when executing `python opencood/utils/setup.py build_ext --inplace`:

```
running build_ext
building 'opencood.utils.box_overlaps' extension
x86_64-linux-gnu-gcc -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -I/usr/local/lib/python3.7/dist-packages/numpy/core/include -I/usr/include/python3.7m -c opencood/utils/box_overlaps.c -o build/temp.linux-x86_64-3.7/opencood/utils/box_overlaps.o
opencood/utils/box_overlaps.c:29:10: fatal error: Python.h: No such file or directory
   29 | #include "Python.h"
      |          ^~~~~~~~~~
compilation terminated.
error: command 'x86_64-linux-gnu-gcc' failed with exit status 1
```
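For the `Python.h` error specifically: that header comes from the CPython development package, which is not always installed for every interpreter version; installing the matching `-dev` package before rebuilding usually resolves it (the package name below is a guess that must match your Python version):

```shell
# Install the CPython development headers that provide Python.h,
# then retry the Cython extension build.
sudo apt-get update
sudo apt-get install -y python3-dev   # e.g. python3.7-dev for Python 3.7
python opencood/utils/setup.py build_ext --inplace
```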