zllxot closed this issue 2 years ago.
Hi, you may want to change the batch size to 1 to train.
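In OpenCOOD-style YAML configs, the batch size typically sits under the training parameters; as a sketch (the exact key names may differ in your config version):

```yaml
train_params:
  batch_size: 1   # reduce to 1 so training fits in GPU memory
```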
Thank you, I will try it.
In addition, I found that some parameters in the config file config.yaml of the pretrained model you provided are slightly different from those in voxelnet_intermediate_fusion.yaml, such as cav_lidar_range. May I ask which config file I should follow to reproduce the results reported in the paper, and could you provide more details on the training of this model? Thanks!
I think both should be fine; the results are easy to reproduce. But following the pretrained model's parameters should be better.
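For context, cav_lidar_range defines the LiDAR cropping range as [x_min, y_min, z_min, x_max, y_max, z_max]; a sketch with illustrative values (check the pretrained model's config.yaml for the actual ones):

```yaml
preprocess:
  cav_lidar_range: [-140.8, -40, -3, 140.8, 40, 1]  # illustrative values only
```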
OK, thank you very much!
The default number of epochs in the config file voxelnet_intermediate_fusion.yaml is 30, but I have trained for 60 epochs, and the best test result is:
AP@0.5: 0.892, AP@0.7: 0.826 in Default; AP@0.5: 0.854, AP@0.7: 0.750 in Culver City.
The results reported in the paper are:
AP@0.5: 0.906, AP@0.7: 0.864 in Default; AP@0.5: 0.854, AP@0.7: 0.775 in Culver City.
Is this normal? Should I increase the number of epochs and continue training?
Which epoch did you pick? The last epoch, or the one with the lowest loss on the validation set?
I pick the epoch according to validation loss; below is the validation loss I saved during training:
I tested epochs 6, 13, 22, 40, 43, 50, 55, 58 and 60 respectively; the best checkpoint is epoch 55.
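The checkpoint selection described above (lowest validation loss) can be sketched as follows; the loss values here are made up for illustration:

```python
# Hypothetical validation losses recorded at the tested epochs (not real numbers).
val_loss = {6: 1.42, 13: 1.31, 22: 1.25, 40: 1.19, 43: 1.21,
            50: 1.16, 55: 1.12, 58: 1.14, 60: 1.15}

# Pick the epoch whose saved checkpoint had the lowest validation loss.
best_epoch = min(val_loss, key=val_loss.get)
print(best_epoch)  # → 55
```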
Your validation loss does not look very stable. Can you train more epochs and see what's going on? Previously I didn't train that many epochs.
OK, I'll try. Thanks.
Another question: did you try the pretrained checkpoint that I provided? Did you get the same results?
Yes, I got the same results as reported in the paper by directly testing the pretrained model:
Default:
Culver City:
Hi! I tested epoch 98 and got the same results as in the paper. By the way, how do I add compression for fine-tuning? I tried to modify the compression flag in config.yaml and run the training, but the following error occurred:
When you load the model, did you set strict=False?
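Setting strict=False makes load_state_dict skip mismatched keys (here, the newly added compression module that the old checkpoint does not contain) instead of raising an error. A minimal sketch with toy models, not the repo's actual classes:

```python
import torch.nn as nn

# Toy stand-in for the pretrained model, without a compression module.
old_model = nn.Sequential(nn.Linear(4, 8))

# New model with an extra (randomly initialized) layer standing in for compression.
new_model = nn.Sequential(nn.Linear(4, 8), nn.Linear(8, 8))

# strict=False ignores keys that do not match instead of raising a RuntimeError.
result = new_model.load_state_dict(old_model.state_dict(), strict=False)
print(result.missing_keys)  # the extra layer's weights remain untrained
```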
Sorry, I missed this; thanks for the reminder. I would also like to ask how many epochs of training are needed for fine-tuning?
I think 2-3 epochs should be fine.
OK, I'll try it, thanks again!
After fine-tuning for 2 epochs on top of epoch 98, I found that the test results of epoch 100 were particularly low (AP@0.5: 0.586, AP@0.7: 0.251 on Culver City), so I continued to fine-tune for several more epochs; the validation loss is as follows:
I also tested epoch 117 on Culver City and got AP@0.5: 0.807 and AP@0.7: 0.621, which is still significantly lower than the expected results.
I feel your training is slower than mine for some reason... In this case, I suggest continuing training. Also, you may not want to set the learning rate too low; it may make this fine-tuning unstable.
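The fine-tuning learning rate is just the optimizer's lr; a generic PyTorch sketch (the model and the 1e-3 value are illustrative, not the repo's recommended setting):

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 2)  # stand-in for the pretrained detector

# Too small an lr makes fine-tuning crawl; too large can be unstable.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
print(optimizer.param_groups[0]["lr"])  # → 0.001
```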
Thanks for your suggestion.
I tried changing the learning rate many times but still can't get the same results as in the paper after fine-tuning for 2~3 epochs (even when using the pre-trained model you provided without further fine-tuning). Could you please share more details about the fine-tuning? Thank you!
Then fine-tune for more epochs. It was pretty easy for me to get the results without any special technique.
OK.
After fine-tuning for about 40 epochs, I got results similar to those in the paper. I guess the learning rate was set too low before. Thank you again for your warm assistance!
Glad to help you solve the problem
Hi,
I'm also facing the same problem. I have already changed the batch size to 1, but I'm still getting CUDA out of memory.
Do you have any idea how I can solve this?
I guess it's probably because a 6 GB GPU is recommended, but I only have a 4 GB GPU.
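When even batch size 1 overflows a small GPU, one general PyTorch option is automatic mixed precision, which roughly halves activation memory. This is a generic sketch (demonstrated on CPU with bfloat16 so it runs anywhere), not something this repo necessarily supports out of the box:

```python
import torch
import torch.nn as nn

model = nn.Linear(16, 4)
x = torch.randn(2, 16)

# autocast runs eligible ops in reduced precision, cutting activation memory.
# On CUDA you would use device_type="cuda" together with torch.cuda.amp.GradScaler.
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    y = model(x)

print(y.dtype)
```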
I have tried to run this on Google Colab, but I'm also facing a problem with the .pyx file when executing:
```shell
python opencood/utils/setup.py build_ext --inplace
```
```
running build_ext
building 'opencood.utils.box_overlaps' extension
x86_64-linux-gnu-gcc -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -I/usr/local/lib/python3.7/dist-packages/numpy/core/include -I/usr/include/python3.7m -c opencood/utils/box_overlaps.c -o build/temp.linux-x86_64-3.7/opencood/utils/box_overlaps.o
opencood/utils/box_overlaps.c:29:10: fatal error: Python.h: No such file or directory
   29 | #include "Python.h"
      |          ^~~~~~~~~~
compilation terminated.
error: command 'x86_64-linux-gnu-gcc' failed with exit status 1
```
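The "Python.h: No such file or directory" error means the CPython development headers are missing; on Debian-based systems (including Colab) they can usually be installed via apt before rebuilding. The package name must match your Python version (python3.7-dev here is an assumption based on the paths in the log):

```shell
# Install the CPython development headers (they provide Python.h), then rebuild.
sudo apt-get update
sudo apt-get install -y python3.7-dev
python opencood/utils/setup.py build_ext --inplace
```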
Hi! I trained the voxelnet intermediate model following the default settings in the config file (voxelnet_intermediate_fusion.yaml), but after every several epochs the program is interrupted by "CUDA out of memory". (The code runs on a single RTX 3090 Ti GPU.)
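OOM that appears only after several epochs is often memory accumulated by evaluation code that keeps autograd state alive; wrapping validation in torch.no_grad() is a general PyTorch pattern worth checking (a sketch, not a confirmed fix for this repo):

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 2)  # stand-in for the detector
data = torch.randn(4, 8)

# During validation, disable autograd so activations are not retained for backward.
model.eval()
with torch.no_grad():
    out = model(data)

print(out.requires_grad)  # → False: no graph is kept, so memory is freed promptly
```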