Atten4Vis / ConditionalDETR

This repository is an official implementation of the ICCV 2021 paper "Conditional DETR for Fast Training Convergence" (https://arxiv.org/abs/2108.06152).
Apache License 2.0

How to use resume correctly? #12

Closed: xziyh closed this issue 2 years ago

xziyh commented 2 years ago

Hello, my PC can't train continuously, so I run about three epochs a day, roughly 12 hours. I set `--resume` to checkpoint.pth and left the other parameters unchanged, but the results don't look good: after running more than a dozen epochs, they are almost the same as in the first few. So I want to ask how to use resume correctly. When I stop after finishing an epoch, do I need to adjust the learning rate when I continue the next time? Should `--start_epoch` be set to the last checkpoint?
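
For reference, here is my understanding of the resume logic, a sketch based on upstream DETR's main.py (I am assuming this repo keeps the same checkpoint keys; `args`, `model_without_ddp`, `optimizer`, and `lr_scheduler` are the objects built earlier in main.py):

```python
import torch

# A checkpoint saved by the training loop stores the optimizer and LR
# scheduler state alongside the model weights, so resuming restores the
# learning rate schedule and the epoch counter automatically.
checkpoint = torch.load(args.resume, map_location='cpu')
model_without_ddp.load_state_dict(checkpoint['model'])
if not args.eval and 'optimizer' in checkpoint and 'lr_scheduler' in checkpoint and 'epoch' in checkpoint:
    optimizer.load_state_dict(checkpoint['optimizer'])
    lr_scheduler.load_state_dict(checkpoint['lr_scheduler'])
    args.start_epoch = checkpoint['epoch'] + 1  # no need to pass --start_epoch by hand
```

If that reading is right, `--start_epoch` should not need to be set manually when resuming.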

DeppMeng commented 2 years ago

Hi, could you provide the logs from your initial training run and your resumed runs, as well as your full training config, so that we can better locate the problem?

xziyh commented 2 years ago

Thanks for your reply. Are you referring to the full training log file? How should I upload it? Here are the training results:

[screenshot: results at epoch 8]

[screenshot: results at the latest epoch, 15]

DeppMeng commented 2 years ago

Yes, I mean the full training log file. You can send the files to my email: mdpustc@gmail.com.

From your screenshots, the log is clearly abnormal. We provide our training log here; you can check it for comparison. What are the loss and AP in the first few epochs? The AP at epoch 1 should be around 5. If you get 0 AP for all epochs, I think the problem is not related to resume but to other factors. So it would be best if you could provide your full training config, including batch size, number of GPUs, PyTorch version, CUDA version, etc. I hope you find this helpful.
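
By the way, a quick way to report the versions is `python -m torch.utils.collect_env`, or a short snippet like this (standard PyTorch calls, nothing repo-specific):

```python
import torch, torchvision

# prints the environment info requested above
print('PyTorch:', torch.__version__)
print('torchvision:', torchvision.__version__)
print('CUDA (build):', torch.version.cuda)
print('GPU:', torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'none')
```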

xziyh commented 2 years ago

I use one 3080 Ti for training; the PyTorch version is 1.10.0, the batch size is 1, and the CUDA version is 11.3.1. I have sent my log file to your email. Thanks again for your kind reply.

DeppMeng commented 2 years ago

I read your log. Apparently the model is not being trained properly; the loss and AP at epoch 1 are not normal. I guess the reason might be that the batch size is too small, which can make training unstable. For our training setting, we use 8x V100 GPUs with a total batch size of 8x2=16. You need this setting to reproduce our results. We did not conduct any experiments with a batch size smaller than 8.
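
For reference, the launch command for that setting would look roughly like this (flags assumed from upstream DETR's main.py; the paths are placeholders):

```
python -m torch.distributed.launch --nproc_per_node=8 --use_env main.py \
    --batch_size 2 --coco_path /path/to/coco --output_dir output/conddetr
```

With 8 processes and a per-GPU batch size of 2, the effective batch size is 16.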

xziyh commented 2 years ago

Thanks for your kind reply. I have set the batch size to 2 now to see how the results turn out.