Hi, it really confuses me... We have trained several times on classic PASCAL VOC 2012 and it was very stable. Could you provide your training log? Or, could you use 8 GPUs with batch_size=2 per GPU and change the lr to 0.001? I'm not sure why you got unexpected results :(
The training log I got is as follows. In this experiment, I used the config file mentioned before. seg_20220318_163031.txt Moreover, because of limited computation resources, I can't get 8 GPUs for these reproduction experiments; for now I can only use 4x 3090 cards. Thanks for your help.
The best performance is in epoch 10, which is quite weird...
Did you change the random seed for the 3 different runs?
emm, I haven't changed the random seed; all experiments use the default seed 2. The only parameters I have changed are the batch size, learning rate, training process port, and dataset directory. No other parameters have been changed.
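(For reference, pinning the seed in a PyTorch training script usually looks like the minimal sketch below; the helper name is hypothetical and may differ from what this repo actually does.)

```python
import random

import numpy as np
import torch


def set_seed(seed: int = 2) -> None:
    """Hypothetical helper: pin all common sources of randomness."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Optional: trade training speed for deterministic cuDNN kernels.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
```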
Hi @YanFangCS, we have retrained our model on 4 V100s with batch_size=16 and lr=0.001; here is our training log. I am not sure why you cannot reproduce the results... :( Maybe batch_size=16 and lr=0.001 are two important parameters?
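(To make the batch-size bookkeeping in this thread explicit, here is a small illustrative sketch; the numbers are taken from the settings discussed above.)

```python
def global_batch_size(per_gpu_batch: int, num_gpus: int) -> int:
    """Total images contributing to one optimizer step across all GPUs."""
    return per_gpu_batch * num_gpus


# The configurations discussed in this thread all target a global batch of 16:
print(global_batch_size(2, 8))  # 8 GPUs x 2 images -> 16
print(global_batch_size(4, 4))  # 4 GPUs x 4 images -> 16
# The runs that failed to reproduce used 8 images in total with lr=0.0005,
# i.e. half of both the global batch size and the learning rate.
```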
Thanks for your help, I will try a few more times to reproduce it. And thanks for your work introducing this new perspective on unreliable pseudo-labels.
By the way, I am wondering why the total number of training iterations is 45600 when epochs is set to 80. That means there are 570 iterations per epoch, but the supervised dataset size is 1464, the unsupervised dataset size is 9118, and the batch size is 4 per GPU (16 in total). It's quite weird.
emm, I see, you calculate epochs according to the unsupervised dataset size. This calculation is the same as what AEL does.
Yes, an epoch is defined as the number of iterations needed for the model to be trained on all unsupervised images once.
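(Spelling out the arithmetic with the numbers from this thread, assuming the last partial batch is kept:)

```python
import math

n_unsup = 9118     # unlabeled images in this VOC split (1464 labeled images)
global_batch = 16  # e.g. 4 GPUs x 4 images per GPU
epochs = 80

iters_per_epoch = math.ceil(n_unsup / global_batch)  # 570
total_iters = iters_per_epoch * epochs               # 45600
print(iters_per_epoch, total_iters)
```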
I have reproduced the results as the paper declares. I solved this problem by using batch size 16 and lr 0.001 with torch.cuda.amp, which is similar to apex. With amp it consumes about 15 GB of CUDA memory on an RTX 3090, which is affordable. So I think the batch size and lr are essential for reproducing this paper. The model can't be successfully trained with half the batch size and lr, which still confuses me a lot. Thanks for your help.
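(For anyone else hitting memory limits, the torch.cuda.amp pattern mentioned above roughly looks like the sketch below; the model, loss, and loader here are stand-ins, not the repo's actual training loop.)

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 21, kernel_size=1).cuda()  # stand-in for the real segmentation network
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
criterion = nn.CrossEntropyLoss(ignore_index=255)
scaler = torch.cuda.amp.GradScaler()

for images, targets in loader:  # `loader` is a placeholder DataLoader
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():  # forward pass and loss in mixed precision
        loss = criterion(model(images.cuda()), targets.cuda())
    scaler.scale(loss).backward()  # scale the loss to avoid fp16 gradient underflow
    scaler.step(optimizer)         # unscales gradients; skips the step on inf/nan
    scaler.update()
```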
Hi, I tried to reproduce the results reported in your paper but couldn't reach them. Because of limited computation resources, I used batch size 8 and learning rate 0.0005, which are half of the values in your paper.
When trying to reproduce the "full" method with the hyperparameters mentioned before, I only reached 77.12 (average over 3 runs). Could you give me some advice on reproducing your method? Thanks. The config file I used in the reproduction experiments is as follows. Besides, the annotation files were obtained as you described in your repo.