It is weird. Let me retrain the network.
Thanks @cherubicXN!
I remember that the sAP rapidly increases to 60+ in the very early stage. Did you check the sAP10 at epochs 10, 15, 20, and 25?
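For context on the metric: sAP^τ counts a predicted segment as a true positive when the sum of squared distances between its endpoints and those of a still-unmatched ground-truth segment is below τ (τ = 10 for sAP10), and then computes average precision over the score-ranked predictions. A compact NumPy sketch of that definition, not the repo's evaluation code:

```python
import numpy as np

def sap(pred_lines, scores, gt_lines, threshold=10.0):
    """Sketch of structural AP; pred_lines/gt_lines are (N, 2, 2) and
    (M, 2, 2) arrays of segment endpoints, scores is (N,) confidences."""
    order = np.argsort(-scores)            # rank predictions by score
    pred = pred_lines[order]
    matched = np.zeros(len(gt_lines), dtype=bool)
    tp = np.zeros(len(pred))
    for i, seg in enumerate(pred):
        # Sum of squared endpoint distances to every GT segment,
        # taking the better of the two endpoint orderings.
        d = np.minimum(
            ((seg - gt_lines) ** 2).sum(axis=(1, 2)),
            ((seg[::-1] - gt_lines) ** 2).sum(axis=(1, 2)),
        )
        j = int(np.argmin(d))
        if d[j] < threshold and not matched[j]:
            matched[j] = True              # each GT matches at most once
            tp[i] = 1.0
    cum_tp = np.cumsum(tp)
    recall = np.concatenate(([0.0], cum_tp / max(len(gt_lines), 1)))
    precision = np.concatenate(([1.0], cum_tp / np.arange(1, len(pred) + 1)))
    # Trapezoidal approximation of the area under the PR curve.
    return float(np.trapz(precision, recall))
```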
Here are the sAP10 evaluations from my trained models:

- model_00001.pth: 49.0
- model_00005.pth: 57.0
- model_00010.pth: 58.5
- model_00015.pth: 58.0
- model_00020.pth: 58.6
- model_00025.pth: 58.0

Also note that the .png files I'm using were created with https://github.com/zhou13/lcnn/blob/master/dataset/wireframe.py, since your provided images are .jpg files.
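If anyone only wants to sanity-check the .jpg-to-.png conversion step itself (wireframe.py does the full dataset preprocessing, not just this), a minimal Pillow sketch with made-up directory names:

```python
from pathlib import Path
from PIL import Image

# Hypothetical paths; adjust to the actual dataset layout.
src_dir = Path("wireframe/images-jpg")
dst_dir = Path("wireframe/images-png")
dst_dir.mkdir(parents=True, exist_ok=True)

for jpg in sorted(src_dir.glob("*.jpg")):
    # Normalize to RGB (guards against grayscale/CMYK inputs),
    # then save losslessly as PNG.
    Image.open(jpg).convert("RGB").save(dst_dir / (jpg.stem + ".png"))
```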
Ok, I am training the network again. It may take 12 hours. I am watching the training logs now.
The sAP log indicates that the network achieves its best sAP around epoch 10, and the remaining training epochs seem to be useless. That's really weird.
After checking the previous model weights, I suspect there may be a bug in the training code. The sAP should be greater than 60 after 10 epochs of training and approach 63.0 after 25 epochs. After the learning rate decays, sAP10 should dramatically increase to 66+.
@cherubicXN Interesting, thanks for your insights, and I hope you can track down the bug!
If you need more GPU machines to do more experiments, please let me know. I can train the model for you to see if I could reproduce the results.
Thanks very much :). I have enough GPU machines and the training time is not too long. Maybe I made some mistakes when I was refactoring the code.
@alwc, I think I have found the bug. It is caused by incorrect use of the lr_scheduler in train.py. At line 116 of train.py, I mistakenly called the learning rate scheduler as `scheduler.step(epoch)`, which makes the learning rate decay right after the 1st epoch of training. The correct call is `scheduler.step()`.
You can check the training log on your machine to see whether the learning rate decayed to 4e-5 after the 1st epoch of training.
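For anyone hitting the same thing, here is a minimal sketch of the two calls side by side; the model, loop, and milestone value are illustrative placeholders, not the actual train.py:

```python
import torch

# Illustrative values: lr 4e-4 decaying by 10x, as discussed above.
model = torch.nn.Linear(8, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=4e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[25], gamma=0.1
)

for epoch in range(30):
    # ... one epoch of training (forward/backward/optimizer.step()) ...

    # Buggy: passing `epoch` overrides the scheduler's internal counter,
    # and depending on the PyTorch version and how `epoch` is counted it
    # can fire the milestone at the wrong time (in this thread, the lr
    # dropped to 4e-5 after the 1st epoch). The argument is deprecated
    # in recent PyTorch for exactly this reason.
    # scheduler.step(epoch)

    # Correct: let the scheduler keep its own epoch count.
    scheduler.step()
    print(epoch, optimizer.param_groups[0]["lr"])
```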
I think you are right. Looking at my old training log, at the beginning of `epoch: 2` the `lr` is `0.000040`, while in the new training log with the scheduler bug fix it is `0.0004`. Thanks @cherubicXN!

Here are the sAP10 evaluations for the first 5 epochs from the new model:

- model_00001.pth: 48.4
- model_00002.pth: 51.9
- model_00003.pth: 54.2
- model_00004.pth: 56.9
- model_00005.pth: 56.9

Right now the results are roughly the same as the previous model's for the first 5 epochs. I'll update you with the final result once the model is done training.

On a side note, I was surprised to see model_00001.pth has a different result (old 49.0 vs new 48.4). It seems the seed doesn't work properly.
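For what it's worth, a common way to chase down that kind of run-to-run variation is to seed every RNG a PyTorch run touches and disable the nondeterministic cuDNN paths; the helper below is a generic sketch, not the repo's code, and DataLoader workers plus some CUDA kernels can still introduce noise:

```python
import random

import numpy as np
import torch

def seed_everything(seed: int = 0) -> None:
    """Seed the Python, NumPy, and PyTorch RNGs (hypothetical helper)."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Trade speed for reproducibility in cuDNN convolutions.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
```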
It is normal: the model weights in the first epoch are not stable. Today I obtained 49.7 for model_00001.pth. I am also training the network; I hope it works well.
Hi @cherubicXN,

I just completed the training with the bug fixed, and here are the sAP10 results for the final epochs:

- model_00025.pth: 63.3
- model_00026.pth: 65.8
- model_00027.pth: 66.0
  - sAP5: 62.2
  - sAP15: 67.7
- model_00028.pth: 65.8
- model_00029.pth: 65.4
- model_00030.pth: 65.5

The results are much better than before the bug fix, but the best sAP10 is still slightly off (66.0 vs 66.5). I'm not sure whether the discrepancy is due to randomness or other minor issues.
I think it is due to randomness. I also completed the training and obtained an sAP10 of 66.4 at epoch 27. Let me train it again.
Hi @cherubicXN,
After training the model with your given code and data, I tested with
and evaluated with
Note that I'm getting `sAP10.0 = 58.4`, which is much lower than the result stated in the paper (i.e. 66.5). If I run the code above using your provided pre-trained model, I get `sAP10.0 = 66.5`.

FYI, I'm using PyTorch 1.4.0 with Python 3.6, trained on one 2080Ti GPU.
Here are the settings I used:
Log from the last epoch: