snakers4 opened this issue 5 years ago
Hi snakers4, thanks for your advice.
I have checked your repo before. In terms of why it is better: if I just use SGD, it already achieves 65% top-1. The reason might be ReLU. Also, during my training of MnasNet I found that its representation power is a little weak, since the training loss is higher than the test loss (i.e., the model underfits), so I changed the dropout rate from the default 0.5 to 0.0, which indeed boosted performance to 68%.
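Concretely, the change is just the dropout probability on the classifier head; a minimal PyTorch sketch of the idea (the head layout and names here are illustrative, not the exact code of either repo):

```python
import torch.nn as nn

# Illustrative classifier head; dropout_p=0.5 was the default, 0.0 disables it.
def make_classifier(in_features, num_classes, dropout_p=0.0):
    return nn.Sequential(
        nn.Dropout(p=dropout_p),  # with p=0.0 this is effectively an identity
        nn.Linear(in_features, num_classes),
    )

head = make_classifier(1280, 1000)  # 1280 features / 1000 classes, as for MnasNet on ImageNet
```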
I have also tried Adam and RMSprop, but they just do not converge in my case.
> if I just use SGD, it already achieves 65% top-1
You mean just using my model with SGD, or your model?
> so I changed the dropout rate from the default 0.5 to 0.0, which indeed boosted performance to 68%
Interesting, afaik we did not use any dropout at all
> I have also tried Adam and RMSprop, but they just do not converge in my case.
Interesting. Well, anyway, just give AdamW and a larger batch a try =)
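Something along these lines, e.g. with the AdamW that is now built into PyTorch (the model is a stand-in and the hyperparameters are placeholders, not the settings anyone in this thread used):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)  # stand-in for the actual MnasNet model

# AdamW = Adam with decoupled weight decay; lr and weight_decay are placeholders.
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=1e-2)
```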
Also, @Randl trained MobileNetV2 with both Adam and SGD: Adam converged 3x faster, but SGD ended up only about +1 pp better ...
All of this tells me that the newer networks are getting more and more fragile ...
My model + SGD
Agreed, newer nets need to be carefully tuned. I still don't know how the paper got 74%. Maybe a large batch size matters, but currently I don't have that much compute to try it.
We will see what @Randl comments; he has more GPUs now, afaik.
@billhhh I use this code, but the loss does not change. Could you help me solve it?
I've managed to achieve 72+% top-1; however, I also managed to fuck up checkpointing, so there is no checkpoint (yet).
@Randl Wow, that's a pretty good result! Did you use 224 input? How about the other settings: the same as mine, or different?
> anyway, just give AdamW and a larger batch a try =)
Have you solved the problem? I train the network, but the loss does not drop.
Thanks for this repo! I managed to obtain only ~40-45% top-1, while it looks like you achieved ~69%.
Among the major architecture differences, I noticed only ReLU6. Did it boost accuracy, or is it just inherited from MobileNet?
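For reference, ReLU6 is just ReLU clamped at 6, i.e. min(max(x, 0), 6); a quick illustration:

```python
import torch
import torch.nn as nn

x = torch.linspace(-2, 10, 7)  # -2, 0, 2, 4, 6, 8, 10
print(nn.ReLU()(x))   # tensor([ 0.,  0.,  2.,  4.,  6.,  8., 10.])
print(nn.ReLU6()(x))  # tensor([0., 0., 2., 4., 6., 6., 6.])
```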
This would also point me to using the OpenAI AdamW, which is more or less a continuous version of your training regime. It would be interesting if you tried it. It also converges quite quickly.
There is some evidence that, for such models, a batch size of 1000-2000 is preferable =(
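If a batch that large does not fit in memory, gradient accumulation is one way to approximate it; a rough sketch (the model, sizes, and optimizer settings here are illustrative only):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)  # stand-in for the actual network
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
criterion = nn.CrossEntropyLoss()
accum_steps = 8  # e.g. 8 mini-batches of 256 ~ effective batch of 2048

optimizer.zero_grad()
for step in range(accum_steps):
    x = torch.randn(256, 10)         # placeholder mini-batch
    y = torch.randint(0, 2, (256,))  # placeholder labels
    loss = criterion(model(x), y) / accum_steps  # scale so gradients average
    loss.backward()                  # gradients accumulate across the steps
optimizer.step()
```

Note that this only emulates the optimizer's view of a large batch; batch-norm statistics still see the small per-step mini-batches.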