billhhh / MnasNet-pytorch-pretrained

A pytorch pretrained model of MnasNet

Architecture discussions #2

Open snakers4 opened 5 years ago

snakers4 commented 5 years ago

Thanks for this repo! I only managed to obtain ~40-45% top-1, while it looks like you achieved ~69%.

Of the major architecture differences I noticed only ReLU6. Did it boost accuracy, or was it just inherited from MobileNet?
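For context: ReLU6 simply clamps the activation, min(max(x, 0), 6), and is the default in the MobileNet family, partly because the bounded range behaves well under low-precision inference. A minimal sketch of a conv-BN-activation block with either choice; the block layout here is illustrative, not this repo's exact code:

```python
import torch.nn as nn

def conv_bn_act(in_ch, out_ch, stride=1, use_relu6=True):
    """3x3 conv + batch norm + ReLU6 (clamped at 6) or plain ReLU."""
    act = nn.ReLU6(inplace=True) if use_relu6 else nn.ReLU(inplace=True)
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=stride,
                  padding=1, bias=False),
        nn.BatchNorm2d(out_ch),
        act,
    )
```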

Starting from lr 0.1, decayed by a factor of 0.5 every 20 epochs.
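In PyTorch terms that schedule is just `StepLR`. A minimal sketch, assuming plain SGD with momentum; everything except the lr 0.1 and the 20-epoch halving is an illustrative placeholder:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)  # stand-in for the actual MnasNet
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=1e-4)
# halve the learning rate every 20 epochs, as in the quoted regime
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.5)

for epoch in range(100):
    # ... run one training epoch here ...
    scheduler.step()  # lr: 0.1 -> 0.05 at epoch 20 -> 0.025 at epoch 40 -> ...
```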

This also points me toward OpenAI's AdamW, which is more or less a continuous version of your training regime. It would be interesting if you tried it; it also converges quite quickly.

Batch size 256 on 2 K80 GPUs.

There is some evidence that for such models a batch size of 1000-2000 is preferable =(

snakers4 commented 5 years ago

OpenAI AdamW
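For anyone reading now: the decoupled weight decay idea from that post has since landed in PyTorch itself as `torch.optim.AdamW`. A minimal usage sketch with illustrative hyperparameters:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)  # stand-in model
# weight decay is applied directly to the weights, decoupled from the
# Adam gradient update (unlike plain Adam + L2 regularization)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)
```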

billhhh commented 5 years ago

Hi snakers4, thx for your advice.

I have checked your repo before. As for why this one does better: with plain SGD it already achieved 65% top-1, and the reason might be ReLU6. During training of MnasNet I also found its representation power a little weak, since the training loss was higher than the testing loss (i.e. underfitting rather than overfitting), so I changed the dropout rate from the default 0.5 to 0.0, which indeed boosted the performance to 68%.
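A sketch of the kind of change, assuming a pooled-feature classifier head like MobileNet's; the exact head in this repo may differ:

```python
import torch.nn as nn

def classifier_head(feat_dim=1280, num_classes=1000, dropout=0.0):
    """Final classifier on pooled features; dropout=0.5 was the old default."""
    return nn.Sequential(
        nn.Dropout(p=dropout),  # p=0.0 disables dropout entirely
        nn.Linear(feat_dim, num_classes),
    )
```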

I have also tried Adam and RMSprop, but they just would not converge in my case.

snakers4 commented 5 years ago

with plain SGD it already achieved 65% top-1

You mean just using my model with SGD or your model?

so I changed the dropout rate from the default 0.5 to 0.0, which indeed boosted the performance to 68%

Interesting, afaik we did not use any dropout at all

I have also tried Adam and RMSprop, but they just would not converge in my case.

Interesting. Well, anyway, just give AdamW and a larger batch a try =)

snakers4 commented 5 years ago

Also, @Randl trained MobileNetV2 with Adam and SGD; Adam converged 3x faster, but SGD ended up only about +1 pp better ...

All of this tells me that the newer networks are getting more and more fragile ...

billhhh commented 5 years ago

My model + SGD

billhhh commented 5 years ago

Agree with you, newer nets should be carefully tuned. I still don't know how the paper gets 74%. Maybe a large batch size matters, but currently I don't have that much computational power to try it.
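One cheap way around that: gradient accumulation approximates a large batch on few GPUs (up to BatchNorm statistics, which still only see the small per-step batch). A hedged sketch; the names here are illustrative:

```python
import torch

def train_epoch(model, loader, criterion, optimizer, accum_steps=8):
    """One epoch with gradient accumulation.

    Effective batch size = loader batch size * accum_steps,
    but BatchNorm still normalizes over the small per-step batch.
    """
    model.train()
    optimizer.zero_grad()
    for i, (images, targets) in enumerate(loader):
        loss = criterion(model(images), targets)
        (loss / accum_steps).backward()  # scale so accumulated grads average
        if (i + 1) % accum_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
```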

snakers4 commented 5 years ago

We will see what @Randl comments; he has more GPUs now, afaik.

huxianer commented 5 years ago

@billhhh I use this code, but the loss does not change. Could you help me solve it?

Randl commented 5 years ago

I've managed to achieve 72+% top-1; however, I also managed to fuck up checkpointing, so there is no checkpoint (yet).
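Not @Randl's actual code, obviously, but the usual PyTorch pattern for not losing a run looks roughly like this (a minimal sketch):

```python
import torch

def save_checkpoint(path, model, optimizer, epoch, best_top1):
    """Persist everything needed to resume training or evaluate later."""
    torch.save({
        "epoch": epoch,
        "best_top1": best_top1,
        "model_state": model.state_dict(),
        "optimizer_state": optimizer.state_dict(),
    }, path)

def load_checkpoint(path, model, optimizer=None):
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model_state"])
    if optimizer is not None:
        optimizer.load_state_dict(ckpt["optimizer_state"])
    return ckpt["epoch"], ckpt["best_top1"]
```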

billhhh commented 5 years ago

@Randl Wow, that's a pretty good result! Did you use 224 input? What about the other settings: the same as mine, or different?

xi-mao commented 2 years ago

anyway, just give AdamW and a larger batch a try =)

Have you solved the problem? I trained the network, but the loss did not drop.