forresti / SqueezeNet

SqueezeNet: AlexNet-level accuracy with 50x fewer parameters
BSD 2-Clause "Simplified" License

v1.1 loss does not decrease #18

Closed kli-casia closed 7 years ago

kli-casia commented 8 years ago

This is my training log.

https://gist.github.com/kli-nlpr/e0705a0d58a04178b8e6dbe554e7f072

The training loss stays at about 6.9....

I use the same train_val.prototxt and solver.prototxt as yours.

Thanks.
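(Side note: a loss pinned at ~6.9 on 1000-class ImageNet is exactly the cross-entropy of uniform random guessing, -ln(1/1000) ≈ 6.91, i.e. the network has not learned anything yet. A quick check:)

```python
import math

# Cross-entropy loss of a classifier that assigns uniform probability
# 1/K to each of K classes: -ln(1/K) = ln(K).
def uniform_guess_loss(num_classes):
    return math.log(num_classes)

print(round(uniform_guess_loss(1000), 2))  # 6.91 -- the "stuck" value above
```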

forresti commented 8 years ago

Interesting. Has this happened more than once? (Depending on random seed, I find that even AlexNet occasionally doesn't learn.)

The next place I'd look is to make sure that the training data is sane.

kli-casia commented 8 years ago

I think the train/val LMDB datasets are okay:

```
240G    ilsvrc12_train_lmdb/
9.4G    ilsvrc12_val_lmdb/
```

I use create_imagenet.sh to create these files.

I will train the network again using a different seed.

Thanks.

Grabber commented 8 years ago

I'm facing the same problem training SqueezeNet on Darknet... the network loss gets stuck at 5~6 after 80k iterations.

```
Detection Avg IOU: 0.354223, Pos Cat: 0.995564, All Cat: 0.995564, Pos Obj: 0.010827, Any Obj: 0.004997, count: 23
Detection Avg IOU: 0.352646, Pos Cat: 0.995592, All Cat: 0.995592, Pos Obj: 0.010121, Any Obj: 0.004997, count: 23
Detection Avg IOU: 0.368458, Pos Cat: 0.996199, All Cat: 0.996199, Pos Obj: 0.013484, Any Obj: 0.004997, count: 31
Detection Avg IOU: 0.384327, Pos Cat: 0.995394, All Cat: 0.995394, Pos Obj: 0.010156, Any Obj: 0.004997, count: 27
18488: 4.462246, 5.304660 avg, 0.026577 rate, 4.329206 seconds, 1183232 images
Loaded: 0.000039 seconds
```

In my case the dataset converges on vanilla AlexNet, but does not converge with SqueezeNet no matter how many times I start training from scratch. Note: Darknet doesn't implement Xavier initialization, so I'm using the default random initialization.
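For anyone wanting to approximate it elsewhere, here is a minimal sketch of Glorot/Xavier uniform initialization (the formula from Glorot & Bengio; note that Caffe's `"xavier"` filler uses a fan-in variant by default, so this is illustrative rather than Caffe's exact code):

```python
import math
import random

def xavier_uniform(fan_in, fan_out, n, seed=0):
    # Glorot & Bengio: sample U(-limit, limit) with
    # limit = sqrt(6 / (fan_in + fan_out)), which keeps activation
    # variance roughly constant across layers.
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    rng = random.Random(seed)
    return [rng.uniform(-limit, limit) for _ in range(n)]

# hypothetical 3x3 conv layer with 64 input and 64 output channels
weights = xavier_uniform(fan_in=3 * 3 * 64, fan_out=3 * 3 * 64, n=10000)
```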

SqueezeNet v1.1 port for Darknet: https://gist.github.com/Grabber/65760c4b4e5b4cf9a82f11193a8154dd

@forresti do you have any insight on this? Please answer my messages on LinkedIn or pick up the phone ;)

kli-casia commented 8 years ago

I use Caffe, and commented out `random_seed: 42` in solver.prototxt. Now SqueezeNet v1.1 works very well. Here is my training log: https://gist.github.com/kli-nlpr/5f54a24a1215af9fcd9faaf16be6d54d

Grabber commented 8 years ago

I don't have any more ideas for overcoming the loss being stuck at 5~6 on Darknet.

bluekingdom commented 8 years ago

I met the same problem! When training with a learning rate in [0.1, 0.01], the loss always stays at 6.9... Then I used a learning rate of 0.001 and the loss decreases well. I don't know whether starting training at 0.001 would cause overfitting or any other problems?

Grabber commented 8 years ago

Could you share your solver.prototxt?


bluekingdom commented 8 years ago

> Could you share your solver.prototxt?

I don't use the default solver:

```
net: "protoueezenet_face.prototxt"
snapshot_prefix: "snapshotueezenet_face_1"
base_lr: 0.001
display: 1000
test_interval: 5000
snapshot: 5000
test_iter: 100
lr_policy: "step"
gamma: 0.1
stepsize: 100000
max_iter: 300000
momentum: 0.9
weight_decay: 0.0002
solver_mode: GPU
```
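(For context, with those settings Caffe's `"step"` policy multiplies the learning rate by `gamma` every `stepsize` iterations. A minimal sketch of the schedule:)

```python
def step_lr(base_lr, gamma, stepsize, iteration):
    # Caffe "step" policy: lr = base_lr * gamma ^ floor(iter / stepsize)
    return base_lr * gamma ** (iteration // stepsize)

# with the solver above: lr stays at 0.001 until iteration 100000,
# then drops to ~1e-4, then ~1e-5 at iteration 200000
for it in (0, 100000, 200000):
    print(it, step_lr(0.001, 0.1, 100000, it))
```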

Grabber commented 8 years ago

@bluekingdom thank you! On the Darknet framework nothing is working either... at a certain point (5~6) the loss stops decreasing.

mrgloom commented 7 years ago

Same problem; learning is very unstable. I tested it on 20k images from https://www.kaggle.com/c/dogs-vs-cats. For the first 5-10 epochs the learning curve looks dead, but then it starts learning. Even worse, I successfully trained it once but then couldn't reproduce the results with the same settings!

Grabber commented 7 years ago

@mrgloom what framework are you using for training?

mrgloom commented 7 years ago

I'm using NVIDIA DIGITS with Caffe backend.

wyasuda commented 7 years ago

Hey, I used the same dataset, https://www.kaggle.com/c/dogs-vs-cats, and my solver.prototxt is shown below.

```
test_iter: 1000
test_interval: 1000
base_lr: 0.04
display: 40
max_iter: 100000
iter_size: 16
lr_policy: "poly"
power: 1.0
momentum: 0.9
weight_decay: 0.0002
snapshot: 10000
snapshot_prefix: "D:/deeplearning-cats-dogs-tutorial/caffe_models/SqueezeNet/SqueezeNet"
solver_mode: GPU
random_seed: 42
net: "D:/deeplearning-cats-dogs-tutorial/caffe_models/SqueezeNet/train_val_v1.1.prototxt"
test_initialization: false
average_loss: 40
```
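(With `lr_policy: "poly"` and `power: 1.0`, Caffe decays the learning rate linearly from `base_lr` down to zero over `max_iter`. A sketch of the schedule:)

```python
def poly_lr(base_lr, power, max_iter, iteration):
    # Caffe "poly" policy: lr = base_lr * (1 - iter/max_iter) ^ power
    return base_lr * (1.0 - float(iteration) / max_iter) ** power

# power: 1.0 makes this a straight line from base_lr down to 0
print(poly_lr(0.04, 1.0, 100000, 0))       # 0.04
print(poly_lr(0.04, 1.0, 100000, 50000))   # 0.02
print(poly_lr(0.04, 1.0, 100000, 100000))  # 0.0
```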

Please see the attached learning curve. The accuracy is only <0.8 even though it is just 2-class classification.

[image: learning_curve_part]

Are there any problems in my solver setting?

mrgloom commented 7 years ago

You can check this out: https://github.com/mrgloom/kaggle-dogs-vs-cats-solution/tree/master/learning_from_scratch/Models/SqeezeNet_v1.1 However, I can't reproduce the results with the same settings.

wyasuda commented 7 years ago

I used solver.prototxt shown below.

```
test_iter: 100
test_interval: 1000
base_lr: 0.001
display: 100
max_iter: 30000
iter_size: 16
lr_policy: "poly"
power: 1.0
momentum: 0.9
weight_decay: 0.0002
snapshot: 1000
snapshot_prefix: "D:/deeplearning-cats-dogs-tutorial/caffe_models/SqueezeNet/SqueezeNet"
solver_mode: GPU
net: "D:/deeplearning-cats-dogs-tutorial/caffe_models/SqueezeNet/train_val_v1.1.prototxt"
test_initialization: false
```

Now, it looks better.

[image: learning_curve]

I am not sure if this is the best for SqueezeNet 2-class classification.

forresti commented 7 years ago

For the people who are experimenting with Dogs vs Cats... this person did some experiments with SqueezeNet and other models for a similar challenge: https://florianbordes.wordpress.com/2016/04/16/cats-vs-dogs-12-summary-and-conclusion/

mrgloom commented 7 years ago

Here is a working example: https://github.com/mrgloom/kaggle-dogs-vs-cats-solution/tree/master/learning_from_scratch/Models/SqeezeNet_v1.1