amplab / SparkNet

Distributed Neural Networks for Spark

GoogleNet training successful using PSGD method-- will Adam, AdaDelta work? #140

Open nhe150 opened 8 years ago

nhe150 commented 8 years ago

1) I used a modified version of SparkNet. 2) I successfully trained GoogleNet from scratch on 2 machines, covering the entire ImageNet dataset (1,281,167 images). 3) The model reached 62.3% top-1 and 84.7% top-5 accuracy in 26 epochs.

Hopefully some statistician can prove that PSGD works, at least with simple momentum and the Nesterov method. For AdaDelta and Adam (which track squared gradients), I am not sure about the implications of PSGD.
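To make the discussion concrete, here is a minimal single-process sketch of PSGD with classical momentum: each simulated worker takes a few local momentum-SGD steps on its own gradients, and the only communication is a periodic averaging of the workers' weights. The names (`psgdRound`, `gradFn`, `localSteps`) are illustrative, not SparkNet's actual API.

```scala
object PsgdMomentumSketch {
  // One PSGD round: every worker takes `localSteps` momentum-SGD steps on its
  // own data, then the workers' weight vectors are averaged (the only
  // communication step). Returns the averaged model.
  def psgdRound(
      weights: Array[Array[Double]],    // one weight vector per worker
      velocities: Array[Array[Double]], // one momentum buffer per worker
      gradFn: (Int, Array[Double]) => Array[Double], // gradient on worker w's data
      lr: Double,
      momentum: Double,
      localSteps: Int): Array[Double] = {
    for (w <- weights.indices; _ <- 0 until localSteps) {
      val g = gradFn(w, weights(w))
      for (i <- weights(w).indices) {
        velocities(w)(i) = momentum * velocities(w)(i) - lr * g(i)
        weights(w)(i) += velocities(w)(i)
      }
    }
    // Communication step: average the workers' weights into one model.
    val dim = weights(0).length
    val avg = new Array[Double](dim)
    for (wv <- weights; i <- 0 until dim) avg(i) += wv(i) / weights.length
    avg
  }
}
```

Averaging only the weights is well-defined here because the momentum buffers stay worker-local; the open question above is what to do with the extra per-parameter state that squared-gradient methods carry.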

robertnishihara commented 8 years ago

Nice work! Thanks for running the benchmark.

nhe150 commented 8 years ago

The key for distributed data training is starting from a model whose top-1 accuracy on GoogleNet is already high enough (say at least 5%; I call this step the "first opinion"). Otherwise SparkNet gets stuck at random guessing instead of learning its way to higher accuracy.

I am starting to suspect this initial opinion is critical and may introduce bias later on.
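A hedged sketch of this warm-start recipe, under the assumption that the workflow is "train on one machine until the model clears the first-opinion threshold, then distribute". All three function parameters (`trainOneEpochLocally`, `top1Accuracy`, `runPsgd`) are hypothetical placeholders, not SparkNet functions.

```scala
// Hypothetical helpers, labeled as such: none of these exist in SparkNet.
def warmStartThenPsgd(
    trainOneEpochLocally: Array[Double] => Array[Double], // single-machine epoch
    top1Accuracy: Array[Double] => Double,                // held-out top-1 accuracy
    runPsgd: Array[Double] => Array[Double],              // distributed training
    init: Array[Double]): Array[Double] = {
  var w = init
  // Stay on one machine until the model beats random guessing by a margin
  // (the ~5% top-1 "first opinion" threshold suggested above).
  while (top1Accuracy(w) < 0.05) w = trainOneEpochLocally(w)
  runPsgd(w) // only now start the parallel workers from this shared model
}
```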

nhe150 commented 8 years ago

Hopefully some statistician can start from the above observation and show that first-order PSGD methods (momentum, Nesterov momentum) converge given a reasonable starting point, even when the workers see dramatically different data in parallel with only occasional communication. I will also try out squared-gradient methods (AdaDelta, Adam, RMSProp, etc.) to test their convergence. There is strong evidence suggesting that squared-gradient methods also converge under PSGD.
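For reference, here is a minimal sketch of one Adam step, showing the per-parameter second-moment state `v` (the "squared gradients") that makes the PSGD question nontrivial: under parameter averaging one must decide whether `m` and `v` are averaged across workers along with the weights. Hyperparameter defaults follow the Adam paper; this is not SparkNet code.

```scala
// One in-place Adam update for weights w, given gradient g and moment
// buffers m (first moment) and v (second moment, squared gradients).
def adamStep(
    w: Array[Double], m: Array[Double], v: Array[Double], g: Array[Double],
    t: Int,                       // 1-based step count, for bias correction
    lr: Double = 1e-3, b1: Double = 0.9, b2: Double = 0.999,
    eps: Double = 1e-8): Unit = {
  for (i <- w.indices) {
    m(i) = b1 * m(i) + (1 - b1) * g(i)        // first moment (plain momentum)
    v(i) = b2 * v(i) + (1 - b2) * g(i) * g(i) // second moment: squared gradients
    val mHat = m(i) / (1 - math.pow(b1, t))   // bias-corrected estimates
    val vHat = v(i) / (1 - math.pow(b2, t))
    w(i) -= lr * mHat / (math.sqrt(vHat) + eps)
  }
}
```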

nhe150 commented 8 years ago

And it seems PSGD converges faster than all the other methods. I will call this "clustering wisdom" in training.