XuezheMax / apollo

Apollo: An Adaptive Parameter-wise Diagonal Quasi-Newton Method for Nonconvex Stochastic Optimization
Apache License 2.0

Any expectation on noisy data? #3

Closed: soloice closed this issue 4 years ago

soloice commented 4 years ago

This is the first quasi-Newton-style optimizer I've seen that really works for training NNs! I really appreciate it!

Recently I tried this optimizer on my dataset, which is quite noisy and difficult (i.e., one cannot get an accuracy above 0.6 on the binary classification task, regardless of the model architecture or optimizer chosen). I did some hyper-parameter tuning (though not much), e.g., varying init_lr from 1e-3 to 1e-2 and lr from 0.01 to 1.0, but got no better results than with Adam.

Is this expected? I suppose this is because Apollo needs to estimate the Hessian somehow, which is really difficult in my setting. Adam and other SGD variants, on the other hand, rely only on first-order information and so are more robust.

XuezheMax commented 4 years ago

Thanks for your interest.

For your question about noisy data: I am not sure what your data look like, but I don't think Apollo is less robust than SGD or Adam on noisy data. Apollo does approximate the Hessian, but it does so using only first-order information.

In addition, warmup is super important for Apollo. Please use at least 100 updates for warmup. For your experiments, I suggest setting lr=1.0, init_lr=0.01, and warmup=100 to see how it compares with Adam.
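
For reference, a minimal sketch of this configuration (assuming the `Apollo` class and constructor arguments as documented in this repo; adjust the import path and argument names if your copy differs):

```python
import torch
from optim import Apollo  # assumed location within this repo

model = torch.nn.Linear(16, 1)          # placeholder model
optimizer = Apollo(model.parameters(),
                   lr=1.0,              # peak learning rate after warmup
                   init_lr=0.01,        # learning rate at the start of warmup
                   warmup=100)          # at least 100 warmup updates
```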

Could you please paste more details of your experiments (e.g., the training log) so that we can better analyze the results?

XuezheMax commented 4 years ago

As mentioned in the paper, another important hyper-parameter is weight decay. Since the scale of lr is very different between Apollo and Adam, the weight decay needs to be re-tuned so that `lr_adam * weight_decay_adam = lr_apollo * weight_decay_apollo`. For example, if we set lr=1e-3 and weight_decay=1e-4 for Adam and lr=1.0 for Apollo, the optimal weight_decay for Apollo should be about 1e-7.
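
As a quick sanity check, the rescaling works out as follows (plain arithmetic; the variable names are illustrative):

```python
# Rescaling rule: lr_adam * weight_decay_adam == lr_apollo * weight_decay_apollo
lr_adam, weight_decay_adam = 1e-3, 1e-4
lr_apollo = 1.0

weight_decay_apollo = (lr_adam * weight_decay_adam) / lr_apollo
print(weight_decay_apollo)  # 1e-07
```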

soloice commented 4 years ago

I'm not using weight decay.

The task is stock price prediction (regression rather than classification), and privacy reasons prevent me from posting my training log here. In any case, I suspect the loss is not a good indicator of training progress, because it oscillates like GAN losses: too noisy to show a downward trend, no matter which optimizer is used. Although I do record losses during training, I seldom check them. Instead, I periodically check the correlation between model predictions and ground-truth prices to pick the best model.
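
The model-selection check looks roughly like this (a hypothetical sketch; `model`, `val_inputs`, and `val_prices` are illustrative names, not from my actual code):

```python
import numpy as np
import torch

def validation_correlation(model, val_inputs, val_prices):
    # Pearson correlation between model predictions and ground-truth prices;
    # tracked instead of the (very noisy) training loss.
    with torch.no_grad():
        preds = model(val_inputs).cpu().numpy().ravel()
    truth = val_prices.cpu().numpy().ravel()
    return np.corrcoef(preds, truth)[0, 1]
```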

SGD can't match the performance of Adam on my task, either.

XuezheMax commented 4 years ago

Interesting. How large is the gap between Apollo (or SGD) and Adam? And have you tried RAdam?

soloice commented 4 years ago

Just tried it out: RAdam performs roughly the same as Adam.

XuezheMax commented 4 years ago

Thanks for your feedback. If you could provide more information about your task, we could analyze why Apollo underperforms Adam and RAdam.