JianGoForIt / YellowFin

auto-tuning momentum SGD optimizer
Apache License 2.0

Optimizer fixes and refactor #18

Closed mfernezir closed 7 years ago

mfernezir commented 7 years ago

Changes:

I've tested it as a stand-alone module in my current TF 1.2 / Python 3 setup on multiple GPUs, and it works fine.

EDIT: In the last commit, I have reverted to the original choice of having YellowFin inherit from object rather than from tf.train.Optimizer. This ensures that the optimizer doesn't inherit public and private methods from the Optimizer class that wouldn't work here. Reference: https://github.com/tensorflow/tensorflow/blob/349932f4400d15d610f7b6e51923c6a60ddd186b/tensorflow/python/training/optimizer.py#L181

Compared to the original implementation, I have added wrappers for all other public methods of tf.train.Optimizer (get_name, get_slot_names and get_slot). For the already present wrappers around the public methods minimize, compute_gradients and apply_gradients, I have made changes so that they support all input arguments supported by the TF API.
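
Roughly, the wrapper pattern looks like this (a simplified sketch, not the code in this PR; the class name is illustrative and the actual auto-tuning logic is omitted):

```python
import tensorflow as tf

class YFOptimizerSketch(object):
    """Simplified stand-in for the YF wrapper described above."""

    def __init__(self, learning_rate=0.1, momentum=0.0, use_nesterov=False):
        # lr and mu live in variables so the auto-tuner can update them each step
        self._lr_var = tf.Variable(learning_rate, trainable=False, dtype=tf.float32)
        self._mu_var = tf.Variable(momentum, trainable=False, dtype=tf.float32)
        self._optimizer = tf.train.MomentumOptimizer(
            self._lr_var, self._mu_var, use_nesterov=use_nesterov)

    # thin wrappers around the underlying optimizer's public API
    def compute_gradients(self, loss, var_list=None, **kwargs):
        return self._optimizer.compute_gradients(loss, var_list=var_list, **kwargs)

    def apply_gradients(self, grads_and_vars, global_step=None, name=None):
        # the real implementation also updates _lr_var / _mu_var from its
        # curvature and variance estimates before delegating
        return self._optimizer.apply_gradients(
            grads_and_vars, global_step=global_step, name=name)

    def minimize(self, loss, global_step=None, var_list=None, **kwargs):
        grads_and_vars = self.compute_gradients(loss, var_list=var_list, **kwargs)
        return self.apply_gradients(grads_and_vars, global_step=global_step)

    def get_name(self):
        return self._optimizer.get_name()

    def get_slot(self, var, name):
        return self._optimizer.get_slot(var, name)

    def get_slot_names(self):
        return self._optimizer.get_slot_names()
```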

In addition, since the YF optimizer uses the underlying tf.train.MomentumOptimizer, which supports the use_nesterov=True option, I have also made it possible to instantiate YF with that option. See below for graphs on CIFAR 10.
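
For example, with the sketch class above (the real constructor may differ slightly in argument names and defaults):

```python
# loss and global_step come from the surrounding training graph
opt = YFOptimizerSketch(learning_rate=0.1, momentum=0.0, use_nesterov=True)
train_op = opt.minimize(loss, global_step=global_step)
```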

JianGoForIt commented 7 years ago

Hi @mfernezir,

Thanks for the wonderful pull request on refactoring. It will be a great help for smooth use in the community. It is super that it now works smoothly with multiple GPUs. We are very happy to merge the pull request!

As a double sanity check before merging, could you please do me the following favors:

  1. Could you run one or two of the experiments (ideally one LSTM and one ResNet) in the repo with YF before and after the pull request, using a single GPU? Do you get reasonably similar performance? Maybe using the same random seed? (I totally understand that even with the same seed, numbers may be slightly different.)

  2. Could you please run synchronous multi-GPU YF on an experiment in the repo? If it gives similar results to the single-GPU run (using the same effective batch size, say a single GPU with batch size 100 vs. 2 GPUs each with batch size 50), that would be strong evidence to verify the multi-GPU version. A rough sketch of the kind of setup I mean is below.
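
For concreteness, the synchronous data-parallel pattern I have in mind looks roughly like this (just a sketch; build_model and the image/label shards are hypothetical placeholders for whatever network and input pipeline is used, and any optimizer exposing compute_gradients/apply_gradients can be dropped in):

```python
import tensorflow as tf

def average_gradients(tower_grads):
    # tower_grads: one list of (grad, var) pairs per GPU; average grads per variable
    # (assumes every variable receives a gradient in every tower)
    averaged = []
    for grads_and_vars in zip(*tower_grads):
        grads = [g for g, _ in grads_and_vars]
        var = grads_and_vars[0][1]
        averaged.append((tf.add_n(grads) / float(len(grads)), var))
    return averaged

num_gpus = 2
global_step = tf.Variable(0, trainable=False, name='global_step')
# stand-in optimizer; the YF wrapper exposes the same compute/apply interface
optimizer = tf.train.MomentumOptimizer(0.01, 0.9)

tower_grads = []
for i in range(num_gpus):
    with tf.device('/gpu:%d' % i), tf.variable_scope('model', reuse=(i > 0)):
        # each tower sees 1/num_gpus of the global batch, so the effective
        # batch size matches the single-GPU run (e.g. 2 x 50 vs. 1 x 100)
        loss = build_model(image_shards[i], label_shards[i])  # hypothetical helpers
        tower_grads.append(optimizer.compute_gradients(loss))

train_op = optimizer.apply_gradients(average_gradients(tower_grads),
                                     global_step=global_step)
```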

Thanks again for the contribution. Please let us know about the results.

Best

mfernezir commented 7 years ago

Hi! I've just started tests on CIFAR 10 on a single GPU. I'll have the results in the morning and post them later on tomorrow.

I am using Python 3 and there were some print issues I had to fix, unrelated to this merge request. I'll set up a TF 1.2 / Python 2 environment for further tests.

For easier comparison, I have added tf.set_random_seed(1729) right after tf.reset_default_graph() in CIFAR10-release.py.
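
That is, the relevant lines in CIFAR10-release.py now read:

```python
tf.reset_default_graph()
tf.set_random_seed(1729)  # fixed seed for easier before/after comparison
```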

Regarding multi-GPU tests, the scripts in this repository don't support them. However, the YF optimizer can be used as a stand-alone module, and I have a framework at work that can take any TF optimizer. I don't have the exact networks you have readily available, but I'll use it on our own convolutional networks that we use for large-scale image classification (millions of images and hundreds of classes). I'll run some tests in the next few days and post some results.

Regarding the actual merge request, I've thought about it some more and I'll make some additional changes. None of it should break anything and in any case, I'll test it again with my most recent version.

JianGoForIt commented 7 years ago

@mfernezir Thanks for the efforts and detailed info. Please keep us updated. :)

mfernezir commented 7 years ago

CIFAR 10 graphs look okay:

new_fig_loss_iter_40000.pdf original_fig_loss_iter_40000.pdf

JianGoForIt commented 7 years ago

@mfernezir Thanks for the figures. Once you have results from a multi-GPU setting that are comparable to the single-GPU case with the same settings, I will merge this into the repo. Please keep us updated.

Thanks

mfernezir commented 7 years ago

I have made some additional changes and updated the pull request description. Here are the new graphs for CIFAR 10, still looking very similar to the original implementation's graphs, as they should.

original: https://github.com/JianGoForIt/YellowFin/files/1206803/original_fig_loss_iter_40000.pdf
new implementation: v2_new_fig_loss_iter_40000.pdf
new implementation with use_nesterov=True: v2_new_nesterov_fig_loss_iter_40000.pdf

I am going to do more tests later on today and tomorrow.

JianGoForIt commented 7 years ago

@mfernezir Thanks for following up.

Our method is designed around Polyak momentum, which is slightly different from Nesterov's. Some of the arguments and reasoning there might not strictly hold with Nesterov's momentum, but it is great to see that it also works empirically. My suggestion is to keep the default as use_nesterov=False.

mfernezir commented 7 years ago

Sure, I've kept the default for tf.train.MomentumOptimizer which is use_nesterov=False.

Here are the comparisons for single GPU vs 2 GPU on a large classification dataset I'm working with.

Setup:

It would take a lot of time to fully train this dataset from scratch. Here are the results after one night of training:

Accuracy (figure)

Total loss (figure)

As you can see, all validation curves are close to each other and likewise for training curves in all three tests. In addition, I've tracked the internal optimizer._lr_var and optimizer._mu_var:

Learning rate (figure)

Momentum (figure)
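
Tracking those internals amounts to something like the following (a sketch assuming the TF 1.x summary API; the tag names and log directory are illustrative):

```python
# optimizer is the YF instance; _lr_var and _mu_var are its internal variables
tf.summary.scalar('yellowfin/learning_rate', optimizer._lr_var)
tf.summary.scalar('yellowfin/momentum', optimizer._mu_var)
merged = tf.summary.merge_all()
writer = tf.summary.FileWriter('/tmp/yf_logs')  # illustrative log directory
# evaluate `merged` alongside the train op and pass the result to writer.add_summary
```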

Note that I had to reduce the initial learning rate to 0.01 or I would otherwise get NaN errors. This is surely related to the actual neural network choice (no batch normalization) and dataset at hand.

All in all, these plots confirm that the optimizer works as a stand-alone module in a multi GPU training framework.

JianGoForIt commented 7 years ago

Hi @mfernezir,

Thanks for all the efforts on the standardization of YellowFin. I have merged the PR.

Regarding lr=0.01, it is one possible solution. An alternative approach is gradient clipping; some users have reported it to be useful and to give good performance.
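
For reference, a minimal sketch of that alternative, assuming an optimizer exposing compute_gradients/apply_gradients as in this PR (the clip norm value is illustrative):

```python
# loss and global_step come from the surrounding training graph
grads_and_vars = optimizer.compute_gradients(loss)
grads, tvars = zip(*grads_and_vars)
clipped, _ = tf.clip_by_global_norm(grads, clip_norm=5.0)  # illustrative norm
train_op = optimizer.apply_gradients(list(zip(clipped, tvars)),
                                     global_step=global_step)
```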