Simsso / NIPS-2018-Adversarial-Vision-Challenge

Code, documents, and deployment configuration files related to our participation in the 2018 NIPS Adversarial Vision Challenge "Robust Model Track"
MIT License

Gradient Accumulation #64

Closed. Simsso closed this issue 5 years ago.

Simsso commented 5 years ago

Since our batch size is currently very limited (due to the inefficient memory usage of the VQ layer, as described in #58), we should implement a "virtual batch size", i.e. accumulate the gradients of multiple batches and apply them at once. That way we could specify a compute_batch_size and an update_batch_size, where the former depends on the GPU memory and the latter on our preference.

This SO answer contains relevant information.

Development in the gradient-accumulation branch
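
A minimal sketch of how the accumulation could look in TensorFlow 1.x (the function and variable names are illustrative, not the branch's actual code): one non-trainable accumulator per trainable variable, an op that adds the gradients of the current compute batch, and an op that applies the accumulated gradients and resets the accumulators.

```python
import tensorflow as tf

def build_accumulation_ops(optimizer, loss):
    """Illustrative gradient accumulation: sum gradients over several small
    compute batches and apply them once per (virtual) update batch."""
    tvars = tf.trainable_variables()
    # Assumes every trainable variable receives a gradient (no None entries).
    grads = tf.gradients(loss, tvars)

    # One zero-initialized, non-trainable accumulator per trainable variable.
    accums = [tf.Variable(tf.zeros_like(v.initialized_value()), trainable=False)
              for v in tvars]

    # Run once per compute batch: add the current gradients to the accumulators.
    accum_op = tf.group(*[a.assign_add(g) for a, g in zip(accums, grads)])

    # Run once per update batch: apply the summed gradients, then reset.
    apply_op = optimizer.apply_gradients(list(zip(accums, tvars)))
    with tf.control_dependencies([apply_op]):
        reset_op = tf.group(*[a.assign(tf.zeros_like(a)) for a in accums])

    return accum_op, reset_op
```

In the training loop, accum_op would then run once per compute batch and reset_op (which also applies the gradients) once per update batch.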

Simsso commented 5 years ago

Validation

The implementation was validated by running with different configurations on the CPU; all random seeds are specified. The log output is expected to be identical; however, it is only similar.

batch_size=8, virtual_batch_size_factor=1

DEBUG:tensorflow:0.0
DEBUG:tensorflow:0.0
DEBUG:tensorflow:0.0
DEBUG:tensorflow:0.0
DEBUG:tensorflow:0.0
DEBUG:tensorflow:8.706924
DEBUG:tensorflow:0.165
DEBUG:tensorflow:0.0
DEBUG:tensorflow:0.0
DEBUG:tensorflow:0.0
DEBUG:tensorflow:0.0
DEBUG:tensorflow:[ 7826.42  8236.92   564.34   165.1    342.76 11628.6   1004.7   2999.16]
DEBUG:tensorflow:0.0
DEBUG:tensorflow:8.916469
DEBUG:tensorflow:0.12

batch_size=4, virtual_batch_size_factor=2

DEBUG:tensorflow:0.0
DEBUG:tensorflow:0.0
DEBUG:tensorflow:0.0
DEBUG:tensorflow:0.0
DEBUG:tensorflow:0.0
DEBUG:tensorflow:9.405837
DEBUG:tensorflow:0.11
DEBUG:tensorflow:0.0
DEBUG:tensorflow:0.0
DEBUG:tensorflow:0.0
DEBUG:tensorflow:0.0
DEBUG:tensorflow:[3870.42 4101.34  308.4    84.58  177.26 5830.18  465.78 1546.04]
DEBUG:tensorflow:0.0
DEBUG:tensorflow:9.426175

Update

Setting a seed for the input pipeline, `data = data.shuffle(buffer_size=self.__get_num_samples(mode), seed=15092017)`, did not make the delta vanish either.
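
For context, TensorFlow 1.x seeds randomness on two levels, and both have to be pinned for runs to line up; the snippet is illustrative and not necessarily how our training script sets its seeds:

```python
import tensorflow as tf

# Graph-level seed: op-level seeds of random ops are derived from it,
# so they become deterministic across runs of the same graph.
tf.set_random_seed(15092017)

# Op-level seed on an individual random op, e.g. the dataset shuffle
# (dataset and buffer size are placeholders for illustration).
dataset = tf.data.Dataset.range(100)
dataset = dataset.shuffle(buffer_size=100, seed=15092017)
```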

Update 2

Now averaging the gradients rather than just summing them up, following these notes. I cannot even reproduce identical results on the CPU without any changes to the hyperparameters, so for now it must suffice that the results of the following three configurations are similar. This might be related to the fact that there are both global and operation-level random seeds.
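
Relative to the summing sketch above, the averaging change is small; using the names from that sketch, with virtual_batch_size_factor as the number of accumulated compute batches (again illustrative, not the branch's actual code):

```python
# Average instead of sum: scale the accumulated gradients by the number of
# compute batches per update batch before handing them to the optimizer.
mean_grads = [a / float(virtual_batch_size_factor) for a in accums]
apply_op = optimizer.apply_gradients(list(zip(mean_grads, tvars)))
```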

batch_size=1, virtual_batch_size_factor=8

DEBUG:tensorflow:0.025
DEBUG:tensorflow:0.0
DEBUG:tensorflow:0.05
DEBUG:tensorflow:0.025
DEBUG:tensorflow:0.0
DEBUG:tensorflow:10.77883
DEBUG:tensorflow:0.0
DEBUG:tensorflow:0.05
DEBUG:tensorflow:0.0
DEBUG:tensorflow:0.0
DEBUG:tensorflow:0.0
DEBUG:tensorflow:[1023.44  955.54   68.96   19.96   39.22 1493.86   85.08  409.94]
DEBUG:tensorflow:0.0
DEBUG:tensorflow:10.791368

batch_size=2, virtual_batch_size_factor=4

DEBUG:tensorflow:0.025
DEBUG:tensorflow:0.0
DEBUG:tensorflow:0.025
DEBUG:tensorflow:0.0
DEBUG:tensorflow:0.0
DEBUG:tensorflow:10.199922
DEBUG:tensorflow:0.02
DEBUG:tensorflow:0.0
DEBUG:tensorflow:0.0
DEBUG:tensorflow:0.0
DEBUG:tensorflow:0.0
DEBUG:tensorflow:[1992.36 2000.1   140.34   36.26   80.28 2929.6   229.14  783.92]
DEBUG:tensorflow:0.0
DEBUG:tensorflow:9.939659

batch_size=8, virtual_batch_size_factor=1

DEBUG:tensorflow:0.0
DEBUG:tensorflow:0.0
DEBUG:tensorflow:0.0
DEBUG:tensorflow:0.0
DEBUG:tensorflow:0.0
DEBUG:tensorflow:8.620282
DEBUG:tensorflow:0.14
DEBUG:tensorflow:0.0
DEBUG:tensorflow:0.0
DEBUG:tensorflow:0.0
DEBUG:tensorflow:0.0
DEBUG:tensorflow:[ 7850.64  8113.28   606.38   143.86   341.58 11574.56   989.28  3148.42]
DEBUG:tensorflow:0.0
DEBUG:tensorflow:8.725957

Update 3

Updating the accumulators to compute the correct mean (https://github.com/Simsso/NIPS-2018-Adversarial-Vision-Challenge/commit/01343e924d21bd416c05a361ff1a4b9f9cf2c5ff).
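
The commit itself is not reproduced here; one way to make the accumulators yield a correct mean regardless of how many compute batches actually went in is to track that count explicitly, as in this hypothetical sketch (continuing the names from the sketches above):

```python
# Hypothetical: count how many compute-batch gradients have been accumulated
# and divide by that count instead of a fixed factor at apply time.
count = tf.Variable(0.0, trainable=False)

accum_op = tf.group(
    count.assign_add(1.0),
    *[a.assign_add(g) for a, g in zip(accums, grads)])

mean_grads = [a / tf.maximum(count, 1.0) for a in accums]
apply_op = optimizer.apply_gradients(list(zip(mean_grads, tvars)))

with tf.control_dependencies([apply_op]):
    reset_op = tf.group(
        count.assign(0.0),
        *[a.assign(tf.zeros_like(a)) for a in accums])
```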

batch_size=1, virtual_batch_size_factor=16

DEBUG:tensorflow:0.017307693
DEBUG:tensorflow:0.0078125
DEBUG:tensorflow:0.009375
DEBUG:tensorflow:0.0140625
DEBUG:tensorflow:0.0109375
DEBUG:tensorflow:10.768374
DEBUG:tensorflow:0.0025974025
DEBUG:tensorflow:0.0109375
DEBUG:tensorflow:0.0125
DEBUG:tensorflow:0.0140625
DEBUG:tensorflow:0.0078125
DEBUG:tensorflow:[15323.34 16056.18  1180.34   305.54   679.72 22653.    1986.98  6122.1 ]
DEBUG:tensorflow:0.00625
DEBUG:tensorflow:10.764653
DEBUG:tensorflow:0.0025

batch_size=2, virtual_batch_size_factor=8

DEBUG:tensorflow:0.0
DEBUG:tensorflow:0.0
DEBUG:tensorflow:0.0
DEBUG:tensorflow:0.0
DEBUG:tensorflow:0.0
DEBUG:tensorflow:10.100628
DEBUG:tensorflow:0.051813472
DEBUG:tensorflow:0.0
DEBUG:tensorflow:0.0
DEBUG:tensorflow:0.0
DEBUG:tensorflow:0.0
DEBUG:tensorflow:[15231.2  16149.54  1166.62   313.34   666.98 22715.8   1992.86  6152.78]
DEBUG:tensorflow:0.0
DEBUG:tensorflow:10.145672
DEBUG:tensorflow:0.04

batch_size=16, virtual_batch_size_factor=1

DEBUG:tensorflow:0.0
DEBUG:tensorflow:0.0
DEBUG:tensorflow:0.0
DEBUG:tensorflow:0.0
DEBUG:tensorflow:0.0
DEBUG:tensorflow:8.341681
DEBUG:tensorflow:0.15
DEBUG:tensorflow:0.0
DEBUG:tensorflow:0.0
DEBUG:tensorflow:0.0
DEBUG:tensorflow:0.0
DEBUG:tensorflow:[15624.3  16319.12  1195.44   309.24   689.42 23128.82  2012.88  6256.78]
DEBUG:tensorflow:0.0
DEBUG:tensorflow:8.4202175
DEBUG:tensorflow:0.17

Simsso commented 5 years ago

Testing the new feature with more realistic values, i.e. a virtual batch size of 512.

Getting NaN losses, so something is still broken.

Simsso commented 5 years ago

The NaN loss was caused by the Coulomb loss in combination with batch sizes < 3. This remains to be investigated but is not an issue with the accumulation itself. Verification of the method by comparing the embedding spaces of two runs with 2/64 and 1/128 (batch size / factor):

INFO:tensorflow:Restoring parameters from /Users/timodenk/.models/tiny_imagenet_alp05_2018_06_26.ckpt
INFO:tensorflow:Model loaded from /Users/timodenk/.models/tiny_imagenet_alp05_2018_06_26.ckpt
DEBUG:tensorflow:loss: 10.947551727294922
INFO:tensorflow:Writing custom summary object to '../tf_logs/train'
DEBUG:tensorflow:[[ 1.2112676e-04 -4.0615603e-05]
 [-6.1216939e-05  1.4470359e-04]
 [-1.0199162e-04  2.1006863e-05]
 [-3.8992225e-06  1.8839956e-04]
 [-1.9535322e-05 -9.4587798e-05]
 [ 8.0312939e-06 -1.1369006e-04]
 [-2.9969631e-05  5.5452703e-05]
 [ 7.1249371e-05 -1.7472272e-05]]
DEBUG:tensorflow:loss: 11.144996643066406
INFO:tensorflow:Writing custom summary object to '../tf_logs/train'
DEBUG:tensorflow:[[ 1.9875291e-04 -4.2323518e-07]
 [-1.9375242e-05  1.8461425e-04]
 [-6.3313651e-05  5.9905855e-05]
 [ 5.3762626e-05  2.7118195e-04]
 [ 2.3648290e-05 -5.5787561e-05]
 [ 5.1927189e-05 -8.2355669e-05]
 [ 1.3328043e-05  9.4276766e-05]
 [ 1.1443735e-04  2.1390109e-05]]
DEBUG:tensorflow:loss: 11.095342636108398
INFO:tensorflow:Writing custom summary object to '../tf_logs/train'
DEBUG:tensorflow:[[ 2.7571080e-04  3.9752762e-05]
 [ 2.2588189e-05  2.2416531e-04]
 [-2.5286132e-05  9.9098761e-05]
 [ 1.1023581e-04  3.5502610e-04]
 [ 6.6386943e-05 -1.6663031e-05]
 [ 9.5182186e-05 -5.0529168e-05]
 [ 5.6122793e-05  1.3335870e-04]
 [ 1.5713921e-04  6.0510451e-05]]
DEBUG:tensorflow:loss: 11.328489303588867
INFO:tensorflow:Restoring parameters from /Users/timodenk/.models/tiny_imagenet_alp05_2018_06_26.ckpt
INFO:tensorflow:Model loaded from /Users/timodenk/.models/tiny_imagenet_alp05_2018_06_26.ckpt
DEBUG:tensorflow:loss: 11.766935348510742
INFO:tensorflow:Writing custom summary object to '../tf_logs/train'
DEBUG:tensorflow:[[ 1.2112676e-04 -4.0615603e-05]
 [-6.1216939e-05  1.4470359e-04]
 [-1.0199162e-04  2.1006863e-05]
 [-3.8992225e-06  1.8839956e-04]
 [-1.9535322e-05 -9.4587798e-05]
 [ 8.0312939e-06 -1.1369006e-04]
 [-2.9969631e-05  5.5452703e-05]
 [ 7.1249371e-05 -1.7472272e-05]]
DEBUG:tensorflow:loss: 11.426416397094727
INFO:tensorflow:Writing custom summary object to '../tf_logs/train'
DEBUG:tensorflow:[[ 1.9868946e-04 -3.7503560e-07]
 [-1.9458970e-05  1.8465455e-04]
 [-6.3484520e-05  5.9971448e-05]
 [ 5.3674776e-05  2.7121359e-04]
 [ 2.3566407e-05 -5.5729455e-05]
 [ 5.1827436e-05 -8.2223574e-05]
 [ 1.3246559e-05  9.4329225e-05]
 [ 1.1435178e-04  2.1445887e-05]]
DEBUG:tensorflow:loss: 11.425291061401367
INFO:tensorflow:Writing custom summary object to '../tf_logs/train'
DEBUG:tensorflow:[[ 2.7538859e-04  3.9786079e-05]
 [ 2.2343800e-05  2.2417982e-04]
 [-2.5823792e-05  9.9169200e-05]
 [ 1.0993513e-04  3.5509333e-04]
 [ 6.6092754e-05 -1.6598289e-05]
 [ 9.4845207e-05 -5.0330091e-05]
 [ 5.5835822e-05  1.3341877e-04]
 [ 1.5684891e-04  6.0576735e-05]]
DEBUG:tensorflow:loss: 11.481734275817871
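
The closeness of the two runs can also be checked numerically. Below is a small standalone check with the third embedding matrix logged by each run (values copied from the debug output above); the absolute tolerance is deliberately loose, since the runs are only expected to be close, not identical.

```python
import numpy as np

# Third logged embedding matrix of each of the two runs above.
emb_a = np.array([[ 2.7571080e-04,  3.9752762e-05],
                  [ 2.2588189e-05,  2.2416531e-04],
                  [-2.5286132e-05,  9.9098761e-05],
                  [ 1.1023581e-04,  3.5502610e-04],
                  [ 6.6386943e-05, -1.6663031e-05],
                  [ 9.5182186e-05, -5.0529168e-05],
                  [ 5.6122793e-05,  1.3335870e-04],
                  [ 1.5713921e-04,  6.0510451e-05]])
emb_b = np.array([[ 2.7538859e-04,  3.9786079e-05],
                  [ 2.2343800e-05,  2.2417982e-04],
                  [-2.5823792e-05,  9.9169200e-05],
                  [ 1.0993513e-04,  3.5509333e-04],
                  [ 6.6092754e-05, -1.6598289e-05],
                  [ 9.4845207e-05, -5.0330091e-05],
                  [ 5.5835822e-05,  1.3341877e-04],
                  [ 1.5684891e-04,  6.0576735e-05]])

print(np.abs(emb_a - emb_b).max())           # ~5e-07
print(np.allclose(emb_a, emb_b, atol=1e-6))  # True
```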