Simsso closed this issue 5 years ago.
The implementation was validated by running it with different configurations on the CPU; all random seeds are specified. The log output was expected to be identical; however, it is only similar.
batch_size=8, virtual_batch_size_factor=1
DEBUG:tensorflow:0.0
DEBUG:tensorflow:0.0
DEBUG:tensorflow:0.0
DEBUG:tensorflow:0.0
DEBUG:tensorflow:0.0
DEBUG:tensorflow:8.706924
DEBUG:tensorflow:0.165
DEBUG:tensorflow:0.0
DEBUG:tensorflow:0.0
DEBUG:tensorflow:0.0
DEBUG:tensorflow:0.0
DEBUG:tensorflow:[ 7826.42 8236.92 564.34 165.1 342.76 11628.6 1004.7 2999.16]
DEBUG:tensorflow:0.0
DEBUG:tensorflow:8.916469
DEBUG:tensorflow:0.12
batch_size=4, virtual_batch_size_factor=2
DEBUG:tensorflow:0.0
DEBUG:tensorflow:0.0
DEBUG:tensorflow:0.0
DEBUG:tensorflow:0.0
DEBUG:tensorflow:0.0
DEBUG:tensorflow:9.405837
DEBUG:tensorflow:0.11
DEBUG:tensorflow:0.0
DEBUG:tensorflow:0.0
DEBUG:tensorflow:0.0
DEBUG:tensorflow:0.0
DEBUG:tensorflow:[3870.42 4101.34 308.4 84.58 177.26 5830.18 465.78 1546.04]
DEBUG:tensorflow:0.0
DEBUG:tensorflow:9.426175
Setting a seed for the input pipeline (data = data.shuffle(buffer_size=self.__get_num_samples(mode), seed=15092017)) did not make the delta vanish either.
Now averaging the gradients rather than just summing them up, following these notes. I cannot reproduce identical results on the CPU even without any changes to the hyperparameters, so for now it must suffice that the results of the following three configurations are similar. This might be related to the fact that TensorFlow has both global and operation-level random seeds.
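The averaging step can be sketched in NumPy (a hypothetical illustration, not the repository's code; `factor` stands in for `virtual_batch_size_factor`):

```python
import numpy as np

# Hypothetical sketch of the accumulation step: per-micro-batch gradients
# are summed into a buffer and divided by the accumulation factor before
# the update is applied, instead of applying the raw sum.
np.random.seed(15092017)
factor = 8                                  # virtual_batch_size_factor
grads = [np.random.randn(3) for _ in range(factor)]

summed = np.sum(grads, axis=0)
averaged = summed / factor                  # average instead of plain sum
```

Averaging keeps the magnitude of the update independent of the accumulation factor, so the learning rate does not have to be rescaled when the factor changes.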
batch_size=1, virtual_batch_size_factor=8
DEBUG:tensorflow:0.025
DEBUG:tensorflow:0.0
DEBUG:tensorflow:0.05
DEBUG:tensorflow:0.025
DEBUG:tensorflow:0.0
DEBUG:tensorflow:10.77883
DEBUG:tensorflow:0.0
DEBUG:tensorflow:0.05
DEBUG:tensorflow:0.0
DEBUG:tensorflow:0.0
DEBUG:tensorflow:0.0
DEBUG:tensorflow:[1023.44 955.54 68.96 19.96 39.22 1493.86 85.08 409.94]
DEBUG:tensorflow:0.0
DEBUG:tensorflow:10.791368
batch_size=2, virtual_batch_size_factor=4
DEBUG:tensorflow:0.025
DEBUG:tensorflow:0.0
DEBUG:tensorflow:0.025
DEBUG:tensorflow:0.0
DEBUG:tensorflow:0.0
DEBUG:tensorflow:10.199922
DEBUG:tensorflow:0.02
DEBUG:tensorflow:0.0
DEBUG:tensorflow:0.0
DEBUG:tensorflow:0.0
DEBUG:tensorflow:0.0
DEBUG:tensorflow:[1992.36 2000.1 140.34 36.26 80.28 2929.6 229.14 783.92]
DEBUG:tensorflow:0.0
DEBUG:tensorflow:9.939659
batch_size=8, virtual_batch_size_factor=1
DEBUG:tensorflow:0.0
DEBUG:tensorflow:0.0
DEBUG:tensorflow:0.0
DEBUG:tensorflow:0.0
DEBUG:tensorflow:0.0
DEBUG:tensorflow:8.620282
DEBUG:tensorflow:0.14
DEBUG:tensorflow:0.0
DEBUG:tensorflow:0.0
DEBUG:tensorflow:0.0
DEBUG:tensorflow:0.0
DEBUG:tensorflow:[ 7850.64 8113.28 606.38 143.86 341.58 11574.56 989.28 3148.42]
DEBUG:tensorflow:0.0
DEBUG:tensorflow:8.725957
Updating the accumulators to compute the correct mean (https://github.com/Simsso/NIPS-2018-Adversarial-Vision-Challenge/commit/01343e924d21bd416c05a361ff1a4b9f9cf2c5ff).
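The corrected-mean accumulator can be sketched as follows (a hypothetical NumPy sketch with illustrative names, not the code from the linked commit): the gradients are added into a buffer, and the division by the number of accumulated micro-batches happens only once, when the update is applied.

```python
import numpy as np

# Hypothetical accumulator sketch: sum incoming gradients and divide by the
# number of accumulated micro-batches at update time, so the applied
# gradient is the correct mean regardless of the accumulation factor.
class GradientAccumulator:
    def __init__(self, shape):
        self.buffer = np.zeros(shape)
        self.count = 0

    def add(self, grad):
        self.buffer += grad
        self.count += 1

    def mean_and_reset(self):
        mean = self.buffer / self.count
        self.buffer[:] = 0.0
        self.count = 0
        return mean
```

Resetting both the buffer and the counter after each update keeps successive virtual batches independent of each other.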
batch_size=1, virtual_batch_size_factor=16
DEBUG:tensorflow:0.017307693
DEBUG:tensorflow:0.0078125
DEBUG:tensorflow:0.009375
DEBUG:tensorflow:0.0140625
DEBUG:tensorflow:0.0109375
DEBUG:tensorflow:10.768374
DEBUG:tensorflow:0.0025974025
DEBUG:tensorflow:0.0109375
DEBUG:tensorflow:0.0125
DEBUG:tensorflow:0.0140625
DEBUG:tensorflow:0.0078125
DEBUG:tensorflow:[15323.34 16056.18 1180.34 305.54 679.72 22653. 1986.98 6122.1 ]
DEBUG:tensorflow:0.00625
DEBUG:tensorflow:10.764653
DEBUG:tensorflow:0.0025
batch_size=2, virtual_batch_size_factor=8
DEBUG:tensorflow:0.0
DEBUG:tensorflow:0.0
DEBUG:tensorflow:0.0
DEBUG:tensorflow:0.0
DEBUG:tensorflow:0.0
DEBUG:tensorflow:10.100628
DEBUG:tensorflow:0.051813472
DEBUG:tensorflow:0.0
DEBUG:tensorflow:0.0
DEBUG:tensorflow:0.0
DEBUG:tensorflow:0.0
DEBUG:tensorflow:[15231.2 16149.54 1166.62 313.34 666.98 22715.8 1992.86 6152.78]
DEBUG:tensorflow:0.0
DEBUG:tensorflow:10.145672
DEBUG:tensorflow:0.04
batch_size=16, virtual_batch_size_factor=1
DEBUG:tensorflow:0.0
DEBUG:tensorflow:0.0
DEBUG:tensorflow:0.0
DEBUG:tensorflow:0.0
DEBUG:tensorflow:0.0
DEBUG:tensorflow:8.341681
DEBUG:tensorflow:0.15
DEBUG:tensorflow:0.0
DEBUG:tensorflow:0.0
DEBUG:tensorflow:0.0
DEBUG:tensorflow:0.0
DEBUG:tensorflow:[15624.3 16319.12 1195.44 309.24 689.42 23128.82 2012.88 6256.78]
DEBUG:tensorflow:0.0
DEBUG:tensorflow:8.4202175
DEBUG:tensorflow:0.17
Testing the new feature with more realistic values, i.e. a virtual batch size of 512.
Getting nan losses, so there is still something broken.
The nan loss was caused by the Coulomb loss in combination with batch sizes < 3. This is yet to be investigated, but it is not an issue with the accumulation. Verification of the method by comparing the embedding spaces of two runs with 2/64 and 1/128 (batch size / factor):
INFO:tensorflow:Restoring parameters from /Users/timodenk/.models/tiny_imagenet_alp05_2018_06_26.ckpt
INFO:tensorflow:Model loaded from /Users/timodenk/.models/tiny_imagenet_alp05_2018_06_26.ckpt
DEBUG:tensorflow:loss: 10.947551727294922
INFO:tensorflow:Writing custom summary object to '../tf_logs/train'
DEBUG:tensorflow:[[ 1.2112676e-04 -4.0615603e-05]
[-6.1216939e-05 1.4470359e-04]
[-1.0199162e-04 2.1006863e-05]
[-3.8992225e-06 1.8839956e-04]
[-1.9535322e-05 -9.4587798e-05]
[ 8.0312939e-06 -1.1369006e-04]
[-2.9969631e-05 5.5452703e-05]
[ 7.1249371e-05 -1.7472272e-05]]
DEBUG:tensorflow:loss: 11.144996643066406
INFO:tensorflow:Writing custom summary object to '../tf_logs/train'
DEBUG:tensorflow:[[ 1.9875291e-04 -4.2323518e-07]
[-1.9375242e-05 1.8461425e-04]
[-6.3313651e-05 5.9905855e-05]
[ 5.3762626e-05 2.7118195e-04]
[ 2.3648290e-05 -5.5787561e-05]
[ 5.1927189e-05 -8.2355669e-05]
[ 1.3328043e-05 9.4276766e-05]
[ 1.1443735e-04 2.1390109e-05]]
DEBUG:tensorflow:loss: 11.095342636108398
INFO:tensorflow:Writing custom summary object to '../tf_logs/train'
DEBUG:tensorflow:[[ 2.7571080e-04 3.9752762e-05]
[ 2.2588189e-05 2.2416531e-04]
[-2.5286132e-05 9.9098761e-05]
[ 1.1023581e-04 3.5502610e-04]
[ 6.6386943e-05 -1.6663031e-05]
[ 9.5182186e-05 -5.0529168e-05]
[ 5.6122793e-05 1.3335870e-04]
[ 1.5713921e-04 6.0510451e-05]]
DEBUG:tensorflow:loss: 11.328489303588867
INFO:tensorflow:Restoring parameters from /Users/timodenk/.models/tiny_imagenet_alp05_2018_06_26.ckpt
INFO:tensorflow:Model loaded from /Users/timodenk/.models/tiny_imagenet_alp05_2018_06_26.ckpt
DEBUG:tensorflow:loss: 11.766935348510742
INFO:tensorflow:Writing custom summary object to '../tf_logs/train'
DEBUG:tensorflow:[[ 1.2112676e-04 -4.0615603e-05]
[-6.1216939e-05 1.4470359e-04]
[-1.0199162e-04 2.1006863e-05]
[-3.8992225e-06 1.8839956e-04]
[-1.9535322e-05 -9.4587798e-05]
[ 8.0312939e-06 -1.1369006e-04]
[-2.9969631e-05 5.5452703e-05]
[ 7.1249371e-05 -1.7472272e-05]]
DEBUG:tensorflow:loss: 11.426416397094727
INFO:tensorflow:Writing custom summary object to '../tf_logs/train'
DEBUG:tensorflow:[[ 1.9868946e-04 -3.7503560e-07]
[-1.9458970e-05 1.8465455e-04]
[-6.3484520e-05 5.9971448e-05]
[ 5.3674776e-05 2.7121359e-04]
[ 2.3566407e-05 -5.5729455e-05]
[ 5.1827436e-05 -8.2223574e-05]
[ 1.3246559e-05 9.4329225e-05]
[ 1.1435178e-04 2.1445887e-05]]
DEBUG:tensorflow:loss: 11.425291061401367
INFO:tensorflow:Writing custom summary object to '../tf_logs/train'
DEBUG:tensorflow:[[ 2.7538859e-04 3.9786079e-05]
[ 2.2343800e-05 2.2417982e-04]
[-2.5823792e-05 9.9169200e-05]
[ 1.0993513e-04 3.5509333e-04]
[ 6.6092754e-05 -1.6598289e-05]
[ 9.4845207e-05 -5.0330091e-05]
[ 5.5835822e-05 1.3341877e-04]
[ 1.5684891e-04 6.0576735e-05]]
DEBUG:tensorflow:loss: 11.481734275817871
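As an aside on the nan losses mentioned above: the issue reports failures for batch sizes < 3, and one plausible mechanism of this kind (a hypothetical illustration, not the repository's Coulomb loss) is a pairwise loss normalized by the number of ordered pairs, n * (n - 1), which divides by zero once a micro-batch holds fewer than two samples.

```python
import numpy as np

# Hypothetical pairwise ("Coulomb"-style) loss: normalizing by the number
# of ordered pairs n * (n - 1) yields 0 / 0 = nan for a single-sample batch.
def pairwise_loss(x):
    n = x.shape[0]
    diffs = x[:, None, :] - x[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    return dists.sum() / (n * (n - 1))

with np.errstate(invalid="ignore"):
    single = pairwise_loss(np.zeros((1, 4)))  # 0 / 0 -> nan
```

The sqrt at zero distance is another classic nan source during backpropagation; whether either mechanism matches the actual Coulomb loss here would need to be checked against the repository.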
Since our batch size is very limited right now (due to the inefficient memory usage of the VQ layer, as described in #58), we should implement a "virtual batch size", i.e. accumulate the gradients of multiple batches and apply them at once. That way we could specify a compute_batch_size and an update_batch_size, where the former depends on the GPU memory and the latter on our preference. This SO answer contains relevant information.
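The proposed split can be sketched as follows (a minimal NumPy sketch; compute_batch_size and update_batch_size are the names proposed above, while grad_fn is a hypothetical stand-in for the real per-batch gradient computation):

```python
import numpy as np

# Hypothetical sketch: process the virtual batch in memory-sized chunks of
# compute_batch_size samples and average the chunk gradients, so the update
# behaves like one pass over the full update_batch_size batch.
def grad_fn(batch):
    return batch.mean(axis=0)  # stand-in for the real gradient computation

def virtual_batch_grad(data, compute_batch_size):
    chunks = [data[i:i + compute_batch_size]
              for i in range(0, len(data), compute_batch_size)]
    return np.mean([grad_fn(c) for c in chunks], axis=0)
```

With equally sized chunks this reproduces the full-batch gradient exactly; a ragged final chunk would additionally need weighting by its sample count.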
Development takes place in the gradient-accumulation branch.