XtractOpen / Meganet.jl

A fresh approach to deep learning written in Julia
http://www.xtract.ai/

Splitting up batches and summing the derivatives seems to be inaccurate #61

Closed klensink closed 6 years ago

klensink commented 6 years ago

I found that the parallel version of the code was learning slower than the serial version, and often the loss function was increasing significantly. To track down what was going wrong I rolled way back and implemented the simplest form of synchronous parallelism.

In the current commit, all I have done is split each batch and sum the derivatives, as we discussed the other day. The shuffling is handled by the master and nothing is left to chance, so we can be certain that each worker is doing half of the work that would have been done had we stayed on master. The network is generated with a fixed RNG seed, so the results are reproducible and randomness is not an issue.

The code in question
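Roughly, the scheme amounts to the following. This is only a hypothetical sketch, not the linked commit: `getMisfit` stands in for whatever routine returns the loss and derivative for one block of the batch (and is assumed to be defined on every worker), and the variable names are illustrative.

```julia
using Distributed, Random   # pmap and randperm live in Base on Julia 0.6

# Split one mini-batch across nw workers and sum the per-block derivatives.
# X holds one example per column, C the matching labels.
function parallelGradient(theta, X, C, nw)
    idx   = randperm(size(X, 2))                 # master controls the shuffle
    parts = [idx[i:nw:end] for i in 1:nw]        # one block of the batch per worker
    res   = pmap(p -> getMisfit(theta, X[:, p], C[:, p]), parts)
    loss  = sum(first.(res))                     # total loss over the batch
    dJ    = sum(last.(res))                      # sum of the per-block derivatives
    return loss, dJ
end
```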

When I run this version of the code with only 1 worker, the derivative is calculated entirely on that worker, so it exactly matches `test`, the ground truth computed by the master.

klensink:Meganet$ julia
               _
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: https://docs.julialang.org
   _ _   _| |_  __ _   |  Type "?help" for help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.6.2 (2017-12-13 18:08 UTC)
 _/ |\__'_|_|_|\__'_|  |  Official http://julialang.org/ release
|__/                   |  x86_64-pc-linux-gnu

julia> include(Pkg.dir("Meganet")*"/examples/EResNN_CIFAR10.jl")
-- Neural Network --
nLayers:     10
nFeatIn:     3072
nFeatOut:    64
nTheta:      104656
SGD(maxEpochs=200,miniBatch=64,learningRate=0.01,momentum=0.9,nesterov=true,ADAM=false)
Using 1 workers...
Residual (1 workers) : 0.0
Residual (1 workers) : 0.0
Residual (1 workers) : 0.0
Residual (1 workers) : 0.0
Residual (1 workers) : 0.0
Residual (1 workers) : 0.0
Residual (1 workers) : 0.0
Residual (1 workers) : 0.0
1   2.02e+00    25.20   1.40e-01    2.05e+00    21.36

When the same job runs with two workers, the derivative is calculated in two parts that are then summed, and the result differs from the ground truth.

klensink:Meganet$ julia -p 2
               _
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: https://docs.julialang.org
   _ _   _| |_  __ _   |  Type "?help" for help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.6.2 (2017-12-13 18:08 UTC)
 _/ |\__'_|_|_|\__'_|  |  Official http://julialang.org/ release
|__/                   |  x86_64-pc-linux-gnu

julia> include(Pkg.dir("Meganet")*"/examples/EResNN_CIFAR10.jl")
-- Neural Network --
nLayers:     10
nFeatIn:     3072
nFeatOut:    64
nTheta:      104656
SGD(maxEpochs=200,miniBatch=64,learningRate=0.01,momentum=0.9,nesterov=true,ADAM=false)
Using 2 workers...
Residual (2 workers) : 0.06707617
Residual (2 workers) : 0.06697881
Residual (2 workers) : 0.042924166
Residual (2 workers) : 0.033434458
Residual (2 workers) : 0.029714787
Residual (2 workers) : 0.05732619
Residual (2 workers) : 0.043897737
Residual (2 workers) : 0.026612915
1   2.02e+00    27.34   2.57e-01    2.04e+00    24.27
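For reference, the residual lines above compare the summed parallel derivative against the serial ground truth. Continuing the hypothetical sketch above, the check would look roughly like this (illustrative names only, not the actual code):

```julia
using Distributed, LinearAlgebra   # nworkers and norm live in Base on Julia 0.6

# Ground truth: the full-batch derivative computed serially on the master.
lossTest, dJtest = getMisfit(theta, X, C)

# Parallel version: split the batch and sum the per-worker derivatives.
lossPar, dJ = parallelGradient(theta, X, C, nworkers())

println("Residual ($(nworkers()) workers) : ", norm(dJ - dJtest) / norm(dJtest))
```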

@eldadHaber Have I misunderstood what we talked about yesterday? I thought for sure that this shouldn't be a problem.

eldadHaber commented 6 years ago

This is very hard to track by email. How about tomorrow afternoon?

E


klensink commented 6 years ago

Works for me, I'll be there at 2:30

Thanks

lruthotto commented 6 years ago

I should be able to make it as well!


jgranek commented 6 years ago

Just for traceability: the root of this problem was batch norm. We have since switched to using TV norm or "instance" norm. We still need to do more research on the performance tradeoffs and implications, but this fixes the parallelization problem.
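For the record, here is a minimal illustration (not Meganet code) of why batch norm breaks the split-and-sum scheme: the normalization statistics are computed over whatever batch each worker sees, so the halves are normalized differently than the full batch, and the per-worker derivatives no longer sum to the full-batch derivative. An instance-style norm, by contrast, normalizes each example on its own and is unaffected by how the batch is split.

```julia
using Statistics   # mean, var

# Per-feature batch normalization: statistics are taken across the batch dimension.
batchnorm(X) = (X .- mean(X, dims=2)) ./ sqrt.(var(X, dims=2) .+ 1f-5)

X      = randn(Float32, 4, 64)                               # one mini-batch of 64 examples
Yfull  = batchnorm(X)                                        # full-batch statistics
Ysplit = hcat(batchnorm(X[:, 1:32]), batchnorm(X[:, 33:64])) # each half uses its own statistics

println(maximum(abs.(Yfull - Ysplit)))   # nonzero: the split forward pass is a different computation
```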