Closed: klensink closed this issue 6 years ago
Very hard to track this by email. How about tomorrow aft?
E
On Feb 21, 2018, at 7:08 PM, Keegan Lensink notifications@github.com wrote:
I found that the parallel version of the code was learning slower than the serial version, and often the loss function was increasing significantly. To track down what was going wrong I rolled way back and implemented the simplest form of synchronous parallelism.
In this current commit all I have done is split each batch, and then sum the derivatives as we discussed the other day. The shuffling is handled by the master and nothing is left to chance, so that we can be certain that each worker is doing half of the work that would have been done if we had just stayed on master. The network is generated with a fixed RNG seed, so the results are reproducible and randomness is not an issue.
The code in question
When I run this version of the code with only 1 worker, the derivative is calculated entirely on one worker, so it exactly matches test, the ground truth computed by the master.
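For clarity, here is a minimal, self-contained sketch of that check on a stand-in least-squares loss; the function names and the loss are hypothetical and not the actual Meganet code. Because this loss is a plain sum over examples, the summed half-batch gradients should match the full-batch gradient up to floating-point noise, which is what the residuals below measure:

```julia
using Distributed, LinearAlgebra
addprocs(2)

# Hypothetical stand-in: loss and its gradient w.r.t. theta on a batch (X, Y).
@everywhere loss_and_grad(theta, X, Y) = (sum(abs2, X*theta .- Y), 2*X'*(X*theta .- Y))

X, Y  = randn(64, 10), randn(64)   # one mini-batch of 64 examples
theta = randn(10)

# Serial ground truth: gradient over the full batch on the master.
_, dJ_serial = loss_and_grad(theta, X, Y)

# Parallel version: the master splits the batch, each worker handles its half,
# and the partial gradients are summed.
halves = [(X[1:32, :], Y[1:32]), (X[33:64, :], Y[33:64])]
parts  = pmap(b -> loss_and_grad(theta, b[1], b[2])[2], halves)
dJ_parallel = sum(parts)

# Relative mismatch against the ground truth, analogous to the residuals printed below.
println("Residual: ", norm(dJ_parallel - dJ_serial) / norm(dJ_serial))
```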
```
klensink:Meganet$ julia
Julia Version 0.6.2 (2017-12-13 18:08 UTC), official http://julialang.org/ release, x86_64-pc-linux-gnu

julia> include(Pkg.dir("Meganet")*"/examples/EResNN_CIFAR10.jl")
-- Neural Network --
nLayers:  10
nFeatIn:  3072
nFeatOut: 64
nTheta:   104656
SGD(maxEpochs=200,miniBatch=64,learningRate=0.01,momentum=0.9,nesterov=true,ADAM=false)
Using 1 workers...
Residual (1 workers) : 0.0
Residual (1 workers) : 0.0
Residual (1 workers) : 0.0
Residual (1 workers) : 0.0
Residual (1 workers) : 0.0
Residual (1 workers) : 0.0
Residual (1 workers) : 0.0
Residual (1 workers) : 0.0
1   2.02e+00   25.20   1.40e-01   2.05e+00   21.36
```

Now running the same job with two workers, the derivative is calculated in two parts which are summed, and the result is different from the ground truth.
```
klensink:Meganet$ julia -p 2
Julia Version 0.6.2 (2017-12-13 18:08 UTC), official http://julialang.org/ release, x86_64-pc-linux-gnu

julia> include(Pkg.dir("Meganet")*"/examples/EResNN_CIFAR10.jl")
-- Neural Network --
nLayers:  10
nFeatIn:  3072
nFeatOut: 64
nTheta:   104656
SGD(maxEpochs=200,miniBatch=64,learningRate=0.01,momentum=0.9,nesterov=true,ADAM=false)
Using 2 workers...
Residual (2 workers) : 0.06707617
Residual (2 workers) : 0.06697881
Residual (2 workers) : 0.042924166
Residual (2 workers) : 0.033434458
Residual (2 workers) : 0.029714787
Residual (2 workers) : 0.05732619
Residual (2 workers) : 0.043897737
Residual (2 workers) : 0.026612915
1   2.02e+00   27.34   2.57e-01   2.04e+00   24.27
```

@eldadHaber Have I misunderstood what we talked about yesterday? I thought for sure that this shouldn't be a problem.
Works for me, I'll be there at 2:30
Thanks
I should be able to make it as well!
Just for traceability: the root of this problem was BATCH NORM. I have since switched to using TV-norm or "instance" norm. More research is needed on the performance tradeoff and implications, but this fixes the parallelization problem.
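To make that concrete, here is a small sketch (standard Julia, not the Meganet implementation) of why batch norm breaks the batch-splitting scheme above: its normalization statistics are taken across the batch, so each worker's half produces different statistics than the full batch, and the summed derivatives no longer match the serial run. An instance-style norm uses only each example's own features, so splitting the batch changes nothing:

```julia
using Statistics

# Batch norm: normalize each feature using statistics taken across the batch.
batchnorm(X) = (X .- mean(X, dims=1)) ./ sqrt.(var(X, dims=1) .+ 1e-5)

# Instance-style norm: normalize each example using only its own features.
instancenorm(X) = (X .- mean(X, dims=2)) ./ sqrt.(var(X, dims=2) .+ 1e-5)

X = randn(64, 8)                   # batch of 64 examples, 8 features
A, B = X[1:32, :], X[33:64, :]     # the two halves sent to the workers

# Batch norm: normalizing the halves separately differs from normalizing the full batch.
println(maximum(abs.(vcat(batchnorm(A), batchnorm(B)) .- batchnorm(X))))        # clearly nonzero

# Instance norm: each example only sees itself, so the split has no effect.
println(maximum(abs.(vcat(instancenorm(A), instancenorm(B)) .- instancenorm(X))))  # ~0
```

This is consistent with the runs above: with one worker the statistics are the full-batch statistics, so the residual is exactly zero, while with two workers each half sees different statistics and the residual is nonzero.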