Solving linear systems will use multiple cores unless you call `Base.BLAS.set_num_threads` (if I remember the function name correctly).
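For reference, a minimal sketch of that call (the `Base.BLAS` name matches Julia 0.4/0.5; in Julia 1.x it is `LinearAlgebra.BLAS.set_num_threads`):

```julia
# Pin OpenBLAS to a single thread so linear-algebra calls do not
# oversubscribe the cores that the pmap worker processes are using.
Base.BLAS.set_num_threads(1)

# When running with -p, set it on every worker process as well.
@everywhere Base.BLAS.set_num_threads(1)
```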
As mentioned above, external calls from Julia to another library will not honor `-p`. Are you absolutely sure it is the same version of Optim?
> Are you absolutely sure it is the same version of Optim?
Yes, I use the same fork with both Julia 0.4 and 0.5. The performance degradation actually reproduces across a range of Optim commits from the last 2 months.
Trying to look into BLAS threads now...
```
julia> versioninfo()
Julia Version 0.5.0
Commit 3c9d753 (2016-09-19 18:14 UTC)
Platform Info:
  System: Linux (x86_64-linux-gnu)
  CPU: Intel(R) Core(TM) i7-4710MQ CPU @ 2.50GHz
  WORD_SIZE: 64
  BLAS: libopenblas (NO_LAPACK NO_LAPACKE DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: liblapack.so.3
  LIBM: libopenlibm
  LLVM: libLLVM-3.7.1 (ORCJIT, haswell)
```
Tried with `Base.BLAS.set_num_threads(1)` in Julia 0.5. Same results. The single-core run seems to stay single-core, so it must have been BLAS multithreading that loaded multiple cores. However, the multicore call stays as slow as before, even with `@everywhere Base.BLAS.set_num_threads(1)`.
Correction: Julia 0.5 also writes `WARNING: could not attach metadata for @simd loop.` Tried removing all `@simd` in the Optim sources; that speeds things up by a small factor, but it is still much slower than Julia 0.4 with the same simd warning.
Perhaps there is more to it than just forcing Optim to use a single worker core.
An interesting inverse scaling:
```
julia        2.494269 seconds (1.90 M allocations: 80.106 MB, 0.55% gc time)
julia -p 2  11.449032 seconds (1.20 k allocations: 140.594 KB)
julia -p 4  16.890558 seconds (1.22 k allocations: 153.734 KB)
julia -p 8  20.968143 seconds (1.25 k allocations: 158.594 KB)
```
Can you post some sample code that I can test with?
@amitmurthy Here is a simple mockup of what I am running: https://gist.github.com/multidis/660adbcb2e3b0df2e07436c75b5403a0
Note that in this particular case the list that `pmap` iterates over has only 3 elements, but in practice it may vary from 2 to thousands of elements. Run times with the actual, full code are also longer on v0.5 for larger lists (that observation is difficult to reproduce with a simplified mockup).
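For readers without the gist, here is a minimal sketch of the pattern, written loosely against the current Optim API. The objective and function body are made up; only the `frunmap` name is taken from the test script. (Julia 1.x additionally needs `using Distributed` for `pmap`/`@everywhere`.)

```julia
# Hypothetical mockup: pmap over a short list, with an Optim
# minimization inside the mapped function.
@everywhere using Optim

@everywhere function frunmap(a)
    f(x) = (x[1] - a)^2 + (x[2] + a)^2        # toy objective
    res = optimize(f, [0.0, 0.0], NelderMead())
    return Optim.minimum(res)
end

list = [1.0, 2.0, 3.0]       # 3 elements here; 2 to 1000s in practice
@time pmap(frunmap, list)    # first call: includes compilation
@time pmap(frunmap, list)    # second call: steady-state timing
```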
Julia v0.4 runs (second `@time` call):
```
~/julia04 pmap_optim_time.jl
WARNING: could not attach metadata for @simd loop.
  0.006862 seconds (21.71 k allocations: 8.112 MB)

~/julia04 -p 8 pmap_optim_time.jl
WARNING: could not attach metadata for @simd loop.
WARNING: could not attach metadata for @simd loop.
WARNING: could not attach metadata for @simd loop.
  0.007946 seconds (2.47 k allocations: 7.827 MB)
```
Julia v0.5 runs (second `@time` call):
```
julia pmap_optim_time.jl
WARNING: could not attach metadata for @simd loop.
  0.008386 seconds (14.66 k allocations: 7.944 MB)

julia -p 8 pmap_optim_time.jl
WARNING: could not attach metadata for @simd loop.
WARNING: could not attach metadata for @simd loop.
WARNING: could not attach metadata for @simd loop.
(first call results)
WARNING: could not attach metadata for @simd loop.
WARNING: could not attach metadata for @simd loop.
WARNING: could not attach metadata for @simd loop.
  4.073709 seconds (2.31 k allocations: 7.731 MB)
```
Note also that the simd warnings are generated on the second function call as well, unlike the v0.4 behaviour.
Version specifics:
```
~/julia04 -e "versioninfo()"
Julia Version 0.4.7
Commit ae26b25 (2016-09-18 16:17 UTC)
Platform Info:
  System: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i7-4710MQ CPU @ 2.50GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64
  LIBM: libopenlibm
  LLVM: libLLVM-3.3
```

```
julia -e "versioninfo()"
Julia Version 0.5.0
Commit 3c9d753 (2016-09-19 18:14 UTC)
Platform Info:
  System: Linux (x86_64-linux-gnu)
  CPU: Intel(R) Core(TM) i7-4710MQ CPU @ 2.50GHz
  WORD_SIZE: 64
  BLAS: libopenblas (NO_LAPACK NO_LAPACKE DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: liblapack.so.3
  LIBM: libopenlibm
  LLVM: libLLVM-3.7.1 (ORCJIT, haswell)
```
The numbers above are the first-time compilation overhead. I just added a few more runs of `frunmap` to the test script.
`Ntimes=3`:

```
julia -p 8 --depwarn=no pmap_optim_time.jl
Any[5.60283e-17,5.60283e-17,5.60283e-17]
  5.641841 seconds (5.99 M allocations: 262.483 MB, 3.54% gc time)
  2.350770 seconds (2.25 k allocations: 7.729 MB, 0.26% gc time)
  2.305116 seconds (1.70 k allocations: 7.715 MB)
  0.019322 seconds (620 allocations: 7.688 MB)
  0.024080 seconds (629 allocations: 7.689 MB, 21.92% gc time)
  0.015221 seconds (619 allocations: 7.688 MB)
```
`Ntimes=8`:

```
julia -p 8 --depwarn=no pmap_optim_time.jl
Any[5.60283e-17,5.60283e-17,5.60283e-17,5.60283e-17,5.60283e-17,5.60283e-17,5.60283e-17,5.60283e-17]
  7.750557 seconds (5.99 M allocations: 262.648 MB, 3.01% gc time)
  0.029905 seconds (1.44 k allocations: 7.766 MB, 19.75% gc time)
  0.014826 seconds (1.46 k allocations: 7.768 MB)
  0.014171 seconds (1.44 k allocations: 7.767 MB)
  0.019212 seconds (1.48 k allocations: 7.769 MB, 25.70% gc time)
  0.012541 seconds (1.45 k allocations: 7.767 MB)
```
In 0.5 all the workers are used in a round-robin fashion. With `Ntimes=3`, the first run primes workers 1-3, the second 4-6, and the third 7-8. Only from the fourth run onwards do you see the optimized timings. With `Ntimes=8`, the first run itself warms up all the workers.

In 0.4 only the first 3 workers are used every time, hence you see fast run times from the second run onwards.
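One way to account for this when benchmarking (a sketch, reusing the `frunmap`/`list` names from the mockup above): run the mapped function once on every worker before timing, so the timed calls exclude per-worker JIT compilation.

```julia
# Prime every worker once so later timed pmap calls measure steady state.
warmup = fill(first(list), nworkers())
pmap(frunmap, warmup)

@time pmap(frunmap, list)   # now excludes per-worker compilation overhead
```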
> Run times with the actual, full code are longer in the case of v0.5 for larger lists as well (that observation is difficult to reproduce with a simplified mockup code).
How much longer? There is a slight slowdown due to changes in closure serialization. A CachingPool (http://docs.julialang.org/en/latest/stdlib/parallel/#Base.CachingPool) may help if that is the cause.
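For illustration, a minimal sketch of that suggestion, again reusing the `frunmap`/`list` names from the mockup (in Julia 1.x these names live in the `Distributed` stdlib):

```julia
# A CachingPool sends the mapped function (and anything it captures) to
# each worker once and caches it, instead of re-serializing it per call.
wp = CachingPool(workers())
results = pmap(wp, frunmap, list)
```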
> Only from the fourth run onwards do you see the optimized timings.
This explains a lot of my confusion, thanks @amitmurthy
After extensive testing taking this into account, the Julia 0.5 slowdown (compared to 0.4) is about 5 to 10% extra time. Interestingly, the memory allocation is half of what it was in 0.4. Win on memory, lose on closures, I guess... `CachingPool` does not seem to affect run times.
I guess I'll stay with 0.4 in production to avoid additional computing costs, for as long as package dependencies work with 0.4. Closing the issue since apparently nothing specific to Optim.jl was found.
Hm, you say closures... Is this related to https://github.com/JuliaLang/julia/issues/15276 as well? Maybe when that is fixed, performance will make v0.5 interesting for you again.
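For context, julia#15276 is about captured variables being boxed when they are reassigned, which costs closure performance. A minimal illustration (my own example, not taken from the issue):

```julia
# `x` is reassigned after being captured, so the closure stores it in a
# Box and the compiler loses its type: calls through the closure slow down.
function make_slow()
    x = 1.0
    x = x + 1.0
    return () -> x * 2.0
end

# Workaround: rebind with `let` so the closure captures a fresh binding
# that is assigned exactly once and stays concretely typed.
function make_fast()
    x = 1.0
    x = x + 1.0
    let x = x
        return () -> x * 2.0
    end
end
```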
Could be; frankly, I have not had a chance to dig into the 0.5 closure issues in detail. Will definitely re-time with each subsequent 0.5.* update.
Looking into another performance degradation, possibly related to the `MultivariateOptimizationResults{T,N}` to `MultivariateOptimizationResults{T,N,M}` change, but not sure yet, so keeping this closed for now.
Moving this to a new issue https://github.com/JuliaOpt/Optim.jl/issues/292
Use case: `pmap` with a function that uses Optim within its body.

Problem: comparing `@time` results for the pmap call between Julia 0.4 and Julia 0.5, the multicore pmap call is 18 times slower on Julia 0.5.

It appears that the problem is related to Optim.jl; other parts of the function body applied via `pmap` do not produce this slowdown. Starting Julia with even a single core (`julia`, not `julia -p 8`) and tracking CPU load, I noticed that the "single core" run still loads multiple CPU cores, intermittently. Not sure if this is some type of multi-threading in Julia 0.5, or something else that was disabled in 0.4.

Tried `fminbox` with `cg`; also tried unconstrained Nelder-Mead. Same problem. Tried removing all `@simd` in the Optim sources without much change. Note that Julia prints `WARNING: could not attach metadata for @simd loop.` (see https://github.com/JuliaOpt/Optim.jl/pull/157#issuecomment-234475692).

Any suggestions on how to force Optim to use a single worker core? Or did I misunderstand what is happening?
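For concreteness, a sketch of the two optimizer setups tried, written against the present-day Optim API (the 0.5-era `fminbox`/`cg` calls were spelled differently; the objective, starting point, and bounds here are made up):

```julia
using Optim

f(x) = sum(abs2, x .- 1.0)   # stand-in objective
x0 = zeros(2)

# Unconstrained Nelder-Mead
res_nm = optimize(f, x0, NelderMead())

# Box-constrained: Fminbox wrapping a conjugate-gradient inner optimizer
lower, upper = fill(-2.0, 2), fill(2.0, 2)
res_box = optimize(f, lower, upper, x0, Fminbox(ConjugateGradient()))
```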