Solving linear systems will use multiple cores unless you call `Base.BLAS.set_num_threads` (if I remember the function name correctly).
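For reference, a minimal sketch of that call (the `Base.BLAS` name matches Julia 0.4/0.5; in Julia 1.x it is `LinearAlgebra.BLAS.set_num_threads`):

```julia
# Pin OpenBLAS to a single thread so linear-algebra calls do not
# oversubscribe the cores that the pmap worker processes are using.
Base.BLAS.set_num_threads(1)

# When running with -p, set it on every worker process as well.
@everywhere Base.BLAS.set_num_threads(1)
```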
As mentioned above, external calls from Julia to another library will not honor `-p`. Are you absolutely sure it is the same version of Optim?
> Are you absolutely sure it is the same version of Optim?
Yes, I use the same fork with both Julia 0.4 and 0.5. The performance degradation actually reproduces across a range of Optim commits from the last 2 months.
Trying to look into BLAS threads now...
```
julia> versioninfo()
Julia Version 0.5.0
Commit 3c9d753 (2016-09-19 18:14 UTC)
Platform Info:
  System: Linux (x86_64-linux-gnu)
  CPU: Intel(R) Core(TM) i7-4710MQ CPU @ 2.50GHz
  WORD_SIZE: 64
  BLAS: libopenblas (NO_LAPACK NO_LAPACKE DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: liblapack.so.3
  LIBM: libopenlibm
  LLVM: libLLVM-3.7.1 (ORCJIT, haswell)
```
Tried with `Base.BLAS.set_num_threads(1)` in Julia 0.5. Same results. The single-core run seems to stay single-core, so it must have been BLAS multithreading that loaded multiple cores. However, the multicore call stays as slow as before, even with `@everywhere Base.BLAS.set_num_threads(1)`.
Correction: Julia 0.5 also writes `WARNING: could not attach metadata for @simd loop.` Tried removing all `@simd` in the Optim sources; that speeds things up by a small factor, but it is still much slower than Julia 0.4 with the same simd warning.
Perhaps there is more to it than just forcing Optim to use a single worker core.
An interesting inverse scaling:
```
julia        2.494269 seconds (1.90 M allocations: 80.106 MB, 0.55% gc time)
julia -p 2  11.449032 seconds (1.20 k allocations: 140.594 KB)
julia -p 4  16.890558 seconds (1.22 k allocations: 153.734 KB)
julia -p 8  20.968143 seconds (1.25 k allocations: 158.594 KB)
```
Can you post some sample code that I can test with?
@amitmurthy Here is a simple mockup of what I am running: https://gist.github.com/multidis/660adbcb2e3b0df2e07436c75b5403a0
Note that in this particular case the list that `pmap` iterates over has only 3 elements, but in practice it may vary from 2 to thousands of elements. Run times with the actual, full code are also longer on v0.5 for larger lists (that observation is difficult to reproduce with a simplified mockup).
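For readers without the gist, here is a minimal sketch of the pattern, written loosely against the current Optim API. The objective and function body are made up; only the `frunmap` name is taken from the test script. (Julia 1.x additionally needs `using Distributed` for `pmap`/`@everywhere`.)

```julia
# Hypothetical mockup: pmap over a short list, with an Optim
# minimization inside the mapped function.
@everywhere using Optim

@everywhere function frunmap(a)
    f(x) = (x[1] - a)^2 + (x[2] + a)^2        # toy objective
    res = optimize(f, [0.0, 0.0], NelderMead())
    return Optim.minimum(res)
end

list = [1.0, 2.0, 3.0]       # 3 elements here; 2 to 1000s in practice
@time pmap(frunmap, list)    # first call: includes compilation
@time pmap(frunmap, list)    # second call: steady-state timing
```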
Julia v0.4 runs (second `@time` call):
```
~/julia04 pmap_optim_time.jl
WARNING: could not attach metadata for @simd loop.
  0.006862 seconds (21.71 k allocations: 8.112 MB)

~/julia04 -p 8 pmap_optim_time.jl
WARNING: could not attach metadata for @simd loop.
WARNING: could not attach metadata for @simd loop.
WARNING: could not attach metadata for @simd loop.
  0.007946 seconds (2.47 k allocations: 7.827 MB)
```
Julia v0.5 runs (second `@time` call):
```
julia pmap_optim_time.jl
WARNING: could not attach metadata for @simd loop.
  0.008386 seconds (14.66 k allocations: 7.944 MB)

julia -p 8 pmap_optim_time.jl
WARNING: could not attach metadata for @simd loop.
WARNING: could not attach metadata for @simd loop.
WARNING: could not attach metadata for @simd loop.
(first call results)
WARNING: could not attach metadata for @simd loop.
WARNING: could not attach metadata for @simd loop.
WARNING: could not attach metadata for @simd loop.
  4.073709 seconds (2.31 k allocations: 7.731 MB)
```
Note also that the simd warnings are generated on the second function call as well, unlike the v0.4 behaviour.
Version specifics:
```
~/julia04 -e "versioninfo()"
Julia Version 0.4.7
Commit ae26b25 (2016-09-18 16:17 UTC)
Platform Info:
  System: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i7-4710MQ CPU @ 2.50GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64
  LIBM: libopenlibm
  LLVM: libLLVM-3.3
```

```
julia -e "versioninfo()"
Julia Version 0.5.0
Commit 3c9d753 (2016-09-19 18:14 UTC)
Platform Info:
  System: Linux (x86_64-linux-gnu)
  CPU: Intel(R) Core(TM) i7-4710MQ CPU @ 2.50GHz
  WORD_SIZE: 64
  BLAS: libopenblas (NO_LAPACK NO_LAPACKE DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: liblapack.so.3
  LIBM: libopenlibm
  LLVM: libLLVM-3.7.1 (ORCJIT, haswell)
```
The numbers above are the first-time compilation overhead. I just added a few more runs of `frunmap` to the test script.
`Ntimes=3`:

```
julia -p 8 --depwarn=no pmap_optim_time.jl
Any[5.60283e-17,5.60283e-17,5.60283e-17]
  5.641841 seconds (5.99 M allocations: 262.483 MB, 3.54% gc time)
  2.350770 seconds (2.25 k allocations: 7.729 MB, 0.26% gc time)
  2.305116 seconds (1.70 k allocations: 7.715 MB)
  0.019322 seconds (620 allocations: 7.688 MB)
  0.024080 seconds (629 allocations: 7.689 MB, 21.92% gc time)
  0.015221 seconds (619 allocations: 7.688 MB)
```
`Ntimes=8`:

```
julia -p 8 --depwarn=no pmap_optim_time.jl
Any[5.60283e-17,5.60283e-17,5.60283e-17,5.60283e-17,5.60283e-17,5.60283e-17,5.60283e-17,5.60283e-17]
  7.750557 seconds (5.99 M allocations: 262.648 MB, 3.01% gc time)
  0.029905 seconds (1.44 k allocations: 7.766 MB, 19.75% gc time)
  0.014826 seconds (1.46 k allocations: 7.768 MB)
  0.014171 seconds (1.44 k allocations: 7.767 MB)
  0.019212 seconds (1.48 k allocations: 7.769 MB, 25.70% gc time)
  0.012541 seconds (1.45 k allocations: 7.767 MB)
```
In 0.5 all the workers are used in a round-robin fashion. With `Ntimes=3`, the first run primes workers 1-3, the second 4-6, and the third 7-8. Only from the fourth run onwards do you see the optimized timings. With `Ntimes=8`, the first run itself warms up all the workers.

In 0.4 only the first 3 workers are used every time, hence you see fast run times from the second run onwards.
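One way to account for this when benchmarking (a sketch, reusing the `frunmap`/`list` names from the mockup above): run the mapped function once on every worker before timing, so the timed calls exclude per-worker JIT compilation.

```julia
# Prime every worker once so later timed pmap calls measure steady state.
warmup = fill(first(list), nworkers())
pmap(frunmap, warmup)

@time pmap(frunmap, list)   # now excludes per-worker compilation overhead
```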
> Run times with the actual, full code are longer in the case of v0.5 for larger lists as well (that observation is difficult to reproduce with a simplified mockup code).
How much longer? There is a slight slowdown due to changes in closure serialization. A CachingPool (http://docs.julialang.org/en/latest/stdlib/parallel/#Base.CachingPool) may help if that is the cause.
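For illustration, a minimal sketch of that suggestion, again reusing the `frunmap`/`list` names from the mockup (in Julia 1.x these names live in the `Distributed` stdlib):

```julia
# A CachingPool sends the mapped function (and anything it captures) to
# each worker once and caches it, instead of re-serializing it per call.
wp = CachingPool(workers())
results = pmap(wp, frunmap, list)
```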
> Only from the fourth run onwards do you see the optimized timings.
This explains a lot of my confusion, thanks @amitmurthy
After extensive testing taking this into account, the Julia 0.5 slowdown (compared to 0.4) is about 5 to 10% extra time. Interestingly, the memory allocation is half of what it was in 0.4. Win on memory, lose on closures, I guess... `CachingPool` does not seem to affect run times.
I guess I'll stay with 0.4 in production to avoid additional computing costs, for as long as package dependencies work with 0.4. Closing the issue since apparently nothing specific to Optim.jl was found.
Hm, you say closures... Is this related to https://github.com/JuliaLang/julia/issues/15276 as well? Maybe when that is fixed, performance will make v0.5 interesting for you again.
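For context, julia#15276 is about captured variables being boxed when they are reassigned, which costs closure performance. A minimal illustration (my own example, not taken from the issue):

```julia
# `x` is reassigned after being captured, so the closure stores it in a
# Box and the compiler loses its type: calls through the closure slow down.
function make_slow()
    x = 1.0
    x = x + 1.0
    return () -> x * 2.0
end

# Workaround: rebind with `let` so the closure captures a fresh binding
# that is assigned exactly once and stays concretely typed.
function make_fast()
    x = 1.0
    x = x + 1.0
    let x = x
        return () -> x * 2.0
    end
end
```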
Could be; frankly, I have not had a chance to dig into the 0.5 closure issues in detail. Will definitely re-time with each subsequent 0.5.* update.
Looking into another performance degradation, possibly related to the `MultivariateOptimizationResults{T,N}` to `MultivariateOptimizationResults{T,N,M}` change, but not sure yet, so keeping this closed for now.
Moving this to a new issue https://github.com/JuliaOpt/Optim.jl/issues/292
Use case: `pmap` with a function that uses Optim within its body.

Problem: comparing `@time` results for the pmap call between Julia 0.4 and Julia 0.5, the multicore pmap call is 18 times slower on Julia 0.5.

It appears that the problem is related to Optim.jl; other parts of the function body applied via `pmap` do not produce this slowdown. Starting Julia with even a single core (`julia`, not `julia -p 8`) and tracking CPU load, I noticed that the "single core" run still loads multiple CPU cores, intermittently. Not sure if this is some type of multi-threading in Julia 0.5, or something else that was disabled in 0.4.

Tried `fminbox` with `cg`; also tried unconstrained Nelder-Mead. Same problem. Tried removing all `@simd` in the Optim sources without much change. Note that Julia prints `WARNING: could not attach metadata for @simd loop.` (see https://github.com/JuliaOpt/Optim.jl/pull/157#issuecomment-234475692).

Any suggestions on how to force Optim to use a single worker core? Or did I misunderstand what is happening?
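For concreteness, a sketch of the two optimizer setups tried, written against the present-day Optim API (the 0.5-era `fminbox`/`cg` calls were spelled differently; the objective, starting point, and bounds here are made up):

```julia
using Optim

f(x) = sum(abs2, x .- 1.0)   # stand-in objective
x0 = zeros(2)

# Unconstrained Nelder-Mead
res_nm = optimize(f, x0, NelderMead())

# Box-constrained: Fminbox wrapping a conjugate-gradient inner optimizer
lower, upper = fill(-2.0, 2), fill(2.0, 2)
res_box = optimize(f, lower, upper, x0, Fminbox(ConjugateGradient()))
```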