JuliaArrays / ArrayViews.jl

A Julia package to explore a new system of array views
MIT License

Performance Regression with release-0.4 #41

Closed JaredCrean2 closed 6 years ago

JaredCrean2 commented 8 years ago

I did some benchmarks here and found a factor-of-2 slowdown using release-0.4 compared to an older build of 0.4 (both run with --check-bounds=no).

with the old version:

warming up
 183.968 milliseconds
double loop @time printed above
 146.824 milliseconds
ArrayView @time printed above
 146.653 milliseconds
callable object @time printed above
Final testing: 

 196.033 milliseconds
double loop @time printed above
 147.843 milliseconds
ArrayView @time printed above
 146.399 milliseconds
callable object @time printed above

with release-0.4

warming up
  0.184568 seconds
double loop @time printed above
  0.289675 seconds
ArrayView @time printed above
  0.298988 seconds
callable object @time printed above
Final testing: 

  0.183772 seconds
double loop @time printed above
  0.299504 seconds
ArrayView @time printed above
  0.312451 seconds
callable object @time printed above

The old version info

Julia Version 0.4.0-dev+5149
Commit 317a4d1* (2015-06-01 18:58 UTC)
Platform Info:
  System: Linux (x86_64-linux-gnu)
  CPU: Intel(R) Core(TM) i7 CPU       Q 820  @ 1.73GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Nehalem)
  LAPACK: libopenblas
  LIBM: libopenlibm
  LLVM: libLLVM-3.3

The new version info

Julia Version 0.4.3-pre+6
Commit adffe19* (2015-12-11 00:38 UTC)
Platform Info:
  System: Linux (x86_64-linux-gnu)
  CPU: Intel(R) Core(TM) i7 CPU       Q 820  @ 1.73GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Nehalem)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.3

The slowdown occurs when using ArrayViews, which is also used for the callable object test.

andreasnoack commented 8 years ago

It would be very useful if you could git bisect to the commit that causes this slowdown.

JaredCrean2 commented 8 years ago

Working on it. For some reason I can't build older versions of julia against newer versions of the dependencies in some cases. I'll have to do a full build each time, so this may take a little while.

nalimilan commented 8 years ago

If you are on Fedora/RHEL/Centos, or if you can use a VM with one of these distros, you can use the RPM nightlies to go back in time. Should be much faster.

JaredCrean2 commented 8 years ago

There appear to be (at least) two causes. Found one so far:

6457fd3e24fafe65284be7200104a42073c06fa0 is the first bad commit
commit 6457fd3e24fafe65284be7200104a42073c06fa0
Author: Jameson Nash <vtjnash@gmail.com>
Date:   Wed Sep 16 20:15:07 2015 -0400

    in codegen, use StructRet where appropriate

    it seems that the llvm 3.3 return type legalizer may not be properly
    handling large types. this is more efficient anyways.

    fixes #8932, should fix #12163, probably fixes #7434

    (cherry picked from commit 13c83db372527c9c489d751cfa3bd061f8ecd5f0)

:040000 040000 1fecf3eae7e2f50d0ef8651e0cdacc15aba85715 2b1ee880f636763ea0291ece4eea26b78386ef4e M  src
:040000 040000 5468e213495a630a47d837e820dee6c689d0dcc1 cfda1b31bd34e5f632dfcb51a862bcff92b923e6 M  test
bisect run success

This commit causes a slowdown for the ArrayViews test of 0.200 seconds -> 0.292 seconds. Once I find the cause of the 0.146 -> 0.200 seconds slowdown I will post it as well.

JaredCrean2 commented 8 years ago

Found another one, causing 0.146 -> 0.164 seconds slowdown:

dadf2439ee464378a4562e13bc8017859838bf4e is the first bad commit
commit dadf2439ee464378a4562e13bc8017859838bf4e
Author: Jameson Nash <vtjnash@gmail.com>
Date:   Tue Jun 9 20:10:43 2015 -0400

    store tuple and vector types to the stack eagerly

    fix #11187 (pass struct and tuple objects by stack pointer)
    fix #11450 (ccall emission was frobbing the stack)
    likely may fix #11026 and may fix #11003 (ref #10525) invalid stack-read on 32-bit

    this additionally changes the julia specSig calling convention to pass
    non-primitive types by pointer instead of by-value

    this additionally fixes a bug in gen_cfunction that could be exposed by
    turning off specSig

    this additionally moves the alloca calls in ccall (and other places) to
    the entry BasicBlock in the function, ensuring that llvm detects them as
    static allocations and moves them into the function prologue

    this additionally fixes some undefined behavior from changing
    a variable's size through a alloca-cast instead of zext/sext/trunc

    this additionally prepares for turning back on allocating tuples as vectors,
    since the gc now guarantees 16-byte alignment

    future work this makes possible:
     - create a function to replace the jlallocobj_func+init_bits_value call pair (to reduce codegen pressure)
     - allow moving pointers sometimes rather than always copying immutable data
     - teach the GC how it can re-use an existing pointer as a box

:040000 040000 d58dc65194a29d6b6fc925500b6e2c36e2f64ddb 29e6b0192b179341954c294a0e9758a259530d1a M  base
:040000 040000 0d6eb586fcf027f6b3b9f2f425e858a986cd3ffc c0e5b0e44ca807f3e45ee0531f0b5d243c3a2df7 M  src
:040000 040000 d161b92891db26482d38304dd11c73137b8024d8 f1c9480a2b53f01c59cc8c781fbc64063638d461 M  test
bisect run success

JaredCrean2 commented 8 years ago

Found a third one, 0.164 -> 0.242 seconds. I narrowed it down to one of two commits, either c1dc0ec or 9691225. I can't build the first one (julia gives an error while building int.jl), but the second one gives a time of 0.242 seconds. The git logs are:

commit 9691225
Author: David P. Sanders <dpsanders@gmail.com>
Date:   Tue Jun 30 18:40:42 2015 -0400

    Added tests for base/int.jl in new file test/int.jl

    Add tests for @big_str macro

    Added tests for @big_str macro

    Another @big_str test

    Add conversion tests of largest value and (largest+1) into a given type

    Removed comments that were for personal use

    Fix a 32-bit failure

    Another 32-bit problem

    Fixed error

commit c1dc0ec
Author: David P. Sanders <dpsanders@gmail.com>
Date:   Wed Jul 8 15:42:12 2015 -0400

    Remove duplicated definitions of + and < in base/int.jl

It seems some later commit decreased the time 0.242 -> 0.200. Looking for that now.

JaredCrean2 commented 8 years ago

Some changes made between 9691225 and 7908246 caused a time change of 0.242 -> 0.163.

Commit 7908246 caused 0.237 -> 0.164

commit 7908246
Author: Jeff Bezanson <jeff.bezanson@gmail.com>
Date:   Thu Jun 11 18:09:06 2015 -0400

    resolve all globals to GlobalRef very early

    part of #10403

Commit d8f7c21 then caused 0.164 -> 0.203.

commit d8f7c21
Author: Simon Kornblith <simon@simonster.com>
Date:   Fri Jul 10 17:24:53 2015 -0400

    Inline pointer(x, i)

    Fixes #12054

In summary, we have

- 317a4d1: elapsed time 0.146 (baseline)
- dadf243: 0.146 -> 0.164
- 9691225 (or c1dc0ec): 0.164 -> 0.235
- 7908246: 0.237 -> 0.164
- d8f7c21: 0.164 -> 0.202
- 6457fd3: 0.202 -> 0.289

andreasnoack commented 8 years ago

cc: @simonster, @vtjnash

andreasnoack commented 8 years ago

Thanks for the work with narrowing it down.

JaredCrean2 commented 8 years ago

Bump.

JaredCrean2 commented 8 years ago

cc: @simonster @vtjnash

JaredCrean2 commented 8 years ago

All the bisecting is done, so I was really hoping someone would look into this. It's holding me up from moving my lab's code onto release-0.4.

andreasnoack commented 8 years ago

Have you tried using slice or sub from Base? I think most people prefer them over this package. Performance issues with SubArray will probably be higher on the priority list for those who are able to fix problems like these.

JaredCrean2 commented 8 years ago

Looking at the benchmarks below (code here), there is room for improvement in the performance of SubArray compared to the safe ArrayView, but unsafe ArrayViews are still significantly faster than either of them. I think unsafe views were originally created because there was no way to match the performance of a double for loop using safe methods, so I don't expect SubArray will ever match the performance of unsafe ArrayViews. For taking small linear slices of arrays efficiently (which is easy to do in C), I think unsafe ArrayViews are the only option.
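The "unsafe view" idea above can be sketched in current Base syntax (`unsafe_wrap` here stands in for the 0.4-era mechanism; this is an illustration of the technique, not ArrayViews' actual implementation):

```julia
# Sketch: an "unsafe" contiguous view is essentially a pointer into the
# parent's memory with no bounds check and no GC reference to the parent --
# which is why it is cheap to construct, and why the caller must keep the
# parent alive for the view's lifetime.
A = collect(1.0:12.0)                         # parent array

GC.@preserve A begin                          # keep A rooted while v is used
    v = unsafe_wrap(Array, pointer(A, 5), 4)  # elements 5:8 of A, no copy
    v[1] = 99.0                               # writes through to A[5]
end

A[5]  # -> 99.0
```

The safe alternatives (SubArray, safe ArrayView) carry a reference to the parent and check bounds at construction, which is exactly the per-slice cost the benchmarks above measure.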

Old version of julia:

 179.332 milliseconds
double loop @time printed above
 144.578 milliseconds
unsafe ArrayView @time printed above
 403.553 milliseconds (18000 k allocations: 824 MB, 5.55% gc time)
safe ArrayView @time printed above
   4.819 seconds      (89997 k allocations: 2472 MB, 3.18% gc time)
slice @time printed above
 144.899 milliseconds
callable object @time printed above

New version of julia:

  0.178965 seconds
double loop @time printed above
  0.287776 seconds
unsafe ArrayView @time printed above
  0.395486 seconds (18.00 M allocations: 823.975 MB, 6.32% gc time)
safe ArrayView @time printed above
  1.065700 seconds (54.00 M allocations: 1.878 GB, 12.73% gc time)
slice @time printed above
  0.287788 seconds
callable object @time printed above

timholy commented 8 years ago

You probably know this, but your benchmark has little to do with indexing performance and everything to do with construction. For example:

julia> @profile func2a(q, F)

julia> Profile.print()
548 REPL.jl; anonymous; line: 92
 548 REPL.jl; eval_user_input; line: 62
  548 profile.jl; anonymous; line: 16
   6   /tmp/julia_tests/array_speed2/func1.jl; func2a; line: 36
   132 /tmp/julia_tests/array_speed2/func1.jl; func2a; line: 40
   135 /tmp/julia_tests/array_speed2/func1.jl; func2a; line: 41
   264 /tmp/julia_tests/array_speed2/func1.jl; func2a; line: 43
    78 /tmp/julia_tests/array_speed2/func1.jl; getEulerFlux; line: 145
    87 /tmp/julia_tests/array_speed2/func1.jl; getEulerFlux; line: 149
    30 /tmp/julia_tests/array_speed2/func1.jl; getEulerFlux; line: 150
    25 /tmp/julia_tests/array_speed2/func1.jl; getEulerFlux; line: 151
    20 /tmp/julia_tests/array_speed2/func1.jl; getEulerFlux; line: 152
    15 /tmp/julia_tests/array_speed2/func1.jl; getEulerFlux; line: 153
   1   /tmp/julia_tests/array_speed2/func1.jl; getEulerFlux; line: 145

julia> Profile.clear()

julia> @profile func3(q, F)

julia> Profile.print()
1507 REPL.jl; anonymous; line: 92
 1507 REPL.jl; eval_user_input; line: 62
  1507 profile.jl; anonymous; line: 16
   2   /tmp/julia_tests/array_speed2/func1.jl; func3; line: 56
   4   /tmp/julia_tests/array_speed2/func1.jl; func3; line: 57
   350 /tmp/julia_tests/array_speed2/func1.jl; func3; line: 58
    29  subarray.jl; _slice; line: 42
    235 subarray.jl; _slice; line: 43
     25 subarray.jl; _slice_unsafe; line: 65
     89 subarray.jl; _slice_unsafe; line: 81
     24 subarray.jl; _slice_unsafe; line: 438
    7   subarray.jl; _slice_unsafe; line: 65
    5   subarray.jl; _slice_unsafe; line: 81
   802 /tmp/julia_tests/array_speed2/func1.jl; func3; line: 59
    59  subarray.jl; _slice; line: 42
    659 subarray.jl; _slice; line: 43
     33  subarray.jl; _slice_unsafe; line: 65
     492 subarray.jl; _slice_unsafe; line: 81
     33  subarray.jl; _slice_unsafe; line: 438
    4   subarray.jl; _slice_unsafe; line: 65
    3   subarray.jl; _slice_unsafe; line: 81
   320 /tmp/julia_tests/array_speed2/func1.jl; func3; line: 60
    107 /tmp/julia_tests/array_speed2/func1.jl; getEulerFlux; line: 145
    107 /tmp/julia_tests/array_speed2/func1.jl; getEulerFlux; line: 149
    28  /tmp/julia_tests/array_speed2/func1.jl; getEulerFlux; line: 150
    30  /tmp/julia_tests/array_speed2/func1.jl; getEulerFlux; line: 151
    21  /tmp/julia_tests/array_speed2/func1.jl; getEulerFlux; line: 152
    12  /tmp/julia_tests/array_speed2/func1.jl; getEulerFlux; line: 153
   6   /tmp/julia_tests/array_speed2/func1.jl; getEulerFlux; line: 145
   9   subarray.jl; _slice; line: 42
   8   subarray.jl; _slice; line: 43

You can see that the part of the code in getEulerFlux shows nearly identical performance.

In constructing many small slices, it's definitely true that our performance is not where we want it to be. The best hope is that Julia's compiler will get good enough that it elides the construction of the object altogether, effectively generating the same code produced by the double-loop method. ArrayViews wins simply because the resulting structure is smaller than a SubArray.
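The construction-vs-indexing distinction can be seen in a minimal sketch (current Base `view` syntax; the function names are illustrative, not from the benchmark code). Both loops do identical arithmetic; the only difference is whether a view object is built on every outer iteration:

```julia
# Per-iteration view construction vs. indexing the parent array directly.
# If the compiler could fully elide the view object, these two functions
# would compile to the same inner loop.
function sum_with_views(A::Matrix{Float64})
    s = 0.0
    for j in 1:size(A, 2)
        col = view(A, :, j)       # one view object constructed per column
        for i in eachindex(col)
            s += col[i]
        end
    end
    return s
end

function sum_direct(A::Matrix{Float64})
    s = 0.0
    for j in 1:size(A, 2), i in 1:size(A, 1)
        s += A[i, j]              # no intermediate object at all
    end
    return s
end

A = rand(100, 100)
sum_with_views(A) == sum_direct(A)  # same result; only construction cost differs
```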

Also note that the "safe" view is not really that safe. For example:

julia> A = rand(3,4,5);

julia> v = view(A, :, 10, 1)
3-element ArrayViews.ContiguousView{Float64,1,Array{Float64,3}}:
 0.55868 
 0.14655 
 0.774752

julia> s = slice(A, :, 10, 1)
ERROR: BoundsError: attempt to access 3x4x5 Array{Float64,3}:
[:, :, 1] =
 0.421446  0.789993   0.859427  0.62459 
 0.286069  0.0124386  0.347475  0.814826
 0.598649  0.0719514  0.552666  0.605439

[:, :, 2] =
 0.440828  0.88299    0.632718  0.761264
 0.85906   0.0755092  0.643793  0.954605
 0.51425   0.822237   0.264295  0.202285

[:, :, 3] =
 0.465205   0.55868   0.712503   0.294909
 0.618646   0.14655   0.427141   0.52394 
 0.0494494  0.774752  0.0379797  0.120516

[:, :, 4] =
 0.738384  0.0899148  0.330653  0.25026 
 0.334319  0.917572   0.753121  0.101671
 0.874826  0.742816   0.659517  0.492652

[:, :, 5] =
 0.12097    0.59091   0.535681  0.423868
 0.0773069  0.304412  0.54288   0.473162
 0.463124   0.127348  0.959255  0.621535
  at index [Colon(),10,1]
 in throw_boundserror at abstractarray.jl:156
 in _internal_checkbounds at abstractarray.jl:176
 in checkbounds at abstractarray.jl:159
 in slice at subarray.jl:39

You get somewhat better performance by calling Base._slice_unsafe, which skips the bounds check.

If we force-inline construction, it looks like we could get another 10% (making the performance 1.5x that of the double-loop, on my machine), but I'm not certain it's worth it.

JaredCrean2 commented 8 years ago

That's interesting (I thought there was an extra pointer chase from using regular array indexing rather than unsafe_load, but it was actually from the object not getting elided). So it looks like there is hope for a safe, efficient view mechanism.

Should I file a separate issue for elision of small objects? It's mentioned in the Arraypocalypse Now issue, but it might be a separable task, and I have some code here that would make a decent test case.

What conditions produced slice being 1.5x the double loop? I added both slice and _unsafe_slice to the testing (new code pushed here), and I am seeing approximately 4x for _unsafe_slice (using --check-bounds=no):

  0.181120 seconds
double loop @time printed above
  0.292330 seconds
unsafe ArrayView @time printed above
  0.394143 seconds (18.00 M allocations: 823.975 MB, 5.10% gc time)
safe ArrayView @time printed above
  1.041210 seconds (54.00 M allocations: 1.878 GB, 8.14% gc time)
slice @time printed above
  0.717231 seconds (36.00 M allocations: 1.341 GB, 8.14% gc time)
unsafe slice @time printed above
  0.287299 seconds
callable object @time printed above

JaredCrean2 commented 8 years ago

Is it possible to run the benchmark suite on the two offending commits (even if they are significantly older than the benchmark suite itself)? I suspect they affect other parts of Julia, which might move them up on the priority list.

andreasnoack commented 6 years ago

Is this still relevant to keep open?

JaredCrean2 commented 6 years ago

This particular case has been fixed on Julia 0.6. I did find a different case on 0.6 where a function applied to an aview is 2x slower than when applied to a regular Array. The 2x slowdown is also present for Base.view:

using ArrayViews

function test{T,Tflx,Tres}(Q::AbstractArray{T, 2},
                           flux::AbstractArray{Tflx,3},
                           res::AbstractArray{Tres,3})

  for elem = 1:size(flux,3)
    for i = 1:size(Q, 2)
      for j = 1:size(Q, 1)
        for field = 1:size(flux,1)
          res[field,i,elem] += Q[j,i]*flux[field,j,elem]
        end
      end
    end
  end

  return nothing
end

numDofPerNode = 4
numNodesPerElement = 30
numEl = 5000

# input arrays
Q = rand(numNodesPerElement, numNodesPerElement)
flux = rand(numDofPerNode, numNodesPerElement, numEl)
# create second flux array with extra dimension, then make a view
flux2  = rand(numDofPerNode, numNodesPerElement, numEl, 2)
flux2_view = aview(flux2, :, :, :, 1)
flux3_view = Base.view(flux2, :, :, :, 1)

# output array
res = zeros(numDofPerNode, numNodesPerElement, numEl)

# warm up
@time test(Q, flux, res)
@time test(Q, flux2_view, res)
@time test(Q, flux3_view, res)

println("final timings:")
@time test(Q, flux, res)
@time test(Q, flux2_view, res)
@time test(Q, flux3_view, res)

Output:

creanj@Excelsior:/tmp$ julia -O3 ./tmp.jl 
  0.043399 seconds (5.89 k allocations: 306.742 KiB)
  0.076066 seconds (9.59 k allocations: 526.231 KiB)
  0.082924 seconds (30.89 k allocations: 1.579 MiB)
final timings:
  0.025075 seconds (4 allocations: 160 bytes)
  0.055673 seconds (4 allocations: 160 bytes)
  0.052890 seconds (4 allocations: 160 bytes)

EDIT: forgot versioninfo()

julia> versioninfo()
Julia Version 0.6.2
Commit d386e40* (2017-12-13 18:08 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Prescott)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.9.1 (ORCJIT, broadwell)

JaredCrean2 commented 6 years ago

The Base.view regression is fixed on master. Closing.