It would be very useful if you could git bisect to a commit that causes this slowdown.
Working on it. For some reason I can't build older versions of julia using newer versions of the dependencies in some cases. I'll have to do a full build each time, so this may take a little while.
If you are on Fedora/RHEL/Centos, or if you can use a VM with one of these distros, you can use the RPM nightlies to go back in time. Should be much faster.
There appear to be (at least) two causes. Found one so far:
6457fd3e24fafe65284be7200104a42073c06fa0 is the first bad commit
commit 6457fd3e24fafe65284be7200104a42073c06fa0
Author: Jameson Nash <vtjnash@gmail.com>
Date: Wed Sep 16 20:15:07 2015 -0400
in codegen, use StructRet where appropriate
it seems that the llvm 3.3 return type legalizer may not be properly
handling large types. this is more efficient anyways.
fixes #8932, should fix #12163, probably fixes #7434
(cherry picked from commit 13c83db372527c9c489d751cfa3bd061f8ecd5f0)
:040000 040000 1fecf3eae7e2f50d0ef8651e0cdacc15aba85715 2b1ee880f636763ea0291ece4eea26b78386ef4e M src
:040000 040000 5468e213495a630a47d837e820dee6c689d0dcc1 cfda1b31bd34e5f632dfcb51a862bcff92b923e6 M test
bisect run success
This commit causes a slowdown for the ArrayViews test of 0.200 seconds -> 0.292 seconds. Once I find the cause of the 0.146 -> 0.200 seconds slowdown I will post it as well.
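For anyone repeating this, git bisect run only needs a command that exits 0 when a commit is good and nonzero (other than 125) when it is bad. Below is a minimal sketch of a timing driver in Julia; the benchmark file, the run_benchmark function, and the 0.18 s threshold are hypothetical, not taken from this thread.
# bisect_driver.jl -- invoke after rebuilding julia at each bisect step, e.g.
#   git bisect run sh -c 'make -j8 && ./julia bisect_driver.jl'
const THRESHOLD = 0.18                 # hypothetical cutoff between "good" and "bad" timings
include("array_views_benchmark.jl")    # hypothetical file defining run_benchmark()
run_benchmark()                        # warm-up run so compilation time is excluded
elapsed = @elapsed run_benchmark()
println("elapsed = ", elapsed, " s")
exit(elapsed < THRESHOLD ? 0 : 1)      # 0 => good commit, nonzero => bad commit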
Found another one, causing 0.146 -> 0.164 seconds slowdown:
dadf2439ee464378a4562e13bc8017859838bf4e is the first bad commit
commit dadf2439ee464378a4562e13bc8017859838bf4e
Author: Jameson Nash <vtjnash@gmail.com>
Date: Tue Jun 9 20:10:43 2015 -0400
store tuple and vector types to the stack eagerly
fix #11187 (pass struct and tuple objects by stack pointer)
fix #11450 (ccall emission was frobbing the stack)
likely may fix #11026 and may fix #11003 (ref #10525) invalid stack-read on 32-bit
this additionally changes the julia specSig calling convention to pass
non-primitive types by pointer instead of by-value
this additionally fixes a bug in gen_cfunction that could be exposed by
turning off specSig
this additionally moves the alloca calls in ccall (and other places) to
the entry BasicBlock in the function, ensuring that llvm detects them as
static allocations and moves them into the function prologue
this additionally fixes some undefined behavior from changing
a variable's size through a alloca-cast instead of zext/sext/trunc
this additionally prepares for turning back on allocating tuples as vectors,
since the gc now guarantees 16-byte alignment
future work this makes possible:
- create a function to replace the jlallocobj_func+init_bits_value call pair (to reduce codegen pressure)
- allow moving pointers sometimes rather than always copying immutable data
- teach the GC how it can re-use an existing pointer as a box
:040000 040000 d58dc65194a29d6b6fc925500b6e2c36e2f64ddb 29e6b0192b179341954c294a0e9758a259530d1a M base
:040000 040000 0d6eb586fcf027f6b3b9f2f425e858a986cd3ffc c0e5b0e44ca807f3e45ee0531f0b5d243c3a2df7 M src
:040000 040000 d161b92891db26482d38304dd11c73137b8024d8 f1c9480a2b53f01c59cc8c781fbc64063638d461 M test
bisect run success
Found a third one, 0.164 -> 0.242 seconds. I narrowed it down to one of two commits, either c1dc0ec or 9691225. I can't build the first one (julia gives an error while building int.jl), but the second one gives a time of 0.242 seconds. The git logs are:
commit 9691225
Author: David P. Sanders <dpsanders@gmail.com>
Date: Tue Jun 30 18:40:42 2015 -0400
Added tests for base/int.jl in new file test/int.jl
Add tests for @big_str macro
Added tests for @big_str macro
Another @big_str test
Add conversion tests of largest value and (largest+1) into a given type
Removed comments that were for personal use
Fix a 32-bit failure
Another 32-bit problem
Fixed error
commit c1dc0ec
Author: David P. Sanders <dpsanders@gmail.com>
Date: Wed Jul 8 15:42:12 2015 -0400
Remove duplicated definitions of + and < in base/int.jl
It seems there is some later commit that decreased the time 0.242 -> 0.200. Looking for that now.
Some changes made between 9691225 and 7908246 caused a time change of 0.242 -> 0.163.
Commit 7908246 caused 0.237 -> 0.164
commit 7908246
Author: Jeff Bezanson <jeff.bezanson@gmail.com>
Date: Thu Jun 11 18:09:06 2015 -0400
resolve all globals to GlobalRef very early
part of #10403
Commit d8f7c21 then caused 0.164 -> 0.203.
commit d8f7c21
Author: Simon Kornblith <simon@simonster.com>
Date: Fri Jul 10 17:24:53 2015 -0400
Inline pointer(x, i)
Fixes #12054
In summary, we have
317a4d1: elapsed time 0.146
dadf243: 0.146 -> 0.164
9691225 (or c1dc0ec): 0.164 -> 0.235
7908246: 0.237 -> 0.164
d8f7c21: 0.164 -> 0.202
6457fd3: 0.202 -> 0.289
cc: @simonster, @vtjnash
Thanks for the work on narrowing it down.
Bump,
cc: @simonster @vtjnash
All the bisecting is done, so I was really hoping someone would look into this. It's holding me up from moving my lab's code onto release-0.4
Have you tried to use slice or sub from Base? I think most people prefer them over this package. Performance issues with SubArrays will probably be higher on the priority list for those who are able to fix issues like these.
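For reference, a minimal sketch of the three constructors being compared, on Julia 0.4 (the array and index are just illustrative):
using ArrayViews                 # this package
A = rand(4, 10)
v1 = view(A, :, 2)               # ArrayViews.jl view of column 2
v2 = slice(A, :, 2)              # Base 0.4 SubArray; dimensions indexed by scalars are dropped
v3 = sub(A, :, 2)                # Base 0.4 SubArray; differs from slice in how scalar
                                 # indices affect the result's dimensionality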
Looking at some benchmarks below (code here), there is room for improvement in the performance of SubArray compared to the safe ArrayView, but unsafe ArrayViews are still significantly faster than either of them. I think the reason unsafe views were created originally is that there was no way to get performance similar to a double for loop using safe methods, so I don't expect that SubArrays will ever match the performance of unsafe ArrayViews. In order to take small linear slices of arrays efficiently (which is easy to do in C), I think unsafe ArrayViews are the only option.
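To make the access pattern concrete, here is a minimal sketch (not the benchmark code itself) of taking a small contiguous column slice per iteration with ArrayViews; unsafe_view (renamed unsafe_aview in later ArrayViews versions) skips the bounds check:
using ArrayViews
# Hypothetical inner-loop pattern: one small contiguous column slice per iteration.
function sum_columns(q::Matrix{Float64})
    s = 0.0
    for i = 1:size(q, 2)
        qi = unsafe_view(q, :, i)      # contiguous, unchecked view of column i
        for j = 1:length(qi)
            s += qi[j]
        end
    end
    return s
end
sum_columns(rand(4, 100_000))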
Old version of julia:
179.332 milliseconds
double loop @time printed above
144.578 milliseconds
unsafe ArrayView @time printed above
403.553 milliseconds (18000 k allocations: 824 MB, 5.55% gc time)
safe ArrayView @time printed above
4.819 seconds (89997 k allocations: 2472 MB, 3.18% gc time)
slice @time printed above
144.899 milliseconds
callable object @time printed above
New version of julia:
0.178965 seconds
double loop @time printed above
0.287776 seconds
unsafe ArrayView @time printed above
0.395486 seconds (18.00 M allocations: 823.975 MB, 6.32% gc time)
safe ArrayView @time printed above
1.065700 seconds (54.00 M allocations: 1.878 GB, 12.73% gc time)
slice @time printed above
0.287788 seconds
callable object @time printed above
You probably know this, but your benchmark has little to do with indexing performance and everything to do with construction. For example:
julia> @profile func2a(q, F)
julia> Profile.print()
548 REPL.jl; anonymous; line: 92
548 REPL.jl; eval_user_input; line: 62
548 profile.jl; anonymous; line: 16
6 /tmp/julia_tests/array_speed2/func1.jl; func2a; line: 36
132 /tmp/julia_tests/array_speed2/func1.jl; func2a; line: 40
135 /tmp/julia_tests/array_speed2/func1.jl; func2a; line: 41
264 /tmp/julia_tests/array_speed2/func1.jl; func2a; line: 43
78 /tmp/julia_tests/array_speed2/func1.jl; getEulerFlux; line: 145
87 /tmp/julia_tests/array_speed2/func1.jl; getEulerFlux; line: 149
30 /tmp/julia_tests/array_speed2/func1.jl; getEulerFlux; line: 150
25 /tmp/julia_tests/array_speed2/func1.jl; getEulerFlux; line: 151
20 /tmp/julia_tests/array_speed2/func1.jl; getEulerFlux; line: 152
15 /tmp/julia_tests/array_speed2/func1.jl; getEulerFlux; line: 153
1 /tmp/julia_tests/array_speed2/func1.jl; getEulerFlux; line: 145
julia> Profile.clear()
julia> @profile func3(q, F)
julia> Profile.print()
1507 REPL.jl; anonymous; line: 92
1507 REPL.jl; eval_user_input; line: 62
1507 profile.jl; anonymous; line: 16
2 /tmp/julia_tests/array_speed2/func1.jl; func3; line: 56
4 /tmp/julia_tests/array_speed2/func1.jl; func3; line: 57
350 /tmp/julia_tests/array_speed2/func1.jl; func3; line: 58
29 subarray.jl; _slice; line: 42
235 subarray.jl; _slice; line: 43
25 subarray.jl; _slice_unsafe; line: 65
89 subarray.jl; _slice_unsafe; line: 81
24 subarray.jl; _slice_unsafe; line: 438
7 subarray.jl; _slice_unsafe; line: 65
5 subarray.jl; _slice_unsafe; line: 81
802 /tmp/julia_tests/array_speed2/func1.jl; func3; line: 59
59 subarray.jl; _slice; line: 42
659 subarray.jl; _slice; line: 43
33 subarray.jl; _slice_unsafe; line: 65
492 subarray.jl; _slice_unsafe; line: 81
33 subarray.jl; _slice_unsafe; line: 438
4 subarray.jl; _slice_unsafe; line: 65
3 subarray.jl; _slice_unsafe; line: 81
320 /tmp/julia_tests/array_speed2/func1.jl; func3; line: 60
107 /tmp/julia_tests/array_speed2/func1.jl; getEulerFlux; line: 145
107 /tmp/julia_tests/array_speed2/func1.jl; getEulerFlux; line: 149
28 /tmp/julia_tests/array_speed2/func1.jl; getEulerFlux; line: 150
30 /tmp/julia_tests/array_speed2/func1.jl; getEulerFlux; line: 151
21 /tmp/julia_tests/array_speed2/func1.jl; getEulerFlux; line: 152
12 /tmp/julia_tests/array_speed2/func1.jl; getEulerFlux; line: 153
6 /tmp/julia_tests/array_speed2/func1.jl; getEulerFlux; line: 145
9 subarray.jl; _slice; line: 42
8 subarray.jl; _slice; line: 43
You can see that the part of the code in getEulerFlux shows nearly identical performance.
In constructing many small slices, it's definitely true that our performance is not where we want it to be. The best hope is that Julia's compiler will get good enough that it elides the construction of the object altogether, effectively generating the same code produced by the double-loop method. ArrayViews wins simply because the resulting structure is smaller than a SubArray.
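To illustrate the distinction (a sketch against the 0.4 API, not the linked benchmark code): the cost shown in the profile is in constructing the SubArray each iteration, not in indexing through it once it exists.
# Direct indexing: no per-iteration wrapper object is built.
function scale_direct!(F::Matrix{Float64}, q::Matrix{Float64})
    for i = 1:size(q, 2), j = 1:size(q, 1)
        F[j, i] = 2 * q[j, i]
    end
    return F
end
# Per-column views: the slice(...) calls dominate, as in the profile above,
# while the inner indexing loop is comparable to the direct version.
function scale_views!(F::Matrix{Float64}, q::Matrix{Float64})
    for i = 1:size(q, 2)
        qi = slice(q, :, i)
        Fi = slice(F, :, i)
        for j = 1:length(qi)
            Fi[j] = 2 * qi[j]
        end
    end
    return F
end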
Also note that the "safe" view is not really that safe. For example:
julia> A = rand(3,4,5);
julia> v = view(A, :, 10, 1)
3-element ArrayViews.ContiguousView{Float64,1,Array{Float64,3}}:
0.55868
0.14655
0.774752
julia> s = slice(A, :, 10, 1)
ERROR: BoundsError: attempt to access 3x4x5 Array{Float64,3}:
[:, :, 1] =
0.421446 0.789993 0.859427 0.62459
0.286069 0.0124386 0.347475 0.814826
0.598649 0.0719514 0.552666 0.605439
[:, :, 2] =
0.440828 0.88299 0.632718 0.761264
0.85906 0.0755092 0.643793 0.954605
0.51425 0.822237 0.264295 0.202285
[:, :, 3] =
0.465205 0.55868 0.712503 0.294909
0.618646 0.14655 0.427141 0.52394
0.0494494 0.774752 0.0379797 0.120516
[:, :, 4] =
0.738384 0.0899148 0.330653 0.25026
0.334319 0.917572 0.753121 0.101671
0.874826 0.742816 0.659517 0.492652
[:, :, 5] =
0.12097 0.59091 0.535681 0.423868
0.0773069 0.304412 0.54288 0.473162
0.463124 0.127348 0.959255 0.621535
at index [Colon(),10,1]
in throw_boundserror at abstractarray.jl:156
in _internal_checkbounds at abstractarray.jl:176
in checkbounds at abstractarray.jl:159
in slice at subarray.jl:39
You get somewhat better performance by calling Base._slice_unsafe, which skips the bounds check.
If we force-inline construction, it looks like we could get another 10% (making the performance 1.5x that of the double-loop, on my machine), but I'm not certain it's worth it.
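As a general illustration of what force-inlining construction means (a hypothetical helper, not the actual Base change): marking the small view-building function with @inline gives the compiler a chance to elide the intermediate object in the caller.
# Hypothetical helper: without the @inline hint each call constructs a SubArray;
# with it, construction may be inlined into the caller and optimized away.
@inline colslice(A::Matrix{Float64}, i::Int) = slice(A, :, i)
function colsum(A::Matrix{Float64})
    s = 0.0
    for i = 1:size(A, 2)
        v = colslice(A, i)
        for j = 1:length(v)
            s += v[j]
        end
    end
    return s
end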
That's interesting (I thought there was an extra pointer chase using regular array indexing rather than unsafe_load, but it was actually from the object not getting elided). So it looks like there is hope for a safe, efficient view mechanism.
Should I file a separate issue for elision of small objects? It's mentioned in the Arraypocalypse Now issue, but it might be a separable task, and I have some code here that would make a decent test case.
What conditions produced slice being 1.5x the double loop? I added both slice and _unsafe_slice to the testing (new code pushed here), and I am seeing approximately 4x for _unsafe_slice (using --check-bounds=no):
0.181120 seconds
double loop @time printed above
0.292330 seconds
unsafe ArrayView @time printed above
0.394143 seconds (18.00 M allocations: 823.975 MB, 5.10% gc time)
safe ArrayView @time printed above
1.041210 seconds (54.00 M allocations: 1.878 GB, 8.14% gc time)
slice @time printed above
0.717231 seconds (36.00 M allocations: 1.341 GB, 8.14% gc time)
unsafe slice @time printed above
0.287299 seconds
callable object @time printed above
Is it possible to run the benchmark suite on the two offending commits (even if they are significantly older than the benchmark suite itself)? I suspect they affect other parts of Julia, which might move them up on the priority list.
Is this still relevant to keep open?
This particular case has been fixed on Julia 0.6. I did find a different case on 0.6 where a function applied to an aview is 2x slower than when applied to a regular Array. The 2x slowdown is also present for Base.view:
using ArrayViews
function test{T,Tflx,Tres}(Q::AbstractArray{T, 2},
                           flux::AbstractArray{Tflx,3},
                           res::AbstractArray{Tres,3})
    for elem = 1:size(flux,3)
        for i = 1:size(Q, 2)
            for j = 1:size(Q, 1)
                for field = 1:size(flux,1)
                    res[field,i,elem] += Q[j,i]*flux[field,j,elem]
                end
            end
        end
    end
    return nothing
end
numDofPerNode = 4
numNodesPerElement = 30
numEl = 5000
# input arrays
Q = rand(numNodesPerElement, numNodesPerElement)
flux = rand(numDofPerNode, numNodesPerElement, numEl)
# create second flux array with extra dimension, then make a view
flux2 = rand(numDofPerNode, numNodesPerElement, numEl, 2)
flux2_view = aview(flux2, :, :, :, 1)
flux3_view = Base.view(flux2, :, :, :, 1)
# output array
res = zeros(numDofPerNode, numNodesPerElement, numEl)
# warm up
@time test(Q, flux, res)
@time test(Q, flux2_view, res)
@time test(Q, flux3_view, res)
println("final timings:")
@time test(Q, flux, res)
@time test(Q, flux2_view, res)
@time test(Q, flux3_view, res)
Output:
creanj@Excelsior:/tmp$ julia -O3 ./tmp.jl
0.043399 seconds (5.89 k allocations: 306.742 KiB)
0.076066 seconds (9.59 k allocations: 526.231 KiB)
0.082924 seconds (30.89 k allocations: 1.579 MiB)
final timings:
0.025075 seconds (4 allocations: 160 bytes)
0.055673 seconds (4 allocations: 160 bytes)
0.052890 seconds (4 allocations: 160 bytes)
EDIT: forgot versioninfo()
julia> versioninfo()
Julia Version 0.6.2
Commit d386e40* (2017-12-13 18:08 UTC)
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz
WORD_SIZE: 64
BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Prescott)
LAPACK: libopenblas64_
LIBM: libopenlibm
LLVM: libLLVM-3.9.1 (ORCJIT, broadwell)
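A quick way to check whether the 2x comes from the view wrapper rather than from the data layout (a sketch, assuming the script above has already run):
# Materialize the view into a plain contiguous Array and time the same function.
flux2_copy = copy(flux3_view)       # Array{Float64,3} with the same contents as the views
test(Q, flux2_copy, res)            # warm up
@time test(Q, flux2_copy, res)      # expected to be close to the plain `flux` timing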
The Base.view regression is fixed on master. Closing.
I did some benchmarks here and found a factor of 2 slowdown using release-0.4 compared to an older version of 0.4 (both using --check-bounds=no).
With the old version:
With release-0.4:
The old version info:
The new version info:
The slowdown occurs when using ArrayViews, which is also used for the callable object test.