ericagol / NbodyGradient.jl

N-body integrator computes derivatives with respect to initial conditions for TTVs, RV, Photodynamics & more

Copying sub-arrays from Jacobians #39

Closed: ericagol closed this issue 3 years ago

ericagol commented 3 years ago

To use BLAS for the Jacobian multiplication, we first copy out of the full Jacobian the portion corresponding to the two bodies involved in a drift-Kepler step:

https://github.com/ericagol/NbodyGradient/blob/c59e961cec6c2de1386737cb613e043775a606ab/src/utils.jl#L48-L100

and then copy back after the multiplication is complete.

These routines take about 50% of the processing time and are much slower than the actual BLAS multiplication, which is strange since they just copy a portion of one matrix into another.

I'm wondering if there is some way to speed this up; perhaps we are doing something inefficient here? It may also be worth looking into LazyArrays: https://github.com/JuliaArrays/LazyArrays.jl
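
For concreteness, here is a stripped-down sketch of the pattern (the 35×35 full Jacobian, the 14×14 pair block, and the row indices below are placeholders, not the actual NbodyGradient layout):

using LinearAlgebra

jac_full = rand(35, 35)                  # stand-in for the full Jacobian
jac_ij   = rand(14, 14)                  # per-pair Jacobian of the drift-Kepler step
tmp1     = Matrix{Float64}(undef, 14, 35)
tmp2     = similar(tmp1)
rows     = [1:7; 22:28]                  # hypothetical rows for the two bodies involved

tmp1 .= @view(jac_full[rows, :])                     # copy the sub-block out
BLAS.gemm!('N', 'N', 1.0, jac_ij, tmp1, 0.0, tmp2)   # BLAS multiply on the small block
jac_full[rows, :] .= tmp2                            # copy the result back

# One alternative to try: multiply against a view of the full Jacobian directly,
# trading the explicit copy-out for non-contiguous access (which bypasses BLAS
# and uses the generic Julia matmul instead):
mul!(tmp2, jac_ij, @view(jac_full[rows, :]))
jac_full[rows, :] .= tmp2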

Any ideas @langfzac @giordano?

giordano commented 3 years ago

Can I see a reproducible example, with the input arguments for these functions, so that I can profile them? One gotcha is that slicing in Julia creates a copy, instead of taking a view (see https://docs.julialang.org/en/v1/manual/performance-tips/#man-performance-views), but as far as I can see here you're copying the arrays elementwise, which should be ok.
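
A tiny illustration of that slicing-vs-view distinction (not from NbodyGradient):

A = rand(4, 4)
B = A[1:2, :]          # slicing allocates a new 2×4 matrix (a copy)
C = @view A[1:2, :]    # no allocation; C aliases A's memory
C[1, 1] = 0.0          # writes through to A[1, 1]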

In principle, copying array elements should be relatively fast:

julia> using BenchmarkTools

julia> function copy_array!(dest, src)
           for idx in eachindex(src)
               @inbounds dest[idx] = src[idx]
           end
           return dest
       end
copy_array! (generic function with 1 method)

julia> n = 1000
1000

julia> @btime copy_array!(A, B) setup = (A = Matrix{Float64}(undef, n, n); B = randn(n, n));
  722.835 μs (0 allocations: 0 bytes)

This takes a bit less than a millisecond to copy one million elements. Broadcasting the entire array is slightly faster, but on the same order of magnitude:

julia> @btime A .= B setup = (A = Matrix{Float64}(undef, n, n); B = randn(n, n));
  424.848 μs (0 allocations: 0 bytes)

I don't think multiplication of matrices of this size would be faster than that:

julia> @btime A * B setup = (A = randn(n, n); B = randn(n, n));
  14.267 ms (2 allocations: 7.63 MiB)

Note that broadcasting the entire array is on par with memcpy:

julia> memcpy!(A, B) = ccall(:memcpy, Cvoid, (Ptr{Cvoid}, Ptr{Cvoid}, Csize_t), A, B, sizeof(B))
memcpy! (generic function with 1 method)

julia> @btime memcpy!(A, B) setup = (A = Matrix{Float64}(undef, n, n); B = randn(n, n))
  447.388 μs (0 allocations: 0 bytes)

so you can't get much faster than that.
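
For what it's worth, the built-in copyto! should take the same memmove fast path for dense arrays of bits types, so the explicit ccall isn't needed just to get that speed (same A and B as in the benchmarks above):

copyto!(A, B)   # Base copyto!; calls memmove under the hood for dense Float64 arrays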

ericagol commented 3 years ago

@giordano Thanks for the suggestions! Here's a script which sets up the outer solar system and runs NbodyGradient for 5,000,000 days with 50-day time steps (first without, then with derivatives):

NBG_performance.zip

I then profiled the code by running the integrate function under the profiler.

giordano commented 3 years ago

Ok, I had a look, but frankly I think those functions are already pretty quick; I don't think you can speed them up by much. You're also already looping in the correct (column-major) order.
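
For anyone following along, column-major storage means the row index should vary fastest, which is the pattern already used in utils.jl; a generic sketch:

function copy_block!(dest, src)
    @inbounds for j in axes(src, 2), i in axes(src, 1)   # i (rows) is the innermost loop
        dest[i, j] = src[i, j]
    end
    return dest
end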

With your ic, I get

julia> using BenchmarkTools

julia> s = State(ic);

julia> d = Derivatives(Float64, s.n);

julia> @btime copy_submatrix!($(s), $(d), 21, 28, 35)
  283.110 ns (0 allocations: 0 bytes)

This has copied 994 elements across different arrays. On average, the copy of one element takes a few tenths of a nanosecond, which is on the same order of magnitude as the benchmarks above, including the memcpy.
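
Spelling out the per-element arithmetic from the timings above:

283.110e-9 / 994          # ≈ 2.8e-10 s ≈ 0.3 ns per element (copy_submatrix!)
722.835e-6 / 1_000_000    # ≈ 7.2e-10 s ≈ 0.7 ns per element (elementwise 1000×1000 copy)
424.848e-6 / 1_000_000    # ≈ 4.2e-10 s ≈ 0.4 ns per element (broadcast, roughly memcpy speed)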

I suspect the issue is simply that you end up calling these functions a very large number of times?

ericagol commented 3 years ago

@giordano I am wondering whether the issue is that @profile doesn't profile the BLAS routines. I'm finding that the multiplications are very quick relative to copying the matrices, which doesn't make sense. Perhaps this is because the BLAS routines aren't counted when the profiler samples during the multiply?

giordano commented 3 years ago

I believe the profiler should show the time spent in BLAS.gemm!; it just can't descend into the stack of the C BLAS library.

However I can see a similar effect:

julia> using LinearAlgebra, NbodyGradient, ProfileView  # LinearAlgebra for BLAS, NbodyGradient for State

julia> function test!(s::State{T}, d) where {T<:AbstractFloat}
           zilch = zero(T)
           uno = one(T)
           indi, indj, sevn = 21, 28, 35
           NbodyGradient.copy_submatrix!(s, d, indi, indj, sevn)
           BLAS.gemm!('N','N', uno, d.jac_ij, d.jac_tmp1, zilch, d.jac_tmp2)
           NbodyGradient.ypoc_submatrix!(s, d, indi, indj, sevn)
           return nothing
       end
test! (generic function with 1 method)

julia> function test!(s::State{T}, d, n)  where {T<:AbstractFloat}
           for _ in 1:n
               test!(s, d)
           end
       end
test! (generic function with 2 methods)

julia> @profview test!(s, d, 1); # warmup profiling

julia> @profview test!(s, d, 10000);

(screenshot: ProfileView flame graph of test!)

The time spent in BLAS.gemm! is the pink block highlighted by the tooltip; to its left is the NbodyGradient.copy_submatrix! block, and to its right is NbodyGradient.ypoc_submatrix!. The weird thing is that these operations should individually have very different timings:

julia> @btime NbodyGradient.copy_submatrix!($(s), $(d), 21, 28, 35);
  281.380 ns (0 allocations: 0 bytes)

julia> @btime NbodyGradient.ypoc_submatrix!($(s), $(d), 21, 28, 35);
  285.982 ns (0 allocations: 0 bytes)

julia> @btime BLAS.gemm!($('N'), $('N'), $(1.0), $(d.jac_ij), $(d.jac_tmp1), $(0.0), $(d.jac_tmp2));
  1.043 μs (0 allocations: 0 bytes)

So copy_submatrix! + ypoc_submatrix! together should take about half as long as BLAS.gemm!. I wonder if the profiler is seeing some artifact of the repeated calls to the function, like caching of the results or something like that.
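
Checking that against the numbers just above:

(281.380 + 285.982) / 1043.0   # ≈ 0.54: the two copies together take about half the gemm! time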

I have the feeling copy_submatrix! and ypoc_submatrix! are copying around too many elements compared to the matrix multiplications you're doing: BLAS.gemm!('N', 'N', 1.0, d.jac_ij, d.jac_tmp1, 0.0, d.jac_tmp2) (which is pretty much equivalent to mul!(d.jac_tmp2, d.jac_ij, d.jac_tmp1)) is the matrix multiplication d.jac_ij * d.jac_tmp1, whose factors have sizes (14, 14) and (14, 35), respectively, but as I mentioned above copy_submatrix!(s, d, 21, 28, 35) copies around 994 elements. In ah18! there is also another matrix multiplication, mul!(d.tmp14, d.jac_ij, d.dqdt_tmp1), but this is even smaller (and much faster than copy_submatrix!): d.dqdt_tmp1 is a (14, 1) vector.

In my first example above, the matrix multiplication involved matrices of the same size as those being copied around, and that operation would dwarf the data movement. Here, instead, the matrix multiplications are much smaller.
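
A rough way to see the change of regime, using the sizes quoted above:

2 * 1000^3 / 1000^2       # = 2000.0: flops in the big gemm per element copied, in my first example
2 * 14 * 14 * 35 / 994    # ≈ 13.8: flops in this gemm! per element copied by copy_submatrix!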

You may also want to ask on Discourse; hopefully someone can come up with other ideas, but perhaps you should rethink these data structures to avoid or minimise the data movement.

ericagol commented 3 years ago

@giordano Thanks, Mosè, this is very helpful! The older version of this code did not use structures, so I am trying to profile it to see whether that is what is causing the issue. If not, then I'm going to try to avoid the allocation altogether, which will require passing bigger matrices to BLAS but may end up being faster (I've outlined this in issue #43). -Eric

ericagol commented 3 years ago

It looks like the older version of NbodyGradient (which can still be found in the TRAPPIST1_Spitzer repo) also spends much more time on copying matrices than on the BLAS multiplication, so I'm going to try to implement issue #43.

ericagol commented 3 years ago

I think the profiler is not timing the BLAS routine correctly. I tried eliminating the copies and simply placing the Jacobian directly into the full matrix. The lines in which BLAS is called still had a very small number of hits, while the overall algorithm takes about 67% longer. Since the copying no longer takes any time, this increase can only be due to the BLAS multiply (there is some additional time spent zeroing out the matrices, but it is minimal). All to say, I think the profiler is not counting the time spent in BLAS.
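
Roughly, the no-copy variant looks like this (a schematic with placeholder sizes and indices, not the exact code): embed the 14×14 pair block in an otherwise-identity full-size matrix, then do a single large multiply, so nothing needs to be copied out or back.

using LinearAlgebra

n = 35
rows = [1:7; 22:28]                    # placeholder indices of the two bodies
jac_ij   = rand(14, 14)                # per-pair Jacobian for this step
jac_full = rand(n, n)                  # full Jacobian
jac_big  = Matrix{Float64}(I, n, n)    # identity except for the pair block (re-zeroed each step)
jac_big[rows, rows] .= jac_ij
out = similar(jac_full)
BLAS.gemm!('N', 'N', 1.0, jac_big, jac_full, 0.0, out)   # one n×n multiply, no sub-copies
jac_full .= out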

giordano commented 3 years ago

Are you on macOS? It may be https://github.com/JuliaLang/julia/issues/38350

giordano commented 3 years ago

Ok, I just discovered that to enter the stack of C/Fortran calls one needs to pass the C=true argument to Profile.print() (or ProfileView.view() if using ProfileView.jl). The side effect is that you also get unrelated C calls coming from Julia internals (which is why these backtraces are excluded by default). The macOS issue, however, may still stand.
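
Concretely, with the test! driver from above (s and d as before):

using Profile, ProfileView

Profile.clear()
@profile test!(s, d, 10000)
Profile.print(; C = true)     # flat/tree report including C and Fortran frames
ProfileView.view(C = true)    # flame graph with C frames included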

ericagol commented 3 years ago

Indeed, I just came across this post:

https://discourse.julialang.org/t/profiling-doesnt-count-matrix-products-correctly/40420

and it refers to this one:

https://github.com/JuliaLang/julia/issues/33605

I tried running with C=true, and now it seems to track these:

julia> Profile.print(format=:flat;C=true,mincount=10000)
 Count  Overhead File                                                   Line Function
 =====  ======== ====                                                   ==== ========
 44912         0 @Base/boot.jl                                           360 eval(m::Module, e::Any)
 44912         0 @Base/client.jl                                         387 (::Base.var"#878#880"{Bool, Bool, Bool})(REPL::Module)
 44912         0 @Base/client.jl                                         485 _start()
 44912         0 @Base/client.jl                                         302 exec_options(opts::Base.JLOptions)
 44912         0 @Base/client.jl                                         372 run_main_repl(interactive::Bool, quiet::Bool, banner::Bool, history_file::Bo...
 44912         0 @Base/essentials.jl                                     707 #invokelatest#2
 44912         0 @Base/essentials.jl                                     706 invokelatest
 44912         0 ...esources/julia/lib/julia/libjulia-internal.1.dylib     ? _jl_invoke
 44912         0 ...esources/julia/lib/julia/libjulia-internal.1.dylib     ? do_apply
 44912         0 ...esources/julia/lib/julia/libjulia-internal.1.dylib     ? do_call
 44912         0 ...esources/julia/lib/julia/libjulia-internal.1.dylib     ? eval_body
 44912         0 ...esources/julia/lib/julia/libjulia-internal.1.dylib     ? jl_apply_generic
 44912         0 ...esources/julia/lib/julia/libjulia-internal.1.dylib     ? jl_f__apply_latest
 44912         0 ...esources/julia/lib/julia/libjulia-internal.1.dylib     ? jl_interpret_toplevel_thunk
 44912         0 ...esources/julia/lib/julia/libjulia-internal.1.dylib     ? jl_toplevel_eval_flex
 44912         0 ...esources/julia/lib/julia/libjulia-internal.1.dylib     ? jl_toplevel_eval_in
 44912         0 ...esources/julia/lib/julia/libjulia-internal.1.dylib     ? repl_entrypoint
 44912         0 ...esources/julia/lib/julia/libjulia-internal.1.dylib     ? true_main
 44912         0 ...6.app/Contents/Resources/julia/lib/julia/sys.dylib     ? jfptr_YY.878_38924.clone_1
 44912         0 ...6.app/Contents/Resources/julia/lib/julia/sys.dylib     ? jfptr__start_33424.clone_1
 44904        17 ...are/NbodyGradient/test/outer_ss/NBG_performance.jl    92 integrate(s::State{Float64}, h::Float64, N::Int64; grad::Bool)
 44905         0 ...are/NbodyGradient/test/outer_ss/NBG_performance.jl    88 (::var"#integrate##kw")(::NamedTuple{(:grad,), Tuple{Bool}}, ::typeof(integr...
 44912         0 ...build/usr/share/julia/stdlib/v1.6/REPL/src/REPL.jl   317 run_repl(repl::REPL.AbstractREPL, consumer::Any; backend_on_current_task::Bool)
 44912         0 ...build/usr/share/julia/stdlib/v1.6/REPL/src/REPL.jl   139 eval_user_input(ast::Any, backend::REPL.REPLBackend)
 44912         0 ...build/usr/share/julia/stdlib/v1.6/REPL/src/REPL.jl   200 repl_backend_loop(backend::REPL.REPLBackend)
 44912         0 ...build/usr/share/julia/stdlib/v1.6/REPL/src/REPL.jl   305 run_repl(repl::REPL.AbstractREPL, consumer::Any)
 44912         0 ...build/usr/share/julia/stdlib/v1.6/REPL/src/REPL.jl   185 start_repl_backend(backend::REPL.REPLBackend, consumer::Any)
243513    243513 /usr/lib/system/libsystem_kernel.dylib                    ? __psynch_cvwait
 44905         0 [any unknown stackframes]
Total snapshots: 324684

ericagol commented 3 years ago

The last two lines before "Total snapshots" don't show up without C=true. I'm not sure what "__psynch_cvwait" is, but it looks like this is where 75% of the time is being spent (it's odd that this fraction is precisely 75%; also, the numbers don't fully add up right, even when I don't use mincount=10000).
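
The fraction is in fact exactly 3/4 of all snapshots:

243513 / 324684   # = 0.75 (the __psynch_cvwait count over the total snapshot count)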

ericagol commented 3 years ago

So, I think what is going on is that the BLAS call is simply not being profiled properly (even with C=true I don't get information about which Julia lines the code is spending time on). And, since the code is spending most of its time in the BLAS routine, finding extra time savings in the other portions of the code won't improve the speed much (which is disappointing, but good to know!). Thanks for your help with this, @giordano!

ericagol commented 3 years ago

Hmmm... I tried to total up the time spent on each line with a call to BLAS, and it comes out to 38%, which is less than the 75% the profile indicated; perhaps there is other Julia code which is lowered to C as well. It's too bad that I can't use the profiler to find the total time spent on each line of code, but this does mean that if we can speed up the other 62% of the time spent in Julia code, we can improve the overall run time.
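
As a rough Amdahl-style bound on what that could buy (keeping in mind the 38%/62% split is itself uncertain, given the profiling issues above):

speedup(k) = 1 / (0.38 + 0.62 / k)   # overall speedup if the Julia (non-BLAS) 62% gets k× faster
speedup(2)                           # ≈ 1.4
speedup(Inf)                         # ≈ 2.6, the ceiling if the Julia part took no time at all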

ericagol commented 3 years ago

I'm closing this for now; thanks!