borisblagov / Julia_AR4_Bayesian_Regression


Memory efficiency #1

Open gdalle opened 1 year ago

gdalle commented 1 year ago

https://github.com/borisblagov/Julia_AR4_Bayesian_Regression/blob/9e5aa1749f4b0afb59a645c1f0696e85935e1d04/src/NewB.jl#L12-L16

This is quite wasteful in terms of memory for two reasons:

  1. X = [X Yfull[p-i+1:end-i,:]] creates a new matrix every time
  2. Yfull[p-i+1:end-i,:] copies the submatrix you want

My solutions would be:

  1. Pre-allocate memory and at every iteration, fill the parts of X that you need with dot syntax
  2. Use views instead of slices
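Both suggestions combined might look something like the sketch below (`mlag_sketch` is a hypothetical name; I'm assuming `Yfull` is a `T×n` data matrix and `p` is the number of lags, as in the linked code):

```julia
# Hypothetical sketch: preallocate X once, then fill each lag block
# in-place through a view (no new matrices, no copied slices).
function mlag_sketch(Yfull::AbstractMatrix, p::Integer)
    T, n = size(Yfull)
    X = zeros(eltype(Yfull), T - p, n * p)          # single allocation up front
    for i in 1:p
        cols = (1 + n*(i-1)):(n*i)
        X[:, cols] .= @view Yfull[p-i+1:end-i, :]   # view avoids copying the submatrix
    end
    return X
end
```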
borisblagov commented 1 year ago

The branch mlag-function-memory-efficiency-from-Issue--Memory-efficiency-#1 explores possible solutions to both problems by introducing two changes to the mlag function. The final version is mlag, with mlag_old_v1 and mlag_old_v2 preserving the intermediate code for issues 1. and 2. respectively.

Current suggestion for the loop is

    for i = 1:p
        X[:,1 + n*(i-1): n + n*(i-1)] = @view Yfull[p-i+1:end-i,:]
    end

where

julia> @time mlag(Z,p);
  0.000013 seconds (3 allocations: 7.125 KiB)

julia> @time mlag_old_v2(Z,p);
  0.000010 seconds (7 allocations: 12.000 KiB)

julia> @time mlag_old_v1(Z,p);
  0.000015 seconds (11 allocations: 23.906 KiB)
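For reference, a quick sanity check (with hypothetical data) that the in-place loop builds the same matrix as the original concatenation-based construction:

```julia
# Hypothetical inputs: 100 observations of 3 variables, 4 lags.
Yfull = randn(100, 3)
p = 4
T, n = size(Yfull)

# original style: grow X by concatenation, copying each slice
X_old = let X = Matrix{Float64}(undef, T - p, 0)
    for i in 1:p
        X = [X Yfull[p-i+1:end-i, :]]
    end
    X
end

# suggested style: preallocate once and fill through views
X_new = zeros(T - p, n * p)
for i in 1:p
    X_new[:, 1 + n*(i-1) : n + n*(i-1)] = @view Yfull[p-i+1:end-i, :]
end

@assert X_old == X_new
```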
gdalle commented 1 year ago

With these kinds of benchmarks you should really use BenchmarkTools.jl for accurate measurements

borisblagov commented 1 year ago

True, here it is with @btime:

julia> @btime mlag(Z,p);
  2.289 μs (3 allocations: 7.12 KiB)
julia> @btime mlag_old_v1(Z,p);
  2.312 μs (11 allocations: 23.91 KiB)
julia> @btime mlag_old_v2(Z,p);
  1.878 μs (7 allocations: 12.00 KiB)

and with a larger Z matrix / more lags:

julia> @btime mlag(Z,p);
  16.000 μs (4 allocations: 40.70 KiB)
julia> @btime mlag_old_v2(Z,p);
  12.200 μs (10 allocations: 75.20 KiB)
julia> @btime mlag_old_v1(Z,p);
  13.800 μs (19 allocations: 166.81 KiB)

With @btime the time indeed varies a bit and it's hard to say which implementation is the fastest, but allocations go down with the suggestions.

gdalle commented 1 year ago

Beware, when you use @btime you need to interpolate global variables in order to avoid biases
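The pattern looks like this (assuming mlag from the branch above is loaded; Z and p are just example inputs):

```julia
# Without $, Z and p are looked up as untyped globals on every evaluation,
# so the timing includes dynamic-dispatch overhead that isn't part of mlag.
using BenchmarkTools

Z = randn(200, 5)
p = 4
@btime mlag(Z, p);     # biased: globals resolved inside the benchmark loop
@btime mlag($Z, $p);   # interpolated: values spliced into the expression
```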

borisblagov commented 1 year ago

Oh, I learned a lot of new things today, thanks.

julia> @benchmark mlag($Z,$p)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  16.000 μs …  3.935 ms  ┊ GC (min … max): 0.00% … 98.46%
 Time  (median):     18.700 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   22.461 μs ± 65.016 μs  ┊ GC (mean ± σ):  7.94% ±  2.90%

   ▁▂▄▇█▇▄▂▃▂    ▁▂▁     ▁▂▃▃▃▁                               ▂
  ████████████▇█████▇▇▆▆█████████▇▆▆▆▆▆▇▆▆▆▆▆▆▆▆▆▆▅▅▆▆▇▅▄▁▃▄▅ █
  16 μs        Histogram: log(frequency) by time      43.5 μs <

 Memory estimate: 40.67 KiB, allocs estimate: 3.

julia> @benchmark mlag(Z,p)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  16.100 μs …  1.652 ms  ┊ GC (min … max): 0.00% … 96.16%
 Time  (median):     18.400 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   20.904 μs ± 46.670 μs  ┊ GC (mean ± σ):  6.42% ±  2.88%

  ▁▁▃▆▇█▅▃       ▁▂           ▁                               ▂
  ██████████▇▆▇▇████▇▆▅▅▅▅▅▇▇██▇▇▅▆▆▅▅▄▄▄▄▄▆▅▆▄▁▄▄▅▄▅▄▅▄▃▅▃▄▅ █
  16.1 μs      Histogram: log(frequency) by time        43 μs <

 Memory estimate: 40.70 KiB, allocs estimate: 4.

and

julia> @btime mlag($Z,$p);
  16.100 μs (3 allocations: 40.67 KiB)

julia> @btime mlag_old_v2($Z,$p);
  12.000 μs (9 allocations: 75.17 KiB)

julia> @btime mlag_old_v1($Z,$p);
  13.500 μs (18 allocations: 166.78 KiB)

compared to

julia> @btime mlag(Z,p);
  16.000 μs (4 allocations: 40.70 KiB)

julia> @btime mlag_old_v2(Z,p);
  12.200 μs (10 allocations: 75.20 KiB)

julia> @btime mlag_old_v1(Z,p);
  13.800 μs (19 allocations: 166.81 KiB)
gdalle commented 1 year ago

Current suggestion for the loop is

    for i = 1:p
        X[:,1 + n*(i-1): n + n*(i-1)] = @view Yfull[p-i+1:end-i,:]
    end

I think you're still missing a dot in the assignment

X[:,1 + n*(i-1): n + n*(i-1)] .= @view Yfull[p-i+1:end-i,:]

That way we tell Julia to put every element of the right-hand side into the corresponding element in the left-hand side
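A small demonstration of the difference (hypothetical toy arrays): `X[rows, cols] = A` calls `setindex!`, while `X[rows, cols] .= A` lowers to `broadcast!` on a view of `X`; both write in-place, but only the dotted form fuses with dotted calls on the right-hand side.

```julia
X = zeros(4, 4)
Y = reshape(1.0:16.0, 4, 4)

X[:, 1:2] = @view Y[:, 1:2]     # setindex! with a view: no RHS copy
X[:, 3:4] .= @view Y[:, 3:4]    # broadcast! into view(X, :, 3:4)

@assert X == Y                  # both halves were written in-place
```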

gdalle commented 1 year ago

But if this doesn't speed things up, I guess we'll have relearned another lesson in code optimization: sometimes it doesn't pay to be clever ^^

borisblagov commented 1 year ago

Yeah, I've had that in Matlab a lot. It might be new to me in Julia, but there I have spent vast amounts of time getting code to perform better, only for it to improve marginally or not at all 😄

I tried the dot operator: for small Z it did improve things a bit, but for large Z it doesn't.

julia> @btime mlag($Z,$p);
  1.320 μs (2 allocations: 7.09 KiB)

julia> @btime mlag_old_v2($Z,$p);
  1.360 μs (6 allocations: 11.97 KiB)

julia> @btime mlag_old_v1($Z,$p);
  1.780 μs (10 allocations: 23.88 KiB)

"Large" Z (i.e. 10 variables)

julia> @btime mlag_old_v2($Z,$p);
  12.100 μs (7 allocations: 104.67 KiB)

julia> @btime mlag_old_v1($Z,$p);
  10.400 μs (13 allocations: 178.05 KiB)

julia> @btime mlag($Z,$p);
  12.400 μs (3 allocations: 58.42 KiB)

Of course this is just me trying to understand the mechanisms, and it doesn't really matter, but it annoys me: I cannot grasp why the dot operator would have any effect here (positive or negative).

In the docs it states that the dot operator is a broadcast with the added benefit that it can fuse the operations. I don't see here how fusing would play a role, so essentially I am just viewing the RHS and mapping it to the LHS with a view. So my understanding of that was that if there is nothing to fuse, there should be no difference.

here is the relevant part of the docs

If the left-hand side is an array-indexing expression, e.g. X[begin+1:end] .= sin.(Y), then it translates to broadcast! on a view, e.g. broadcast!(sin, view(X, firstindex(X)+1:lastindex(X)), Y), so that the left-hand side is updated in-place.
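That lowering can be written out explicitly; a small check (with hypothetical arrays) that the dotted form and the broadcast!-on-a-view form agree:

```julia
X = zeros(5)
Y = [0.1, 0.2, 0.3, 0.4]

X[2:end] .= sin.(Y)                            # fused, in-place through a view

X2 = zeros(5)
broadcast!(sin, view(X2, 2:lastindex(X2)), Y)  # what the dotted form lowers to

@assert X == X2
```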

In any case thanks for your evil plan to open an issue 🙂

gdalle commented 1 year ago

In the docs it states that the dot operator is a broadcast with the added benefit that it can fuse the operations. I don't see here how fusing would play a role, so essentially I am just viewing the RHS and mapping it to the LHS with a view. So my understanding of that was that if there is nothing to fuse, there should be no difference.

Indeed for this particular case X[ind] .= @view Y[ind] I'm unsure whether the dot brings anything, whereas for X[ind] .= f.(Y[ind]) it is clear to me. Still I wanted to try it!

borisblagov commented 1 year ago

But it does! That is what I don't understand. I was too lazy to make mlag_old_v3

without dot

julia> @btime mlag($Z,$p);
  14.500 μs (3 allocations: 58.42 KiB)

with dot

julia> @btime mlag($Z,$p);
  12.700 μs (3 allocations: 58.42 KiB)

It's such a rabbit hole 😄

gdalle commented 1 year ago

An easy test is to remove the loop in the middle. If it doesn't change the number of allocations, then we've done all we could there. Also, I suggested all of this based on reading your code, but in reality you should profile it, preferably with the @profview macro from VSCode. Otherwise you risk spending a lot of time optimizing parts that don't contribute much running time.
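A minimal sketch of that workflow with the Profile stdlib (which @profview builds on); mlag and the inputs Z, p are assumed to be defined:

```julia
using Profile

Z = randn(1000, 10)            # hypothetical inputs
p = 4
mlag(Z, p)                     # run once so we profile runtime, not compilation

@profile for _ in 1:1000       # repeat so the sampler collects enough frames
    mlag(Z, p)
end
Profile.print(format = :flat, sortedby = :count)  # hottest lines first
```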