JuliaGPU / GPUArrays.jl

Reusable array functionality for Julia's various GPU backends.
MIT License
313 stars 74 forks source link

Add specializations for istriu/istril to speed up isdiag. #502

Closed maleadt closed 7 months ago

maleadt commented 7 months ago

Significantly speeds up isdiag and related checks, especially on nearly-triangular inputs:

julia> x = collect(UpperTriangular(rand(2048,2048)));
julia> y = cu(x);

julia> @benchmark istriu(x)
BenchmarkTools.Trial: 5467 samples with 1 evaluation.
 Range (min … max):  897.042 μs … 1.307 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     912.342 μs             ┊ GC (median):    0.00%
 Time  (mean ± σ):   912.978 μs ± 7.722 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                  ▃▆▆▄▄▄▅▅█▇▅▃▃▅▅▃▃▂▁▁                        ▂
  ▃▁▁▃▁▃▁▃▃▄▆▅▅▄▇▇████████████████████████▇▇█▇▇▇▇▇▆▅▆▆▄▇▅▅▅▆▆ █
  897 μs       Histogram: log(frequency) by time       933 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark istriu(y)
BenchmarkTools.Trial: 52 samples with 1 evaluation.
 Range (min … max):  68.163 ms … 167.157 ms  ┊ GC (min … max): 0.00% … 9.36%
 Time  (median):     91.300 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   96.385 ms ±  22.002 ms  ┊ GC (mean ± σ):  1.73% ± 3.23%

     █ ▃█▃ ▃▃█ █ ▃▃  ▃ █  ▃▃    ▃
  ▇▁▇█▇███▇███▁█▇██▇▇█▁█▇▇██▇▁▇▁█▁▁▁▁▁▁▁▁▁▇▁▁▇▇▁▁▁▁▁▁▁▁▁▁▇▁▁▁▇ ▁
  68.2 ms         Histogram: frequency by time          158 ms <

 Memory estimate: 5.76 MiB, allocs estimate: 130855.

julia> @benchmark istriu(y)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  35.099 μs …  1.205 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     53.339 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   53.877 μs ± 15.619 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                     ▁▆▄▄▇▃▆▆▅▅▅█▅▃▆▃▄▄▃▂▁ ▁ ▂  ▁
  ▂▅▅▆▄▄▅▄▅▅▅▅▄▆▄▅▅▄▇██████████████████████████▇█▇██▇▇▇▆▅▅▇▄▃ ▅
  35.1 μs         Histogram: frequency by time        69.9 μs <

 Memory estimate: 4.17 KiB, allocs estimate: 80.

For completely non-triangular (i.e. random) inputs this will perform worse, because the generic implementation checks row by row, but I guess that's an edge case.

maleadt commented 7 months ago

Metal failure unrelated.