JuliaStats / Distances.jl

A Julia package for evaluating distances (metrics) between vectors.

Quite different performance of pairwise on CPU vs GPU #143

Open xukai92 opened 5 years ago

xukai92 commented 5 years ago

I found that pairwise with SqEuclidean is faster than my own implementation on the CPU but slower on the GPU. Any idea why, and is there a possible optimization on the Distances.jl side?

MWE:

using Test, BenchmarkTools, Distances, CuArrays

function pairwise_dot_kai(x)
    d, n = size(x)
    xixj = x' * x
    xsq = sum(x .^ 2; dims=1)
    return repeat(xsq, n, 1) + repeat(xsq', 1, n) - 2xixj
end

pairwise_dot(x) = pairwise(SqEuclidean(), x; dims=2)

xbench = randn(Float32, 784, 200);

@benchmark pairwise_dot_kai(xbench)
BenchmarkTools.Trial: 
  memory estimate:  1.52 MiB
  allocs estimate:  17
  --------------
  minimum time:     854.227 μs (0.00% GC)
  median time:      1.183 ms (0.00% GC)
  mean time:        1.361 ms (12.37% GC)
  maximum time:     125.259 ms (98.46% GC)
  --------------
  samples:          3662
  evals/sample:     1
@benchmark pairwise_dot(xbench)
BenchmarkTools.Trial: 
  memory estimate:  166.59 KiB
  allocs estimate:  204
  --------------
  minimum time:     359.751 μs (0.00% GC)
  median time:      406.615 μs (0.00% GC)
  mean time:        458.925 μs (6.46% GC)
  maximum time:     104.066 ms (99.27% GC)
  --------------
  samples:          10000
  evals/sample:     1
xbench = xbench |> cu;
@benchmark pairwise_dot_kai(xbench)
BenchmarkTools.Trial: 
  memory estimate:  1.20 MiB
  allocs estimate:  19424
  --------------
  minimum time:     19.042 ms (0.00% GC)
  median time:      20.028 ms (0.00% GC)
  mean time:        21.811 ms (3.62% GC)
  maximum time:     52.425 ms (38.23% GC)
  --------------
  samples:          230
  evals/sample:     1
@benchmark pairwise_dot(xbench)
BenchmarkTools.Trial: 
  memory estimate:  10.99 MiB
  allocs estimate:  240635
  --------------
  minimum time:     453.229 ms (0.00% GC)
  median time:      470.074 ms (0.00% GC)
  mean time:        474.353 ms (2.67% GC)
  maximum time:     499.969 ms (6.04% GC)
  --------------
  samples:          11
  evals/sample:     1
xukai92 commented 5 years ago

PS: the GPU support of pairwise is based on the branch of this PR: https://github.com/JuliaStats/Distances.jl/pull/142

nalimilan commented 5 years ago

pairwise uses an explicit loop, while your implementation calls higher-level operations like * and broadcast, which I guess are optimized for CUDA. Not sure what to do about this, since 1) pairwise works for many distances, some of which might not be expressible in terms of high-level operations, and 2) Distances.jl shouldn't depend on CuArrays. Maybe we can add special methods once optional dependencies are supported in Julia.
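
For SqEuclidean specifically, such a special method could reuse the same GEMM-plus-broadcast formulation as in the original post. A minimal, hypothetical sketch (pairwise_dot_gemm is not part of Distances.jl); it works on both Array and CuArray because it only uses *, sum, and broadcasting:

# ‖xi − xj‖² = ‖xi‖² + ‖xj‖² − 2⟨xi, xj⟩, with observations stored as columns
function pairwise_dot_gemm(x::AbstractMatrix; dims=2)
    dims == 2 || throw(ArgumentError("this sketch only handles dims=2"))
    xsq = sum(abs2, x; dims=1)       # 1×n row of squared column norms
    G = x' * x                       # n×n Gram matrix, a single GEMM
    r = xsq .+ xsq' .- 2 .* G        # broadcasted combination, no explicit loop
    return max.(r, zero(eltype(r)))  # clamp tiny negatives from rounding
end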

johnnychen94 commented 5 years ago

Maybe we can add special methods once optional dependencies are supported in Julia.

I think Requires.jl provides this feature?
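
For reference, a hypothetical sketch of what the Requires.jl pattern might look like inside Distances.jl (the UUID should be verified against CuArrays' Project.toml; the method body is the same GEMM formulation as in the original post):

using Requires

function __init__()
    # Only evaluated when the user also loads CuArrays
    @require CuArrays="3a865a2d-5b23-5a0f-bc46-62713ec82fae" begin
        function Distances.pairwise(::SqEuclidean, x::CuArrays.CuMatrix; dims=2)
            dims == 2 || throw(ArgumentError("this sketch only handles dims=2"))
            xsq = sum(abs2, x; dims=1)
            return max.(xsq .+ xsq' .- 2 .* (x' * x), zero(eltype(x)))
        end
    end
end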

KristofferC commented 5 years ago

Requires.jl makes loading this package 20x slower (https://github.com/JuliaStats/Distances.jl/pull/123#issuecomment-466438344).

johnnychen94 commented 5 years ago

Requires.jl makes loading this package 20x slower

Technically, it's Tables.jl. On my laptop, Requires.jl slows loading of this package by 4-6x:

# with Requires.jl
julia> @time using Distances
  0.276511 seconds (403.27 k allocations: 21.736 MiB, 6.36% gc time)

# without Requires.jl
julia> @time using Distances
  0.063862 seconds (63.23 k allocations: 3.801 MiB)

But Requires.jl is still a heavy dependency compared to Distances.jl. I guess what we can do for now is create a separate package, CudaDistances.jl, that accelerates the code paths for CuArray and reexports Distances.jl.
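
A rough, hypothetical sketch of such a wrapper package (none of this exists yet; the CuArray method reuses the GEMM formulation from the original post):

module CudaDistances

using Reexport, CuArrays
@reexport using Distances   # users only need `using CudaDistances`

# GPU-friendly path, dispatched on the array type; plain Arrays keep the
# loop-based implementation from Distances.jl.
function Distances.pairwise(::SqEuclidean, x::CuArrays.CuMatrix; dims=2)
    dims == 2 || throw(ArgumentError("this sketch only handles dims=2"))
    xsq = sum(abs2, x; dims=1)
    return max.(xsq .+ xsq' .- 2 .* (x' * x), zero(eltype(x)))
end

end # module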

nalimilan commented 3 years ago

One solution to this would be to choose the best algorithm based on traits like those provided by ArrayInterface.
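
A hypothetical sketch of such trait-based selection, assuming ArrayInterface's fast_scalar_indexing trait and reusing the pairwise_dot_gemm helper sketched earlier in the thread:

using Distances, ArrayInterface

# Pick the loop-based path for arrays with fast scalar indexing (e.g. Array)
# and a GEMM/broadcast path otherwise (e.g. CuArray), without depending on
# any particular GPU package.
function pairwise_best(d::SqEuclidean, x::AbstractMatrix; dims=2)
    if ArrayInterface.fast_scalar_indexing(typeof(x))
        return pairwise(d, x; dims=dims)
    else
        return pairwise_dot_gemm(x; dims=dims)
    end
end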