JuliaStats / Distances.jl

A Julia package for evaluating distances (metrics) between vectors.

colwise function #138

Open musm opened 5 years ago

musm commented 5 years ago

Why is there only a colwise function and not also a matching rowwise function?

dkarrasch commented 5 years ago

But colwise and pairwise have very different meanings here, independent of whether data points are viewed as columns or rows. colwise computes distances only between corresponding data points (columns). pairwise computes all pairwise distances, so it can be applied to matrices with different numbers of data points (say, columns). That is impossible for colwise, except when one of the two data sets is a single data point, i.e., a vector.
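
To make the distinction concrete, here is a small sketch using the exported `colwise` and `pairwise` functions. The matrices `X`, `Y`, `Z` and the variable names are made up for illustration, and the `dims=2` keyword assumes a reasonably recent Distances.jl release.

```julia
using Distances

# Columns are data points: X and Y each hold three 2-D points.
X = [0.0 1.0 2.0;
     0.0 0.0 0.0]
Y = [0.0 1.0 2.0;
     1.0 1.0 1.0]
# Z holds only two points.
Z = [0.0 5.0;
     0.0 0.0]

d = Euclidean()

# colwise: one distance per pair of corresponding columns (length-3 vector).
c = colwise(d, X, Y)

# colwise also accepts a single data point (a vector) on one side:
# the distance from each column of X to the origin.
v = colwise(d, X, [0.0, 0.0])

# pairwise: all combinations, so the column counts may differ.
# P is 3×2 with P[i, j] = d(X[:, i], Z[:, j]).
P = pairwise(d, X, Z, dims=2)
```

Calling `colwise(d, X, Z)` is expected to fail here, because the two matrices do not have the same number of columns.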

musm commented 5 years ago

Good point regarding pairwise (comment edited). I think I was getting at a rowwise that computes the distances between corresponding rows of the matrices, i.e., rowwise(dist, X, Y) = colwise(dist, X', Y')
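
A minimal sketch of what such a helper could look like. Note that `rowwise` and `rowwise_copy` are hypothetical names and not part of Distances.jl. The sketch uses `transpose` rather than `'` so that complex inputs are not conjugated, which is an assumption about the intended semantics.

```julia
using Distances

# Hypothetical helper; Distances.jl does not export a `rowwise` function.
# `transpose` is a lazy wrapper: nothing is copied, but each "column" of
# transpose(X) is a strided row of X in memory.
rowwise(dist, X, Y) = colwise(dist, transpose(X), transpose(Y))

# Hypothetical alternative: materialize the transposes once so that the
# distance computation runs over contiguous columns.
rowwise_copy(dist, X, Y) = colwise(dist, permutedims(X), permutedims(Y))

# Rows are data points here: two 2-D points per matrix.
X = [0.0 0.0;
     1.0 0.0]
Y = [3.0 4.0;
     1.0 0.0]

r1 = rowwise(Euclidean(), X, Y)
r2 = rowwise_copy(Euclidean(), X, Y)

# The same result without any helper, one call per pair of rows.
r3 = map(Euclidean(), eachrow(X), eachrow(Y))
```

All three should agree; which one is fastest likely depends on the matrix size and on the distance type.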

johnnychen94 commented 5 years ago

I suppose one reason is that Julia matrices are stored in column-major order, so a row-wise loop might cause performance issues.

julia> using Distances, BenchmarkTools

julia> x = rand(100, 100);

julia> y = rand(100, 100);

julia> @benchmark colwise(Euclidean(), x, y)
BenchmarkTools.Trial: 
  memory estimate:  896 bytes
  allocs estimate:  1
  --------------
  minimum time:     1.742 μs (0.00% GC)
  median time:      1.804 μs (0.00% GC)
  mean time:        2.268 μs (19.13% GC)
  maximum time:     3.196 ms (99.90% GC)
  --------------
  samples:          10000
  evals/sample:     10

julia> @benchmark colwise(Euclidean(), x', y')
BenchmarkTools.Trial: 
  memory estimate:  928 bytes
  allocs estimate:  3
  --------------
  minimum time:     8.752 μs (0.00% GC)
  median time:      8.893 μs (0.00% GC)
  mean time:        9.459 μs (3.82% GC)
  maximum time:     3.631 ms (99.62% GC)
  --------------
  samples:          10000
  evals/sample:     3

nalimilan commented 5 years ago

With https://github.com/JuliaLang/julia/pull/32310 we should probably drop colwise(dist, x, y) in favor of map(dist, eachcol(x), eachcol(y)). Cc: @simonbyrne

dkarrasch commented 5 years ago

I guess we should keep colwise(dist, x, y), make map(d, eachcol(x), eachcol(y)) the default, but allow specialized methods to optimize for performance.

using Distances, BenchmarkTools
d = Euclidean(); a = rand(5, 100); b = rand(5, 100);
@btime map($d, $(eachcol(a)), $(eachcol(b))); # 3.090 μs (304 allocations: 13.45 KiB)
@btime colwise($d, $a, $b); # 380.907 ns (1 allocation: 896 bytes)

simonbyrne commented 5 years ago

@dkarrasch the point of https://github.com/JuliaLang/julia/pull/32310 is that you could write specialized versions that can leverage the memory layout: in this case, you could do:

Base.map(d::Distance, a::EachCol, b::EachCol) = colwise(d, parent(a), parent(b))
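
The "generic default plus specialized methods" design from the last few comments can be sketched with ordinary multiple dispatch, without relying on the types proposed in the linked PR. `colwise_sketch` is a hypothetical name chosen to avoid clashing with `Distances.colwise`, and the hand-written Euclidean loop is only an illustration, not the package's implementation.

```julia
using Distances

# Generic fallback: works for any callable distance, one call per column pair.
colwise_sketch(d, a::AbstractMatrix, b::AbstractMatrix) =
    map(d, eachcol(a), eachcol(b))

# Specialized method for Euclidean: walks both matrices in column-major
# order and writes into a single preallocated result vector.
function colwise_sketch(::Euclidean, a::AbstractMatrix, b::AbstractMatrix)
    axes(a) == axes(b) || throw(DimensionMismatch("a and b must have the same axes"))
    T = float(promote_type(eltype(a), eltype(b)))
    r = Vector{T}(undef, size(a, 2))
    for (k, j) in enumerate(axes(a, 2))
        s = zero(T)
        for i in axes(a, 1)
            s += abs2(a[i, j] - b[i, j])
        end
        r[k] = sqrt(s)
    end
    return r
end

a = reshape(collect(1.0:12.0), 3, 4)   # four 3-D points, one per column
b = a .+ 1.0

fast = colwise_sketch(Euclidean(), a, b)   # dispatches to the specialized loop
slow = colwise_sketch(Cityblock(), a, b)   # falls back to map over eachcol
```

Callers use the same function either way; dispatch picks the fast path whenever one has been defined for the distance type.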