Open musm opened 5 years ago
But colwise and pairwise have very different meanings here, independent of whether data points are viewed as columns or rows. colwise is meant to compute distances only between corresponding data points/columns. pairwise is meant to compute all pairwise distances, and can be used to compute the distances between matrices with different number of data points (say, columns), which is impossible for colwise, except for when one of the two data sets is a single data point, i.e., a vector.
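To make the distinction concrete, here is a minimal sketch in plain Julia. The helper names euclid, colwise_sketch and pairwise_sketch are made up for illustration and are not the Distances.jl implementations; they only mirror the semantics described above.

```julia
using LinearAlgebra  # for norm

# Hypothetical helpers for illustration only (not the Distances.jl code).
euclid(a, b) = norm(a - b)

# Column i of X is compared only with column i of Y,
# so both matrices must have the same number of columns.
function colwise_sketch(d, X, Y)
    size(X, 2) == size(Y, 2) || throw(DimensionMismatch("column counts differ"))
    return [d(x, y) for (x, y) in zip(eachcol(X), eachcol(Y))]
end

# Every column of X is compared with every column of Y,
# so the column counts may differ.
pairwise_sketch(d, X, Y) = [d(x, y) for x in eachcol(X), y in eachcol(Y)]

X = [0.0 3.0; 0.0 4.0]           # two points as columns: (0,0) and (3,4)
Y = [3.0 3.0; 4.0 4.0]           # two points as columns: (3,4) and (3,4)
Z = [0.0 3.0 6.0; 0.0 4.0 8.0]   # three points as columns

c = colwise_sketch(euclid, X, Y)    # vector of length 2: [5.0, 0.0]
p = pairwise_sketch(euclid, X, Z)   # 2×3 matrix: [0.0 5.0 10.0; 5.0 0.0 5.0]
```

Calling colwise_sketch(euclid, X, Z) throws a DimensionMismatch in this sketch, which is the situation described above where only pairwise applies.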
Good point regarding pairwise (comment edited). I think I was getting at a rowwise that computes the distances between rows of the matrices, i.e. rowwise(dist, X, Y) = colwise(dist, X', Y')
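The proposed identity can be checked with a small sketch in plain Julia. The names euclid, colwise_sketch and rowwise_sketch are hypothetical helpers written for this illustration; rowwise is not assumed to exist in Distances.jl.

```julia
using LinearAlgebra  # for norm

# Hypothetical helpers for illustration only.
euclid(a, b) = norm(a - b)
colwise_sketch(d, X, Y) = [d(x, y) for (x, y) in zip(eachcol(X), eachcol(Y))]
rowwise_sketch(d, X, Y) = [d(x, y) for (x, y) in zip(eachrow(X), eachrow(Y))]

X = [1.0 2.0; 3.0 4.0; 5.0 6.0]   # three data points stored as rows
Y = [1.0 2.0; 0.0 0.0; 5.0 7.0]

r = rowwise_sketch(euclid, X, Y)    # [0.0, 5.0, 1.0]
c = colwise_sketch(euclid, X', Y')  # same values, computed on the lazy adjoints
```

Note that ' is the adjoint, which also conjugates complex entries; for complex data transpose(X) would be the matching choice.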
I suppose one reason is that Julia matrices are stored in column-major order, so a row-wise loop might cause performance issues.
julia> x = rand(100, 100);
julia> y = rand(100, 100);
julia> @benchmark colwise(Euclidean(), x, y)
BenchmarkTools.Trial:
memory estimate: 896 bytes
allocs estimate: 1
--------------
minimum time: 1.742 μs (0.00% GC)
median time: 1.804 μs (0.00% GC)
mean time: 2.268 μs (19.13% GC)
maximum time: 3.196 ms (99.90% GC)
--------------
samples: 10000
evals/sample: 10
julia> @benchmark colwise(Euclidean(), x', y')
BenchmarkTools.Trial:
memory estimate: 928 bytes
allocs estimate: 3
--------------
minimum time: 8.752 μs (0.00% GC)
median time: 8.893 μs (0.00% GC)
mean time: 9.459 μs (3.82% GC)
maximum time: 3.631 ms (99.62% GC)
--------------
samples: 10000
evals/sample: 3
With https://github.com/JuliaLang/julia/pull/32310 we should probably drop colwise(dist, x, y) in favor of map(d, eachcol(x), eachcol(y)). Cc: @simonbyrne
I guess we should keep colwise(dist, x, y), make map(d, eachcol(x), eachcol(y)) the default, but allow specialized methods to optimize for performance.
using Distances, BenchmarkTools
d = Euclidean(); a = rand(5, 100); b = rand(5, 100);
@btime map($d, $(eachcol(a)), $(eachcol(b))); # 3.090 μs (304 allocations: 13.45 KiB)
@btime colwise($d, $a, $b); # 380.907 ns (1 allocation: 896 bytes)
@dkarrasch the point of https://github.com/JuliaLang/julia/pull/32310 is that you could write specialized versions that can leverage the memory layout: in this case, you could do:
Base.map(d::Distance, a::EachCol, b::EachCol) = colwise(d, parent(a), parent(b))
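As a rough sketch of the same idea without touching Base.map: the snippet below assumes Julia 1.9 or later, where eachcol returns a Base.ColumnSlices wrapper (the released name for what the line above calls EachCol) and parent recovers the underlying matrix. The names sqeuclid and distmap are hypothetical and are not part of Distances.jl.

```julia
# Assumes Julia >= 1.9, where eachcol(A) isa Base.ColumnSlices.

# Hypothetical distance function used only for this illustration.
sqeuclid(a, b) = sum(abs2, a .- b)

# Generic fallback: works for any pair of iterables of vectors.
distmap(d, as, bs) = map(d, as, bs)

# Specialized method: both arguments wrap whole matrices, so recover them
# with parent and loop over the data in column-major order.
function distmap(::typeof(sqeuclid), as::Base.ColumnSlices, bs::Base.ColumnSlices)
    A, B = parent(as), parent(bs)
    Base.require_one_based_indexing(A, B)
    size(A) == size(B) || throw(DimensionMismatch("matrices must have the same size"))
    out = zeros(float(promote_type(eltype(A), eltype(B))), size(A, 2))
    for j in axes(A, 2), i in axes(A, 1)   # i is the inner (fastest) index
        out[j] += abs2(A[i, j] - B[i, j])
    end
    return out
end

a = rand(5, 100); b = rand(5, 100)
fast = distmap(sqeuclid, eachcol(a), eachcol(b))   # hits the specialized method
slow = map(sqeuclid, eachcol(a), eachcol(b))       # plain map over column views
```

Using a separate function here avoids adding a method to Base.map for types this code does not own; a package that owns the distance type could specialize map directly, as suggested above.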
Why is there only a colwise function and not also a matching rowwise function?