JuliaStats / Statistics.jl

The Statistics stdlib that ships with Julia.
https://juliastats.org/Statistics.jl/dev/
Other
71 stars 40 forks source link

Julep: allow passing an operator to `var` #29

Open timholy opened 4 years ago

timholy commented 4 years ago

This is a bit of a cross-post from https://github.com/JuliaGraphics/ColorVectorSpace.jl/issues/126, but let me provide a more succinct summary specialized to this package. Consider the following demo:

julia> using Statistics, StaticArrays

julia> a = rand(SVector{3,Float32}, 3)
3-element Array{SArray{Tuple{3},Float32,1,3},1}:
 [0.46784067, 0.55236137, 0.30518198]
 [0.45679474, 0.8746067, 0.7767879]  
 [0.30195618, 0.48870897, 0.37561262]

julia> var(a)
3-element SArray{Tuple{3},Float32,1,3} with indices SOneTo(3):
 0.008602443
 0.042801797
 0.06471903 

julia> var(a; dims=1)
ERROR: MethodError: no method matching abs2(::SArray{Tuple{3},Float32,1,3})
Closest candidates are:
  abs2(::Missing) at missing.jl:100
  abs2(::Bool) at bool.jl:84
  abs2(::Real) at number.jl:157
  ...
Stacktrace:

There's a sense in which the error is the correct outcome, and the one that returns a vector is a bit of a misnomer; after all, the definition of var as E[(X-μ)^2] requires that one can compute y^2 for objects drawn from X, but * does not mean elementwise multiplication.

Over in ColorVectorSpace we've hit upon the following idea: defining (\cdot) as the "dot product", operating elementwise (like broadcasting-* but we can't pass that as a function argument and therefore need a different operator), and (\otimes) to mean the outer product. In that circumstance, var(⋅, a) becomes the mean-square-error, var(⊙, a) becomes the elementwise variance, and var(⊗, a) is the covariance.

I'd be interested in your thoughts. I also think it makes sense for ColorVectorSpace and maybe StaticArrays (CC @c42f) to synchronize, if possible.

ararslan commented 4 years ago

That's a really cool idea! My only concern is with the proposed API, since for similar functions like mean, mean(f, x) means mean(map(f, x)). It would be kind of weird for it to mean something different for var, IMO.

timholy commented 4 years ago

I totally agree. It's a bit like mapreduce(⊗, +, a, a), I think. Not sure how to clarify this though.

kimikage commented 4 years ago

BTW, is the operation (X-μ) obvious? I allow that the color difference can be a color, but I don't think that is the only axiom.

timholy commented 4 years ago

As I'm thinking about it, X-μ is likely to enjoy more widespread support. Can you think of practical cases where the multiplication would be defined but the subtraction not?

kimikage commented 4 years ago

I am concerned of the semantically ambiguous definitions, not the absence of the definition. In fact, Colors.colordiff returns a scalar value, not a color or vector. I think (X-μ) is more semantically important than ^2. If we consider the contextual dependency of ^2, it might be reasonable to consider the contextual dependency of (X-μ), too. The above is an argument of "semantics". The syntactic problem is that the subtraction can lead to undesired promotions (of the element types).

This is different from the discussion of OP, but I hope Statistics provides an API for the square error function with "two" arguments for package developers.

c42f commented 4 years ago

To be honest, I'm quite surprised that var(rand(SVector{3,Float32}, 3)) even works: Naively I'd expect it to be an error because * is not defined on vectors!

Being able to use various definitions of * to get inner, outer and elementwise products would certainly be cool.

In fact, Colors.colordiff returns a scalar value, not a color or vector. I think (X-μ) is more semantically important than ^2.

This is an interesting observation. A more basic but very related problem is how μ should be computed. Currently we don't even support elements of an affine space in mean, even though the definition of mean as a convex combination is very clear there. Instead I think we require the eltype of the iterator given to mean to be in a vector space — or at least support vector addition and scalar multiplication? This is the option which leads to an efficient generic implementation in terms of sum.

(Err... though looking into the code for mean([[1,1], [1,1]]) it seems it we still compute [2,2]/2 and that's not an error even without broadcasting!! I wonder why?)

Anyway, I guess color spaces are generally curved spaces with a nontrivial metric? In that case it appears a variance can be defined consistently without computing the mean or any differences, for example: https://golem.ph.utexas.edu/category/2017/02/variance_in_a_metric_space.html

Exactly the same observations could be made for geographic coordinate systems, and the same tradeoffs that face the design of ColorVectorSpaces: sometimes one wants to treat points in the spaces as vectors which is an accurate local approximation. But globally the approximation becomes quite bad.

I think this points in an interesting but frustratingly complex direction:

c42f commented 4 years ago

Being able to use various definitions of * to get inner, outer and elementwise products would certainly be cool.

I should point out that regardless of being "cool" I know for a fact that being able to get the outer product version is also useful for SVector. This kind of covariance is the foundation of many algorithms for local feature extraction in point cloud processing.

timholy commented 4 years ago

Now I see your concern, @kimikage. Yes, if you have a metric, then more generally μ can be defined as the point that minimizes the sum-of-distances to all other points in the sample. This definition does not even require + or - to be defined. @c42f, thanks for the link, I'd not yet thought through to the obvious analog, that you can define the variance without having a mean. In practice that appears to turn its computation into an O(N^2) problem rather than an O(N) problem, where N is the number of points in the collection. But it would be nice to support this when it's necessary.

Looks like the covariance can be defined similarly: https://arxiv.org/pdf/1106.5758.pdf

(Err... though looking into the code for mean([[1,1], [1,1]]) it seems it we still compute [2,2]/2 and that's not an error even without broadcasting!! I wonder why?)

I'm probably misunderstanding, but vector spaces admit multiplication by a scalar.

I think this points in an interesting but frustratingly complex direction

Agreed, but I think it's clearly the right roadmap.

Wrt API, one option would be to allow keyword arguments var(iter; diff=-, square=*). There's a bit of a problem passing a custom function for computing the mean, because var already takes a keyword mean for passing the mean value. Maybe meanop=Statistics.mean?

A agree with @c42f that ideally we might want a trait to pick these defaults automatically. However, in cases where the normal definitions don't apply, perhaps (as in the case of wanting to support multiple notions of product) it might be better to have people specify the operations directly. We could even support the case diff=nothing, square=nothing if the user also supplied a metric keyword.

kimikage commented 4 years ago

I don't want to write many keyword arguments:sweat_smile:, so I think the level of abstraction as follows is fine:

var((x, μ)->(x - μ)^2, [1, 2, 3])
timholy commented 4 years ago

Nice. This emphasizes that we hardly need var, the above is just

μ = mean(iter)
v = mean(x->(x-μ)^2, iter)

other than for the N/N-1 bias correction.

kimikage commented 4 years ago

we hardly need var

You're right. However, the element-wise function with two arguments is simple when considering the dims keyword. cf. something complicated https://github.com/JuliaImages/Images.jl/issues/872#issuecomment-597069100

c42f commented 4 years ago

I'm probably misunderstanding, but vector spaces admit multiplication by a scalar.

Sure, but this is literally computed as [2,2]/2, so it appears we've allowed not just scalar multiplication, but scalar division as a special case. I think this is surprising as we decided not to allow other similar things like [2,2] + 2.

(I can guess that this might have been to preserve correct rounding and solve some promotion issues, but a lazy inverse type would have done the same.)

c42f commented 4 years ago

I think this is surprising as we decided not to allow other similar things like [2,2] + 2.

Thinking further about that, I guess scalar division isn't ambiguous in the same way that scalar addition is, so maybe it's fine... I'll stop thinking about that for now :-D