JuliaManifolds / ManifoldML.jl

MLJ Integration #3

Open ablaom opened 3 years ago

ablaom commented 3 years ago

Continuing the discussion in https://github.com/JuliaManifolds/ManifoldML.jl/issues/2 (and on slack). Some thoughts on what the issues might be.

As I understand it, a point on an arbitrary Manifold object does not generally know which manifold it belongs to, correct? This is fine as far as working with these points inside your manifold-specific algorithms goes, but it is not ideal for integration with the rest of the ML ecosystem. The problem is roughly analogous to categorical variables. Internally these are usually represented as integers, but algorithms still need to know the total number of possible classes to avoid problems, such as certain classes disappearing on resampling. Passing this information around is not as easy as it first appears. Life is much easier (for a toolbox like MLJ) if we simply assume every point knows all the classes, and that is why we (and other packages) insist on the use of CategoricalArrays for representing such data (although ordinary arrays of some "categorical value" type would also have sufficed).
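To make the categorical analogy concrete, here is a minimal sketch in plain Julia. It does not use CategoricalArrays; ToyCategorical is a hypothetical type invented for this illustration, and the hard-coded index vector stands in for a resampling step that happens to miss one class.

```julia
# Toy illustration with plain arrays (no CategoricalArrays dependency).
labels = ["a", "b", "c", "a", "b", "a"]
all_classes = sort(unique(labels))

# A "resample" that happens to miss class "c" (stored at index 3).
idx = [1, 2, 4, 5, 6]
subsample = labels[idx]
seen_classes = sort(unique(subsample))   # class "c" is no longer visible

# Hypothetical element type that carries the full class pool with each value.
struct ToyCategorical
    value::String
    pool::Vector{String}
end

wrapped = [ToyCategorical(l, all_classes) for l in labels]
wrapped_subsample = wrapped[idx]

# Any single element still knows every class, even after subsampling.
recovered = wrapped_subsample[1].pool
```

The same reasoning motivates letting each manifold point carry (or be able to recover) its manifold.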

In the future, we might have algorithms which deal with mixed data types, one or more of which is a manifold type (think of geophysical applications), and having to keep track of metadata for a subset of variables gets messy.

So my tentative suggestion would be that MLJ users present input data for a supervised learning algorithm from the ManifoldML package as an abstract vector of "manifold points", where a "manifold point" combines the manifold to which the point belongs with some internal representation. This could be as simple as a tuple (M, p), for example. We define a new scientific type ManifoldPoint{M}, where M is the concrete manifold type, and declare scitype((M, p)) = ManifoldPoint{typeof(M)}. Then your input type declarations in the implementation of the MLJ interface would look something like:

input_scitype(::ManifoldKNNRegressor) = AbstractVector{<:ManifoldPoint{<:MetricManifold}}

And the rest would be straightforward, I should think.
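A rough, self-contained sketch of this suggestion follows. It assumes nothing from ScientificTypes.jl or Manifolds.jl: ToyManifold, ToySphere, toy_scitype and accepts are hypothetical stand-ins, and only the name ManifoldPoint comes from the proposal above.

```julia
# Stand-ins for Manifolds.jl types; all names here are hypothetical.
abstract type ToyManifold end
struct ToySphere <: ToyManifold
    n::Int   # dimension of the sphere
end

# Proposed scientific type, parameterised by the concrete manifold type.
abstract type ManifoldPoint{TM} end

# Toy version of `scitype`, defined only for (manifold, point) tuples.
toy_scitype(x::Tuple{TM,Any}) where {TM<:ToyManifold} = ManifoldPoint{TM}

# Element-wise check a model could run on its input before training.
accepts(X::AbstractVector) =
    all(x -> toy_scitype(x) <: ManifoldPoint{<:ToyManifold}, X)

M = ToySphere(2)
X = [(M, [1.0, 0.0, 0.0]), (M, [0.0, 1.0, 0.0])]
```

Here toy_scitype(X[1]) evaluates to ManifoldPoint{ToySphere}, so a model can state its requirement on the input purely in terms of the manifold type.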

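Other random thoughts:

  • Maybe there is some way to "decorate" existing manifolds to enforce the kind of point representation we want. I don't really understand this decorating business enough to say, or if this is really an advantage.

  • Maybe we want to refine the scitype to include the number_type as type parameter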

@kellertuer @mateuszbaran Your thoughts?

mateuszbaran commented 3 years ago

I think this sounds like a good plan. We can definitely make a wrapper so that each point knows its manifold. In fact that's how my early prototypes worked but it turns out to be very inefficient for many algorithms. But an MLJ <-> Manifolds compatibility layer could just unwrap and wrap the result, this is fine.
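A minimal sketch of such an unwrap/wrap compatibility layer, under the assumption that all points in a batch share one manifold. WrappedPoint, unwrap, wrap and toy_algorithm are hypothetical names, and a Symbol stands in for a real manifold object.

```julia
# Hypothetical wrapper: each point carries its manifold.
struct WrappedPoint{TM,TP}
    M::TM
    p::TP
end

# Unwrap once at the boundary: one manifold plus a plain vector of raw points.
function unwrap(points::AbstractVector{<:WrappedPoint})
    M = first(points).M
    all(q -> q.M == M, points) ||
        throw(ArgumentError("points lie on different manifolds"))
    return M, [q.p for q in points]
end

# Wrap a raw result again before handing it back to the caller.
wrap(M, p) = WrappedPoint(M, p)

# Placeholder for a manifold-specific algorithm that works on raw arrays.
# Here: the arithmetic mean projected back onto the unit sphere.
function toy_algorithm(M, raw)
    m = sum(raw) / length(raw)
    return m / sqrt(sum(abs2, m))
end

M = :unit_sphere   # a Symbol stands in for a manifold object
data = [WrappedPoint(M, [1.0, 0.0]), WrappedPoint(M, [0.0, 1.0])]
M2, raw = unwrap(data)
result = wrap(M2, toy_algorithm(M2, raw))
```

The inner algorithm never sees the wrapper, so the per-point overhead is paid only at the interface boundary.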

  • Maybe there is some way to "decorate" existing manifolds to enforce the kind of point representation we want. I don't really understand this decorating business enough to say, or if this is really an advantage.

That decorator thing we have isn't particularly intuitive but works fine for our purposes. In this case however representation needs to be enforced at a different level. We will figure something out.

  • Maybe we want to refine the scitype to include the number_type as type parameter

How would that information be used in the MLJ ecosystem? Manifolds.jl is quite good at figuring out types of temporaries and results from types of arguments.

ablaom commented 3 years ago

How would that information be used in the MLJ ecosystem? Manifolds.jl is quite good at figuring out types of temporaries and results from types of arguments.

We wouldn't need it. It would only be necessary if you can imagine an algorithm which would only work for manifolds with that given number_type. We only need to include what is necessary for you to articulate your requirements on the input.

kellertuer commented 3 years ago

Thanks for your ideas.

Concerning the “a point does not know which manifold it belongs to” – I see a small problem for efficiency attaching the manifold to any point (though our manifolds usually are only a few integers of information/storage). Maybe it would also be a good idea to store the manifold only with a batch of data? If we have a set of points (the training set for example) they all live on the same manifold. Would that be possible?

Concerning the decorator – it might take a while to carefully understand the approach we follow there, but the rough idea is as follows: For a manifold many things are assumed to exist – for example the metric. When people speak of the sphere they have the round metric in mind, often without specifically thinking about it. So we implemented the (default) sphere exactly like that. If now someone wants to “break out” of this default assumption and implement another metric (yielding other geodesics, distances), most things stay the same – the manifold dimension for example. So one can decorate or wrap the default sphere in a MetricManifold which can be used to dispatch the distance to its new implementation. Everything unrelated to a metric is just taken from the default implementation.
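The decorator idea can be sketched with toy types as follows. None of these names are the Manifolds.jl API; ToyMetricManifold only imitates the role described for MetricManifold, and the chordal distance is just a convenient stand-in for "another metric".

```julia
# Toy stand-ins; none of these names are the Manifolds.jl API.
abstract type ToyManifold end

struct ToySphere <: ToyManifold
    n::Int   # manifold dimension; points live in R^(n+1)
end

# Decorator: wraps a manifold and tags it with a different metric.
struct ToyMetricManifold{TM<:ToyManifold,G} <: ToyManifold
    manifold::TM
    metric::G
end
struct ChordalMetric end   # hypothetical alternative "metric"

# Default implementations on the plain sphere (round metric).
toy_dimension(M::ToySphere) = M.n
toy_distance(::ToySphere, p, q) = acos(clamp(sum(p .* q), -1.0, 1.0))

# Anything unrelated to the metric is forwarded to the wrapped manifold ...
toy_dimension(M::ToyMetricManifold) = toy_dimension(M.manifold)
# ... while the distance dispatches on the decorator to a new implementation.
toy_distance(::ToyMetricManifold{ToySphere,ChordalMetric}, p, q) =
    sqrt(sum(abs2, p .- q))

S = ToySphere(2)
C = ToyMetricManifold(S, ChordalMetric())
p, q = [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]
```

With these definitions toy_dimension(C) is forwarded to the sphere, while toy_distance(S, p, q) and toy_distance(C, p, q) use different formulas.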

Concerning your idea of enforcing a point representation: that should just be doable with <:MPoint and <:TVector types, actually. Even more – without storing the manifold explicitly, one could provide a get_manifold(::MyRepresentationMPoint) for these certain types of points and vectors. What do you think? This would be more flexible than storing the manifold. Of course we do not necessarily have to take our point/vector types; our implementations are more flexible. Still, from a type (and maybe the internal array size as a parameter) one can for sure determine the corresponding manifold.
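A small sketch of recovering the manifold from the point type alone, assuming the array size is available as a type parameter. ToySpherePoint and ToySphere are hypothetical, and get_manifold here is only an illustration of the function proposed above, not an existing Manifolds.jl function.

```julia
# Hypothetical point type whose type parameter N records the ambient array size.
struct ToySpherePoint{N}
    coords::NTuple{N,Float64}
end

struct ToySphere
    n::Int   # manifold dimension
end

# The manifold is recovered from the type alone; nothing extra is stored per point.
get_manifold(::ToySpherePoint{N}) where {N} = ToySphere(N - 1)

p = ToySpherePoint((0.0, 0.0, 1.0))
M = get_manifold(p)
```

This only works for representations whose type pins down the manifold, which is why it is more restrictive on the point types but cheaper than storing the manifold with every point.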

mateuszbaran commented 3 years ago

Concerning the “a point does not know which manifold it belongs to” – I see a small problem for efficiency attaching the manifold to any point (though our manifolds usually are only a few integers of information/storage). Maybe it would also be a good idea to store the manifold only with a batch of data? If we have a set of points (the training set for example) they all live on the same manifold. Would that be possible?

Performance wouldn't be affected that much on Julia 1.5+ thanks to the memory layout changes of structs. I usually do care about performance and I don't think it would be a problem for this interface 🙂 . I will be perfectly fine with something like

struct PointAndManifold{TP,TM<:Manifold} <: MPoint
    p::TP
    M::TM
end

kellertuer commented 3 years ago

Then I am also fine with that variant, for sure.

ablaom commented 3 years ago

So there seem to be a few ways to move forward here:

  1. MLJ users provide input features as vectors of tuples of the form (p, M), where M is a manifold. We introduce a new scientific type ManifoldPoint{TM} and implement scitype((p,M::TM)) where TM<:Manifold = ManifoldPoint{TM}.

  2. A new struct is introduced as above (in the Manifolds.jl ecosystem), which is the type MLJ users present. I would reverse the order of the type parameters. Then the union type PointAndManifold{TM} could do double duty as a scientific type (no need to add one to ScientificTypes.jl) and we implement scitype(::PointAndManifold{TM}) where TM<:Manifold = TM. (For what it's worth, I would prefer the name ManifoldPoint or PointOnManifold to PointAndManifold.)

  3. We introduce both the new struct (in Manifolds*.jl) and a new scientific type (with different name).

  4. ?
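For comparison with option 1, here is a hedged sketch of option 2 with the manifold type as the first parameter. ToyManifold, ToySphere and toy_scitype are hypothetical, and the return value follows one reading of the "double duty" remark (the partially parameterised type itself); the item above writes the right-hand side as TM instead.

```julia
# Stand-ins for Manifolds.jl types; all names here are hypothetical.
abstract type ToyManifold end
struct ToySphere <: ToyManifold end

# Manifold type first, so PointOnManifold{TM} leaves the point type free.
struct PointOnManifold{TM<:ToyManifold,TP}
    M::TM
    p::TP
end

# The partially parameterised type itself serves as the "scientific type".
toy_scitype(::PointOnManifold{TM}) where {TM<:ToyManifold} = PointOnManifold{TM}

x = PointOnManifold(ToySphere(), [0.0, 0.0, 1.0])
y = PointOnManifold(ToySphere(), (0.0, 0.0, 1.0))   # different internal representation
```

Both x and y get the same toy_scitype even though their internal representations differ, which is the point of putting the manifold parameter first.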

@kellertuer @mateuszbaran Do you have a preference for how you want to proceed?

Side question: Do you have models where tangent vectors would be part of data presented by MLJ users? That is, do we need analogues of the above for tangent vectors?

kellertuer commented 3 years ago

I would prefer ManifoldPoint with 1, and we could provide an easy way to use those, i.e. define exp(p::ManifoldPoint, X) = exp(p.M, p.p, X) and such for ease of use.
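A sketch of such a convenience method, using a toy unit sphere so that it runs without Manifolds.jl. ToySphere and toy_exp are hypothetical names (the real code would extend the package's exp), and the field layout of ManifoldPoint is assumed from the p.M / p.p usage above.

```julia
struct ToySphere end   # stand-in for a manifold type; not the Manifolds.jl Sphere

# Assumed layout of the wrapper: manifold M plus raw point p.
struct ManifoldPoint{TM,TP}
    M::TM
    p::TP
end

# Exponential map on the unit sphere, written for raw arrays.
function toy_exp(::ToySphere, p, X)
    t = sqrt(sum(abs2, X))          # length of the tangent vector
    t == 0 && return copy(p)
    return cos(t) .* p .+ (sin(t) / t) .* X
end

# Convenience method: forward to the manifold-level function.
toy_exp(q::ManifoldPoint, X) = toy_exp(q.M, q.p, X)

q = ManifoldPoint(ToySphere(), [1.0, 0.0, 0.0])
r = toy_exp(q, [0.0, pi / 2, 0.0])   # walk a quarter great circle
```

As written, the result r is a raw array; whether such methods should return a wrapped ManifoldPoint instead is left open here.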

Concerning the tangent vectors – we also thought about that, and it's actually easy: A tangent vector X has to “know” its base point, but the tuple (p,X) is already a point on the tangent bundle, so it is a point on a manifold anyway. This is already implemented; it's a special case of a vector bundle https://juliamanifolds.github.io/Manifolds.jl/stable/manifolds/vector_bundle.html – we can surely highlight the tangent bundle more prominently.

mateuszbaran commented 3 years ago

I'm fine with either variant 1 or 2. It would be nice if users could just add tangent vectors wrapped in ManifoldPoint, or multiply them by scalars, but then the implementation has to be aware of our VectorBundle. I'm not quite sure now where such methods should be defined in variant 1.

i.e. define exp(p::ManifoldPoint, X) = exp(p.M, p.p, X) and such for ease of use.

That may not be the best example because X in this interface would not be an array. exp could just work on elements of the tangent bundle, right?

kellertuer commented 3 years ago

Ah, but to distinguish that correctly we might need a TangentVector indeed? I don't think so: for p being a point on the manifold, X can be an array, that's fine, and for p being on the tangent bundle, X would be an array from the “product tangent space”, i.e. (TpM)^2, but could still be a simple array in your ProductRepr style?

mateuszbaran commented 3 years ago

OK, after some discussion on Slack the conclusion is that MLJ could just do variant 1 and we will work out details of integration on the Manifolds side.