Closed tlienart closed 4 years ago
eventually then MLJBase would itself import MLJAbstract
Thank you @tlienart for the careful design. At a first glance, this approach works nicely and fits all devs needs. Basically, MLJModelInteface.jl (my proposal name) would be a package that is imported by MLJBase.jl and reexported so that devs interested in implementing a model would depend on MLJModelInterface.jl and devs interested in implementing more integration with the stack would depend on MLJBase.jl as it is currently done. That is the idea, right?
The model struct should be a subtype of
Deterministic
orProbabilistic
orUnsupervised
so these things should be available inMLJAbstract
I think we can do even better now that we have this opportunity to rethink the model interface. We can consider not forcing devs to subtype their own types with types defined in the stack. Perhaps, it is possible to safely exploit Julia's trait system so that we get extensibility without inheritance. There is a nice blog post I can't find now by @oxinabox explaining how to achieve generics in Julia with traits. It is something to think about. I am considering it seriously for my own types in GeoStats.jl for example because I want the methods implemented there to work on any domain type not implemented by myself and after the fact. By just implementing the right methods, users can send their domain types, models, etc to the stack without having to build wrapper types just to inherit from a type defined in our packages.
More specifically, it seems to me that the types Supervised
, Unsupervised
, Probabilistic
, Deterministic
are actually traits in a more general and extensible trait system. Unless they are strictly needed to facilite dispatch internally in MLJ, we could consider making them traits like:
issupervised(model::MLJModel) = false
isprobabilistic(model::MLJModel)= false
...
I actually did that in GeoStatsBase.jl here: https://github.com/juliohm/GeoStatsBase.jl/blob/master/src/learning/traits.jl
Now someone that has a type already implemented with a non-compatible interface can just add these two lines and have the model integrated with MLJ. Makes sense? No need to create a wrapper type to inherit from Supervised
or Probabilistic
. Also since Julia doesn't support multiple inheritance this gives us more expressivity to categorize models according to multiple criteria.
This is the traits post you were looking for https://invenia.github.io/blog/2019/11/06/julialang-features-part-2/
Before committing us on a course of action I should like to give this some more thought. However, my initial reaction to the proposal is postive, with the following comments:
@tlienart I don't like the idea of a "shadow" ScientificTypes within the interface package; I think there should be a cleaner solution. How about we instead split ScientificTypes
into ScientificTypes
and ScientificTypesBase
? The base package would just have the type definitions, and the functionality that is independent of any particular convention. This removes all the heavy lifting and will leave Tables.jl as it's sole dependency. Tables.jl takes 0.6 seconds to load on my machine. (Currently ScientificTypes takes 2.1 seconds.) The new MLJModelInterface
package imports ScientificTypesBase
, which is all it would need.
(edit) @tlienart I'm assuming MLJBase/src/model_traits.jl
, which defines stubs/fallbacks for the model traits, would go intoMLJModelInterface
, under your proposal?
@juliohm I agree that traits (which MLJ uses extensively already) would be better for Model
and its subtypes, Supervised
, and so forth. However, this represents a significant refactoring of code, spread over several repos. I'm not sure I could justify devoting core dev team resources to this at this point. If we had a suitable volunteer familiar the code and with plenty of free time... Or if Google donates one million dollars to the project... 😄
Before committing us on a course of action I should like to give this some more thought.
definitely
@tlienart I don't like the idea of a "shadow" ScientificTypes within the interface package; I think there should be a cleaner solution. How about we instead split ScientificTypes into ScientificTypes and ScientificTypesBase? The base package would just have the type definitions, and the functionality that is independent of any particular convention. This removes all the heavy lifting and will leave Tables.jl as it's sole dependency. Tables.jl takes 0.6 seconds to load on my machine. (Currently ScientificTypes takes 2.1 seconds.) The new MLJModelInterface package imports ScientificTypesBase, which is all it would need.
IMO even loading Tables
is too much, I understand what you're saying but it seems too much to ask to devs to load Tables
just because it makes our lives easier if they otherwise don't work with tables in their package but, for instance, just with bare matrices (the case of MLJLM for instance).
I think the "shadow typing" would be pretty easy to maintain and, inherently, not very different from having a "ScientificBase" package (except much faster as it doesn't require Tables). It may seem a bit ad-hoc but it seems like a small price to pay for increased adoption by devs.
(edit) @tlienart I'm assuming MLJBase/src/model_traits.jl, which defines stubs/fallbacks for the model traits, would go intoMLJModelInterface, under your proposal?
Not necessarily. We just need functionalities to allow the dev to declare the traits in some prescribed way assuming it will be interpreted by MLJBase; whatever as long it's very lightweight.
Finally we could offer a MLJBase.test_implementation(m)
that checks everything is fine which devs would put in their tests (where they would, actually, load MLJBase, but only there).
Side note: Implementers of probabilistic classifiers need Distributions
because UnivariateFinite
defined in MLJBase subtypes Distributions.Distribution
. This package is quite large but also pretty much essential for any serious statistical software that does not define their own. I'm guessing this is considerably bigger than Tables
.
@tlienart I have a general trick to remove dependencies in any interface, when these dependencies are needed only for functions (not types). In our case this suffices, because the only types that need to be referenced in MLJModelInterface are the ScientificTypes, which we can move into a ScientificTypesTypesOnlyPackage which will have trivial overhead.
Here's the general trick: First, define a mutable flag INTERFACE_FUNCTIONALITY
in the lightweight interface package - let's call it Light.jl - which is for declaring whether we are in Dummy()
or Live()
mode, and to set it there to Dummy()
.
in LightInterface.jl:
abstract type Mode end
struct Dummy <: Mode end
struct Live <: Mode end
const INTERFACE_FUNCTIONALITY = Ref{Mode}(Dummy())
Any function we want from some dependent package - matrix
from Tables.jl, say - we define right there in LightInterface.jl but dispatch it on the flag value, giving it a dummy return value when the flag is Dummy()
.
in Client package:
matrix(t) = matrix(t, INTERFACE_FUNCTIONALITY[])
matrix(t, ::Dummy) = error("Only `LightInterface` loaded. Do `import Interface`.")
Any client wanting to implement the interface can now import LightInterface and make calls like matrix(t)
(and t
could be anything). Of course executing the client's code will fail at this point. However, we fix this in the full interface package, let's call it Interface.jl.
First, the __init__
function of Interface.jl
resets the flag to Live()
. Since Interface imports LightInterface, the flag will never be Dummy()
so long as Interface has been imported. Second, we
complete the definition of matrix
in the Live()
case, now that Tables.jl is available (Tables.jl being a dependency of Interface.jl).
in Interface.jl:
function __init__()
LightInterface.INTERFACE_FUNCTIONALITY[] = LightInterface.Live()
end
using ..LightInterface
import Tables # the elephant LightInterface didn't want
LightInterface.matrix(t, ::Live) = Tables.matrix(t)
Now, if the client package and Interface.jl are both loaded, it can actually be used.
What do you think?
Here's a proof of concept:
THE LIGHT INTERFACE
module LightInterface
export Dummy, Live
export matrix
abstract type Mode end
struct Dummy <: Mode end
struct Live <: Mode end
const INTERFACE_FUNCTIONALITY = Ref{Mode}(Dummy())
matrix(t) = matrix(t, INTERFACE_FUNCTIONALITY[])
matrix(t, ::Dummy) =
error("Only `LightInterface` loaded. Do `import Interface`.")
end
MODULE IMPLEMENTING THE LIGHT INTERFACE
module Client
export double
using ..LightInterface
double(t) = 2*matrix(t)
end
CAN LOAD CLIENT CODE; JUST CANT USE IT
julia> using .Client
julia> t = (x1=rand(3), x2=rand(3)) # a table
julia> double(t)
ERROR: Only `LightInterface` loaded. Do `import Interface`.
THE FULL INTERFACE
module Interface
export matrix
using ..LightInterface
import Tables # the elephant LightInterface didn't want
# In a package this would be in the __init__ function:
LightInterface.INTERFACE_FUNCTIONALITY[] = LightInterface.Live()
LightInterface.matrix(t, ::Live) = Tables.matrix(t)
end
IMPORTING THE FULL INTERFACE ENABLES FUNCTIONALITY:
julia> using .Interface
julia> double(t)
3×2 Array{Float64,2}:
0.195865 0.0202382
1.30479 0.329098
0.150455 1.7458
After review of the core dev team, it has been decided to investigate a POC for the following proposed re-organization:
Anything in ScientificTypes that is convention-dependent is moved into a new repo, MLJScientificTypes, which will implement the current default MLJ convention. Designating a Tables.jl table as having scitype Table{K}
withK=..
is to be regarded as part of the MLJ convention, which allows ScientificTypes to have zero dependencies and be extremely lightweight.
A barebones model interface is sunk out of MLJBase into a new repo, MLJModelnterface, which has the new ScientificTypes repo as it's sole dependency. Dependency on methods such the Table
constructor form MLJScientificTypes, and matrix
from Tables.jl, are removed with the method outlined in the previous comment.
As a datapoint from ScikitLearnBase.jl, almost every package developer I approached with a PR to support its interface complained about adding another dependency, and I had to insist that it's just 50 lines of code, with no extra dependency. So I think it's very important indeed for wider adoption.
This is now done:
coerce
and autotype
logic)MLJBase has been adapted to work with these new packages.
the latest release (0.3) of MLJLinearModels.jl shows an example of how to interface with MLJ using the MLJModelInterface (only).
We still need to update MLJ, MLJModels and MLJTutorials to use the new machinery but it should be done soon.
The developers who would like their package to be available in MLJ should now implement the MLJModelInterface directly in their package; it will be the only type of interface with MLJ that we will consider for merging in the model registry.
Since all of this is still new, there may be a few wrinkles here and there, your feedback is very welcome.
Following the discussion with JLBoost, I'm not super happy with the status quo.
At the moment, package devs who want to interface with MLJ from their package have to load MLJBase to do so. They may not want to do this essentially because of MLJBase loading time which is now around 4 seconds on my machine with Julia 1.3. It has increased because of the requests of some users to shift more features from MLJ to MLJBase resulting in MLJBase becoming a "big" repo with an increased loading time.
Some users may want to benefit from all that MLJBase offers (e.g. GeoStats) but most may not want to bother with it and leave the integration to MLJ (e.g. JLBoost).
We can encourage package devs to use
Requires
but Requires comes with a bunch of problems in terms of loading time and compat management.So we're back in square one in a way: initially I thought MLJBase was bound to be a very light package with negligible loading time for package devs who wanted to interface with MLJ, but that's not the case (anymore). So it seems there is still a need for such a package.
Suggestion
We could add a package
MLJAbstract
with no functionality, only the super barebone stuff needed for someone to write an interface with MLJ and nothing else. That way includingMLJAbstract
would ideally just take a fraction of a second.In particular, that package would have no dependency itself, not even ScientificTypes (which takes 1.5 sec to load). The gaps would be filled with the
@load
macro.Possible implementation
An interface should define:
fit
,predict
/transform
The model struct should be a subtype of
Deterministic
orProbabilistic
orUnsupervised
so these things should be available inMLJAbstract
The functions can just be declared as empty
function fit end
. Devs could add traits to indicate the type of theirX
andy
infit
(see below) and MLJ would do the conversion if that's the case, otherwise it is assumed that the dev is happy with getting a table and vector/categorical vector.Finally for the metadata, the only non-immediate thing that I can see is
input
,target
andoutput
for which a scientific type has to be provided, of course that would in theory require to loadScientificTypes
but that would defeat the purpose. A way around it is to define things likeThen the metadata can be provided as
Table(Continuous)
which is an Expr with the understanding that theExpr
would be resolved (evaluated) by MLJ.Here is a sketch of how things could work:
Of course a lot of this could be polished etc, but first we need to see whether this is an acceptable way forward or not. If it is, I'm happy to complete the suggestion above with a full POC; eventually then MLJBase would itself import MLJAbstract.