JuliaAI / MLJ.jl

A Julia machine learning framework
https://juliaai.github.io/MLJ.jl/

Create MLJAbstract package only for interfacing with MLJ #417

Closed tlienart closed 4 years ago

tlienart commented 4 years ago

Following the discussion with JLBoost, I'm not super happy with the status quo.

At the moment, package devs who want to interface with MLJ from their package have to load MLJBase to do so. They may not want to do this, essentially because of MLJBase's loading time, which is now around 4 seconds on my machine with Julia 1.3. It has increased because some users asked to shift more features from MLJ to MLJBase, turning MLJBase into a "big" repo with a longer loading time.

Some users may want to benefit from all that MLJBase offers (e.g. GeoStats) but most may not want to bother with it and leave the integration to MLJ (e.g. JLBoost).

We could encourage package devs to use Requires, but Requires comes with its own problems in terms of loading time and compat management.

So, in a way, we're back to square one: initially I thought MLJBase was bound to be a very light package with negligible loading time for package devs who wanted to interface with MLJ, but that's not the case (anymore). So it seems there is still a need for such a package.

Suggestion

We could add a package MLJAbstract with no functionality, only the bare-bones definitions needed for someone to write an interface with MLJ and nothing else. That way, loading MLJAbstract would ideally take just a fraction of a second.

In particular, that package would have no dependencies itself, not even ScientificTypes (which takes 1.5 seconds to load). The gaps would be filled with the @load macro.

Possible implementation

An interface should define:

  1. a valid model struct + constructor
  2. a few functions: fit, predict/transform
  3. metadata

The model struct should be a subtype of Deterministic, Probabilistic or Unsupervised, so these types should be available in MLJAbstract.

The functions can just be declared empty, as in function fit end. Devs could add traits to indicate the type of their X and y in fit (see below), in which case MLJ would do the conversion; otherwise it is assumed that the dev is happy with getting a table and a vector/categorical vector.

Finally, for the metadata, the only non-immediate thing I can see is input, target and output, for which a scientific type has to be provided. In theory that would require loading ScientificTypes, which would defeat the purpose. A way around it is to define things like

const Continuous = :Continuous
const Count = :Count
Table(T::Symbol) = :(MLJBase.Table(eval($T)))

Then the metadata can be provided as Table(Continuous), which is an Expr, with the understanding that the Expr would be resolved (evaluated) by MLJ.
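To make that concrete, here is a hedged, self-contained sketch of the symbol trick; FullTypes and LightTypes are hypothetical stand-ins for MLJBase and MLJAbstract, and Table here is a toy type, not the real scitype:

```julia
# Hypothetical stand-in for the heavy package where the real types live.
module FullTypes
    struct Table{T} end
    Table(T) = Table{T}()           # toy constructor standing in for MLJBase.Table
    const Continuous = Float64      # stand-in for the real scientific type
end

# Hypothetical stand-in for the dependency-free interface package.
module LightTypes
    const Continuous = :Continuous  # just a symbol, so nothing heavy is loaded
    # Returns an unevaluated Expr referencing the heavy package's names.
    TableOf(T::Symbol) = :(FullTypes.Table(FullTypes.$T))
end

ex = LightTypes.TableOf(LightTypes.Continuous)  # just an Expr, cheap to build
resolved = eval(ex)                             # resolved later, by the full stack
```

The interface package only ever manipulates symbols and Exprs; the eval happens on the MLJ side, where the real types are available.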

Here is a sketch of how things could work:

module MLJAbstract

    # of course we'd get rid of that; it's just so that I can import
    # the type system here
    import MLJBase
    import MLJBase: Model, Supervised, Unsupervised, Probabilistic, Deterministic

    # ## Type system NOTE: would be copied from MLJBase
    #
    # abstract type MLJType end
    # abstract type Model <: MLJType end
    #
    # abstract type Supervised   <: Model end
    # abstract type Unsupervised <: Model end
    #
    # abstract type Probabilistic <: Supervised end
    # abstract type Deterministic <: Supervised end

    ## Function structure

    export fit, predict, transform

    function fit end
    function predict end
    function transform end

    # and rest like clean!, fitted_params etc

    ## Metadata tricks

    const Continuous    = :Continuous
    const Count         = :Count
    const Multiclass    = :Multiclass
    const OrderedFactor = :OrderedFactor

    # etc...

    Union(T::Symbol...) = :(Union{eval.($T)...})
    Table(T::Symbol)    = :(MLJBase.Table(eval($T)))
    Vector(T::Symbol)   = :(AbstractVector{eval($T)})

    # meta

    input_scitype(m::Model)  = nothing
    target_scitype(m::Model) = nothing

    # if true, MLJBase.matrix(X) called before passing data to model
    matrix_input(m::Model)       = false
    # if false, MLJBase.int(y) called before passing target to model
    categorical_target(m::Model) = true

end # module

module ThePackage
    import ..MLJAbstract

    const MLJ = MLJAbstract

    ## STRUCT + CONSTR

    mutable struct LinReg <: MLJ.Deterministic
        fit_intercept::Bool
    end

    LinReg(;fit_intercept=true) = LinReg(fit_intercept)

    ## FIT + PREDICT/TRANSFORM

    function MLJ.fit(m::LinReg, verbosity::Int, X, y)
        if m.fit_intercept
            X = hcat(X, ones(size(X, 1)))
        end
        coefs = X \ y
        return coefs, nothing, NamedTuple()
    end

    function MLJ.predict(m::LinReg, coefs, Xnew::Matrix)
        if m.fit_intercept
            Xnew = hcat(Xnew, ones(size(Xnew, 1)))
        end
        return Xnew * coefs
    end

    ## META (using foo functions to simplify things)
    # the *_scitype traits give the expected machine types, nothing if aligned with MLJ.

    MLJ.input_scitype(::LinReg) = MLJ.Table(MLJ.Continuous)
    # union is a bit silly but it's for the example
    MLJ.target_scitype(::LinReg) = MLJ.Vector(Union(MLJ.Continuous, MLJ.Count))

    MLJ.matrix_input(::LinReg) = true
end

# TESTING stuff

using MLJBase

MA = MLJAbstract
P  = ThePackage

lr = P.LinReg()

function machine2(m::M, args...) where M <: MA.Model
    # see if need to convert things
    convinput  = MA.matrix_input(m)
    convtarget = M <: MA.Supervised && !MA.categorical_target(m)
    # otherwise just use standard machine call
    if !convinput && !convtarget
        return machine(m, args...)
    end
    # need to convert stuff, NOTE: can add logic to check scitype here
    X = args[1]
    if convinput
        # NOTE: no transpose here, leave that up to package dev in their
        # interface
        X = MLJBase.matrix(X)
    end
    if M <: MA.Supervised
        y = args[2]
        if convtarget
            y = MLJBase.int(y)
        end
        return MLJBase.Machine(m, X, y)
    end
    return MLJBase.Machine(m, X)
end

X, y = MLJBase.make_regression()

MA.fit(lr, 1, MLJBase.matrix(X), y)

mach = machine2(lr, X, y)

Of course a lot of this could be polished etc, but first we need to see whether this is an acceptable way forward or not. If it is, I'm happy to complete the suggestion above with a full POC; eventually then MLJBase would itself import MLJAbstract.

juliohm commented 4 years ago

eventually then MLJBase would itself import MLJAbstract

Thank you @tlienart for the careful design. At first glance, this approach works nicely and fits all devs' needs. Basically, MLJModelInterface.jl (my proposed name) would be a package that is imported by MLJBase.jl and re-exported, so that devs interested in implementing a model would depend on MLJModelInterface.jl, and devs interested in deeper integration with the stack would depend on MLJBase.jl as is currently done. That is the idea, right?

The model struct should be a subtype of Deterministic or Probabilistic or Unsupervised so these things should be available in MLJAbstract

I think we can do even better now that we have this opportunity to rethink the model interface. We could consider not forcing devs to subtype their own types with types defined in the stack. Perhaps it is possible to safely exploit Julia's trait system so that we get extensibility without inheritance. There is a nice blog post (which I can't find now) by @oxinabox explaining how to achieve generics in Julia with traits. It is something to think about. I am considering it seriously for my own types in GeoStats.jl, for example, because I want the methods implemented there to work on any domain type not implemented by myself, and after the fact. By just implementing the right methods, users can send their domain types, models, etc. to the stack without having to build wrapper types just to inherit from a type defined in our packages.

More specifically, it seems to me that the types Supervised, Unsupervised, Probabilistic, Deterministic are actually traits in a more general and extensible trait system. Unless they are strictly needed to facilitate dispatch internally in MLJ, we could consider making them traits like:

issupervised(model::MLJModel) = false
isprobabilistic(model::MLJModel) = false
...

I actually did that in GeoStatsBase.jl here: https://github.com/juliohm/GeoStatsBase.jl/blob/master/src/learning/traits.jl

Now someone who has a type already implemented with a non-compatible interface can just add these two lines and have the model integrated with MLJ. Makes sense? There is no need to create a wrapper type to inherit from Supervised or Probabilistic. Also, since Julia doesn't support multiple inheritance, this gives us more expressivity to categorize models according to multiple criteria.
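A minimal sketch of what this could look like in practice (Holy-traits style; all names here are hypothetical, not MLJ API):

```julia
# Trait functions with conservative fallbacks; any type can opt in later.
issupervised(::Any) = false
isprobabilistic(::Any) = false

# A pre-existing model type that subtypes nothing from the MLJ stack:
struct MyRegressor end

# Opting in is one line per trait; no wrapper type is needed.
issupervised(::MyRegressor) = true

# Downstream code can branch (or dispatch) on trait values:
modelkind(m) = issupervised(m) ? :supervised : :unsupervised
```

Because the fallbacks are defined on Any, a model author can opt in after the fact without touching their type hierarchy.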

oxinabox commented 4 years ago

This is the traits post you were looking for https://invenia.github.io/blog/2019/11/06/julialang-features-part-2/

ablaom commented 4 years ago

Before committing us to a course of action I should like to give this some more thought. However, my initial reaction to the proposal is positive, with the following comments:

tlienart commented 4 years ago

Before committing us on a course of action I should like to give this some more thought.

definitely

@tlienart I don't like the idea of a "shadow" ScientificTypes within the interface package; I think there should be a cleaner solution. How about we instead split ScientificTypes into ScientificTypes and ScientificTypesBase? The base package would just have the type definitions and the functionality that is independent of any particular convention. This removes all the heavy lifting and will leave Tables.jl as its sole dependency. Tables.jl takes 0.6 seconds to load on my machine. (Currently ScientificTypes takes 2.1 seconds.) The new MLJModelInterface package imports ScientificTypesBase, which is all it would need.

IMO even loading Tables is too much. I understand what you're saying, but it seems too much to ask devs to load Tables just because it makes our lives easier, when they otherwise don't work with tables in their package but, for instance, just with bare matrices (the case of MLJLM).

I think the "shadow typing" would be pretty easy to maintain and, inherently, not very different from having a "ScientificBase" package (except much faster as it doesn't require Tables). It may seem a bit ad-hoc but it seems like a small price to pay for increased adoption by devs.

(edit) @tlienart I'm assuming MLJBase/src/model_traits.jl, which defines stubs/fallbacks for the model traits, would go into MLJModelInterface under your proposal?

Not necessarily. We just need functionality to allow the dev to declare the traits in some prescribed way, assuming it will be interpreted by MLJBase; whatever works, as long as it's very lightweight.

Finally, we could offer a MLJBase.test_implementation(m) that checks everything is fine, which devs would put in their tests (where they would, actually, load MLJBase, but only there).
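A rough sketch of what such a checker might look like; test_implementation is only a proposal here, and everything below (the stand-in types, the checks) is illustrative rather than actual MLJBase API:

```julia
# Minimal stand-ins for the interface's abstract types and functions.
abstract type Model end
abstract type Deterministic <: Model end
function fit end

# A dev's model implementing the interface:
mutable struct MyReg <: Deterministic
    lambda::Float64
end
fit(m::MyReg, verbosity::Int, X, y) = (nothing, nothing, NamedTuple())

# Sketch of a checker devs could call from their test suite.
function test_implementation(m::Model)
    M = typeof(m)
    hasmethod(fit, Tuple{M,Int,Any,Any}) || error("no fit method for $M")
    return true
end
```

In a package's test suite this would run under a full MLJBase load, so the checker itself can be as heavy as it needs to be.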

ablaom commented 4 years ago

Side note: implementers of probabilistic classifiers need Distributions, because UnivariateFinite, defined in MLJBase, subtypes Distributions.Distribution. This package is quite large, but also pretty much essential for any serious statistical software that does not define its own distributions. I'm guessing it is considerably bigger than Tables.

ablaom commented 4 years ago

@tlienart I have a general trick to remove dependencies in any interface, when these dependencies are needed only for functions (not types). In our case this suffices, because the only types that need to be referenced in MLJModelInterface are the ScientificTypes, which we can move into a ScientificTypesTypesOnlyPackage, which will have trivial overhead.

Here's the general trick. First, define a mutable flag INTERFACE_FUNCTIONALITY in the lightweight interface package - let's call it LightInterface.jl - which declares whether we are in Dummy() or Live() mode, and set it there to Dummy().

in LightInterface.jl:

abstract type Mode end
struct Dummy <: Mode end
struct Live <: Mode end
const  INTERFACE_FUNCTIONALITY = Ref{Mode}(Dummy())

Any function we want from some dependent package - matrix from Tables.jl, say - we define right there in LightInterface.jl but dispatch it on the flag value, giving it a dummy return value when the flag is Dummy().

still in LightInterface.jl:

matrix(t) = matrix(t, INTERFACE_FUNCTIONALITY[])
matrix(t, ::Dummy) = error("Only `LightInterface` loaded. Do `import Interface`.")

Any client wanting to implement the interface can now import LightInterface and make calls like matrix(t) (and t could be anything). Of course executing the client's code will fail at this point. However, we fix this in the full interface package, let's call it Interface.jl.

First, the __init__ function of Interface.jl resets the flag to Live(). Since Interface imports LightInterface, the flag will never be Dummy() so long as Interface has been imported. Second, we complete the definition of matrix in the Live() case, now that Tables.jl is available (Tables.jl being a dependency of Interface.jl).

in Interface.jl:

using ..LightInterface
import Tables  # the elephant LightInterface didn't want

function __init__()
    LightInterface.INTERFACE_FUNCTIONALITY[] = LightInterface.Live()
end

LightInterface.matrix(t, ::Live) = Tables.matrix(t)

Now, if the client package and Interface.jl are both loaded, it can actually be used.

What do you think?

Here's a proof of concept:


THE LIGHT INTERFACE

module LightInterface

export Dummy, Live
export matrix

abstract type Mode end
struct Dummy <: Mode end
struct Live <: Mode end
const  INTERFACE_FUNCTIONALITY = Ref{Mode}(Dummy())

matrix(t) = matrix(t, INTERFACE_FUNCTIONALITY[]) 
matrix(t, ::Dummy) =
    error("Only `LightInterface` loaded. Do `import Interface`.")

end

MODULE IMPLEMENTING THE LIGHT INTERFACE

module Client

export double

using ..LightInterface

double(t) = 2*matrix(t)

end

CAN LOAD CLIENT CODE; JUST CAN'T USE IT

julia> using .Client
julia> t = (x1=rand(3), x2=rand(3)) # a table
julia> double(t)
ERROR: Only `LightInterface` loaded. Do `import Interface`.

THE FULL INTERFACE

module Interface

export matrix

using ..LightInterface
import Tables  # the elephant LightInterface didn't want

# In a package this would be in the __init__ function:
LightInterface.INTERFACE_FUNCTIONALITY[] = LightInterface.Live()

LightInterface.matrix(t, ::Live) = Tables.matrix(t)

end

IMPORTING THE FULL INTERFACE ENABLES FUNCTIONALITY:

julia> using .Interface
julia> double(t)
3×2 Array{Float64,2}:
 0.195865  0.0202382
 1.30479   0.329098 
 0.150455  1.7458  
ablaom commented 4 years ago

After review by the core dev team, it has been decided to investigate a POC for the following proposed re-organization:

cstjean commented 4 years ago

As a datapoint from ScikitLearnBase.jl: almost every package developer I approached with a PR to support its interface complained about adding another dependency, and I had to insist that it's just 50 lines of code with no extra dependencies. So I think this is indeed very important for wider adoption.

tlienart commented 4 years ago

This is now done:

MLJBase has been adapted to work with these new packages.

the latest release (0.3) of MLJLinearModels.jl shows an example of how to interface with MLJ using MLJModelInterface only.

Next steps

We still need to update MLJ, MLJModels and MLJTutorials to use the new machinery but it should be done soon.

The developers who would like their package to be available in MLJ should now implement MLJModelInterface directly in their package; it will be the only type of interface with MLJ that we will consider for merging into the model registry.

Since all of this is still new, there may be a few wrinkles here and there, your feedback is very welcome.