JuliaAI / MLJBase.jl

Core functionality for the MLJ machine learning framework
MIT License
160 stars 45 forks source link

Implement the MLJ model API without needing to depend on external dependencies such as CSV.jl, CategoricalArrays.jl, etc. #19

Closed DilumAluthge closed 5 years ago

DilumAluthge commented 5 years ago

As far as I can tell, in order for me to implement the MLJ model API, I need to import MLJBase.jl.

While MLJBase.jl is a more lightweight dependency than MLJ.jl, it still does have quite a few dependencies. I would rather not have to depend on CSV.jl, CategoricalArrays.jl, Tables.jl, etc. in order to be able to implement the MLJ model API.

Take this modified version of the simple deterministic regressor example:

import MLJBase
using LinearAlgebra

mutable struct MyRegressor <: MLJBase.Deterministic
    lambda::Float64
end
MyRegressor(; lambda=0.1) = MyRegressor(lambda)

# fit returns coefficients minimizing a penalized rms loss function:
function MLJBase.fit(model::MyRegressor, X, y)
    fitresult = (X'X - model.lambda*I)\(X'y)  # the coefficients
    return fitresult
end

# predict uses coefficients to make new prediction:
MLJBase.predict(model::MyRegressor, fitresult, Xnew) = Xnew*fitresult

I can define this entire model without using any dependencies. Unfortunately, because I need to import MLJBase.jl, I still end up depending on all of MLJBase.jl's dependencies.

The JuliaData people have solved this problem by creating the DataAPI.jl package. DataAPI.jl is a tiny package that has no dependencies and provides the namespace for the JuliaData API.

Would you be open to creating a similar MLJapi.jl package? The package would be very simple. It would have no dependencies, and its only content would consist of type definitions and function stubs, for example:

abstract type MLJType end
abstract type Model <: MLJType end
abstract type Supervised <: Model end 
...

function fit end
function update end
function predict end
...

MLJ.jl, MLJBase.jl, MLJModels.jl, etc. would import MLJapi.jl and extend its functions.

This would allow other package authors to implement the MLJ model API without needing to depend on all of MLJBase's dependencies.

Would you be willing to adopt this approach? If so, I'd be more than happy to help create the MLJapi.jl package.

giordano commented 5 years ago

This (long) discussion can help understand the context of where DataAPI originated from: https://discourse.julialang.org/t/proposal-for-sharedfunctions-jl-package-for-optional-dependency-management/23526

DilumAluthge commented 5 years ago

cstjean's comment in that discussion thread basically sums up what I want to do:

I do believe that this issue ought to be solved in base, though. There should be a way to define a LightGraphs.cartesian_product method without importing LightGraphs.

I want to define methods for MLJBase.fit, MLJBase.predict etc. without needing to import MLJBase.jl and thus require all of its dependencies.

Since there is currently no support in base Julia for this, I think that the next best approach would be to create an MLJapi.jl package with no dependencies, thus allowing me to define methods for MLJapi.fit, MLJapi.predict etc.

Basically I want to be able to implement the MLJ model API without needing to depend on any non-standard library dependencies. Is there another way that this would be possible?

DilumAluthge commented 5 years ago

As another example, https://github.com/cstjean/ScikitLearnBase.jl has no dependencies, which allows package developers to implement the ScikitLearn.jl API without needing to depend on any additional dependencies.

ablaom commented 5 years ago

@DilumAluthge Thanks for your post and the offer of help. Happy to see how we can improve things dependency wise.

I don't think a separate MLJapi helps. MLJBase is supposed to do more-or-less what you describe. Apart from CSV and Distributions, MLJBase is already fairly lightweight. If I remove the standard library packages I have:

CSV

CategoricalArrays - Lightweight and any model dealing with categorical data needs to load this

ColorTypes - FixedPointNumbers is the only dependency, which itself has none

Distributions - Unfortunately, a large package but dependencies don't look too bad. Any model that makes probabailistic predictions will need to load this

StatsBase - We extend the fit, predict and fit! from here. Hard to imagine a data science project that will not import this.

Tables - Very lightweight and essential because the "X" in the fit and predict extended by a new model is a Tables.jl generic table (not a matrix as your model assumes). Importing this means we don't need to import any heavyweight tabular data package.

We can certainly remove CSV. It is only there to provide a few built-in data sets for testing purposes. Before dumping it we must refactor a lot of model implementation test code. This code lives in MLJ/test files and MLJModels/test files. The MLJBase function calls to replace are:

load_boston()

load_crabs()

load_iris()

load_reduced_ames()

load_ames() don't think this is actually used

datanow()

Note that all of these except the last return the data wrapped as a task, but in the tests that use these methods, only the raw data X, y is actually used.

We could replace the above with datasets from RDatasets, which already has iris and crabs, I think. We would need to choose replacements for the others. I don't think we want a separate MLJDataseets package!

@DilumAluthge How keen would you be on making a PR refactoring the test code along these lines?

DilumAluthge commented 5 years ago

@ablaom Instead of removing CSV entirely, how about making it an optional dependency, and loading the code in src/datasets.jl conditionally with Requires? Requires is a much lighter dependency than CSV.

I've made the following pull requests:

ablaom commented 5 years ago

Actually I think that's a good idea. I have a slight misgiving as Mike Innes (Requires author) once told me "Requires is really a hack" and advised against using it as a permanent solution. But you have already done the work and the refactoring option could be done later.

Obviously the travis tests on MLJ and MLJModels do not see the changes proposed on MLJBase. Have you tested the three updated repos work together locally?

DilumAluthge commented 5 years ago

Yep, I just checked out the da/csv-optional branch on all three repos locally and ran the tests, and the package tests pass for all three packages!

For what it's worth, I think that Requires isn't considered hacky anymore. Thanks to refactoring by Mike Innes and Tim Holy, and the addition of Base.package_callbacks into base Julia, I think Requires is now a "clean and straightforward solution" (to quote the discussion in https://github.com/JuliaLang/julia/issues/2025).

ablaom commented 5 years ago

Great!

So I will:

Edited: Not necessary to introduce new [compat] to MLJ and MLJModels. Will work fine with old version of MLJBase after their respective PR's are pulled.