cstjean / ScikitLearn.jl

Julia implementation of the scikit-learn API https://cstjean.github.io/ScikitLearn.jl/dev/

Writing your own estimators in Julia #87

Closed leonardtschora closed 4 years ago

leonardtschora commented 4 years ago

Hi everyone, this post describes an issue along with its solution, but I would first like to share it with other people and, second, get insights about what I did wrong in my code.

As the title says, this is about creating scikit-learn estimators using exclusively Julia code.

Why?

Because I have strong feature extraction tools and models developed in Julia. I spent time optimizing the computations and I don't want to fall back to Python after that. However, I need scikit-learn's tools for model selection (grid search, CV, etc.) and pipelines. Estimators are particularly handy for my use case, and I need to be able to wrap my models and feature extraction tools in estimators to use them efficiently.

How?

There were two solutions:

- Develop the estimators in Python and use pyjulia to call my Julia core code.
- Use PyCall, and especially the @pydef macro, to write Python classes in Julia.

I selected the second option because it keeps my entire project free of any code written in Python.

The code

The main difficulties arose (at least I suppose they did) from passing data back and forth between Julia and Python. For instance, the get_params() and set_params() functions will not work properly (I already described this issue here). You can check the LeakyTransformer class below to experience this error.

To solve most of these problems, I created a boilerplate class for all Julia-defined estimators: JuliaEstimator, which redefines get_params() and set_params() and provides a default constructor.

Finally, you can try the FunctionTransformer class, which wraps a Julia function in an estimator and can be fully used in grid searches and pipelines.

using ScikitLearn, PyCall, Statistics
using ScikitLearn.Pipelines: Pipeline
using ScikitLearn.GridSearch: GridSearchCV

@sk_import base: BaseEstimator
@sk_import base: TransformerMixin
@sk_import base: RegressorMixin

# Define a function in Julia to put into a ScikitLearn estimator
f(X, a, b) = @. a * X + b

"""
Boilerplate code for all Julia-defined sklearn estimators
Somehow get_params and set_params don't work properly with Julia defined estimators
so we have to redefine them.

The solution is quite ugly because each input parameters are stored twice. 
The reason for that are :
    -Grid Search CV does not allow for positional or regular arguments, only keyword arguments
     (this migth be a python to julia problem).
    -There is no way in Julia to retrieve all passed arguments in a function at once. 
     using kwargs allow this and the init function becomes less tedious and works for all 
     estimator signatures.
    -It makes the get_params method realy easy to implement since it's just returning the parmaeter dict.

These choices are questionable and a good alternative is to manually define and set all
arguments in the __init__ function and to manulay define get_params and set_params
for every julia defined estimator. 
"""
@pydef mutable struct JuliaEstimator <: BaseEstimator
    function __init__(self; kwargs...)
        self.__kwargs = deepcopy(kwargs)
        for (k, v) in kwargs
            setproperty!(self, k, v)
        end
    end

    function get_params(self; deep=false)
        return self.__kwargs
    end

    function set_params(self; kwargs...)
        new_kwargs = deepcopy(self.__kwargs)
        for (k, v) in kwargs
            setproperty!(self, k, v)            
            new_kwargs[String(k)] = v
        end
        self.__kwargs = new_kwargs
        return self
    end
end

jestimator = JuliaEstimator(; a=1, b=2, c=3, d=4)
jestimator.get_params()
jestimator.set_params(; a=10, b=20)
jestimator.get_params()

"""
A dumb estimator that:
    learns the mean of the data
    apply a linear transformation to the data
    predict the first column of the tranformed data

No sens at all, just to check if everything works.        
"""
@pydef mutable struct FunctionTransformer <: (RegressorMixin, TransformerMixin, JuliaEstimator)
    function fit(self, X, y=nothing)
        self.a_ = mean(X)
        return self
    end

    function transform(self, X)
        return f(X, self.a_, self.b)
    end

    function predict(self, X)
        return self.transform(X)[:, 1] 
    end
end

# Test all sklearn functionalities
function test(transformer)
    # Try the transformer
    X = zeros(Int, 30, 2)
    y = zeros(Int, 30)
    fit!(transformer, X)
    transform(transformer, X)
    fit_transform!(transformer, X)

    # Try it in a pipeline
    pipe = Pipeline([("Dummy", transformer)])
    fit!(pipe, X)
    transform(pipe, X)

    # Try getting params
    transformer.get_params()
    transformer.set_params(; b=2)

    # Try predict & score
    transformer.predict(X)
    transformer.score(X, y)

    # Try putting it in a grid search
    grid = GridSearchCV(FunctionTransformer(), Dict("b" => [-2, 3, 0, 40]))
    fit!(grid, X, y)
    grid.best_params_
end

transformer = FunctionTransformer(; b=-1)
test(transformer)

@pydef mutable struct LeakyTransformer <: (RegressorMixin, TransformerMixin, BaseEstimator)
    function __init__(self; b=0)
        self.b = b
    end

    function fit(self, X, y=nothing)
        self.a_ = mean(X)
        return self
    end

    function transform(self, X)
        return f(X, self.a_, self.b)
    end

    function predict(self, X)
        return self.transform(X)[:, 1] 
    end
end

ptransformer = LeakyTransformer(; b=-1)
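# The following call demonstrates the get_params problem described above:
# without the JuliaEstimator boilerplate it does not behave as expected.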
ptransformer.get_params()

Conclusion

I have successfully implemented and used an estimator containing Julia code. Some of the design choices I made are highly questionable (the kwargs handling, for instance), but I hope that at least this bit of code will provide a solid example of how to create sklearn estimators in Julia.

cstjean commented 4 years ago

Thanks for the write-up! Maybe I'm misunderstanding, or maybe the docs are not clear, but why didn't you just implement the ScikitLearnBase interface? If you don't need Python models, then your code can really be 100% Python-free, that was the goal.

leonardtschora commented 4 years ago

Hmm, indeed it seems much simpler to just use ScikitLearnBase.jl... I don't recall seeing ScikitLearnBase.jl being mentioned in the ScikitLearn.jl documentation.

For the record, the above code translates to this (which is way less ugly):

import ScikitLearnBase: fit!, transform, predict
using ScikitLearnBase: get_params, set_params!, @declare_hyperparameters, BaseRegressor, score
using ScikitLearn.Pipelines: Pipeline
using ScikitLearn.GridSearch: GridSearchCV

f(X, a, b) = @. a * X + b
mutable struct SKBEstimator <: BaseRegressor
    a::Int
    b_::Float64
    SKBEstimator(; a=0) = new(a)
end

@declare_hyperparameters(SKBEstimator, [:a])
function fit!(model::SKBEstimator, X, y)
    model.b_ = sum(X) / length(X)
    return model
end

function transform(model::SKBEstimator, X)
    f(X, model.b_, model.a)
end

function predict(model::SKBEstimator, X)
    return transform(model, X)[:, 1]
end

X = zeros(Int, 30, 2)
y = zeros(Int, 30)

skbt = SKBEstimator()
fit!(skbt, X, y)
transform(skbt, X)

# Try it in a pipeline
pipe = Pipeline([("Dummy", skbt)])
fit!(pipe, X, y)
transform(pipe, X)

# Try getting params
get_params(skbt)
set_params!(skbt; a=2)
get_params(skbt)

# Try score & predict
predict(skbt, X)
score(skbt, X, y)

# Try putting in a grid search
grid = GridSearchCV(SKBEstimator(), Dict(:a => [-2, 3, 0, 40]))
fit!(grid, X, y)
grid.best_params_

And now a quick question: what is the purpose of "inheriting" from BaseClassifier vs defining is_classifier? And more generally, which default methods do BaseEstimator, BaseRegressor and BaseClassifier implement? (I think I spotted the score function for BaseRegressor.)

Thanks a lot

cstjean commented 4 years ago

Yeah, you don't particularly need to inherit from these classes.

leonardtschora commented 4 years ago

What do you mean? Do these classes actually do nothing?

cstjean commented 4 years ago

Well, they do what you said: they specify is_classifier and score to give good defaults. If you can, inherit; but if you can't (because you want to inherit from something else), then it's not a big problem to write is_classifier(::YourEstimator) yourself.
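
For example, here is a minimal sketch (the type names are made up) of declaring the trait by hand instead of inheriting:

import ScikitLearnBase
using ScikitLearnBase: @declare_hyperparameters

# A made-up classifier that already has another supertype,
# so it cannot inherit from BaseClassifier.
abstract type MyModelFamily end

mutable struct MyClassifier <: MyModelFamily
    a::Int
    MyClassifier(; a=0) = new(a)
end

@declare_hyperparameters(MyClassifier, [:a])

# Declare the trait manually instead of getting it by inheritance.
ScikitLearnBase.is_classifier(::MyClassifier) = true

The other default you give up is score, so you would also define a score method yourself if you need it (e.g. in a grid search).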

leonardtschora commented 4 years ago

Okay thanks :)