cstjean / ScikitLearn.jl

Julia implementation of the scikit-learn API https://cstjean.github.io/ScikitLearn.jl/dev/

Scalers #88

Open leonardtschora opened 4 years ago

leonardtschora commented 4 years ago

Hi everyone, I'm starting to work with ScikitLearn in Julia.

As I understand it, a few models and tools are currently implemented in Julia, and the rest of the code consists of bindings to Python functions. All of scikit-learn's functions can be used via the @sk_import macro.

If I'm not mistaken, the scalers have not been ported yet to Julia.

I made this quick workaround to implement my own Scaler class in Julia, and it seems to be much faster than the Python one (which is why we are using Julia?). My Scaler class is far from complete (it does not support scikit-learn's keyword arguments), but it seems that such scalers already exist in JuliaML.

Is there a reason why Scalers (and the question could be extended to a lot of other tools) are not currently in ScikitLearn.jl? Thanks again for your time.

```julia
using Statistics, ScikitLearn, ScikitLearnBase, BenchmarkTools
import ScikitLearnBase: fit!, transform, inverse_transform
@sk_import preprocessing: StandardScaler

"""
A minimal pure-Julia standard scaler. Standard deviations below `epsilon`
are clamped to `epsilon` to avoid dividing by (almost) zero.
"""
mutable struct JStandardScaler <: BaseEstimator
    epsilon::Float64

    # Fitted parameters (1 × n_features); left undefined until `fit!` is called.
    mean_::Matrix{Float64}
    std_::Matrix{Float64}
    real_std_::Matrix{Float64}
    JStandardScaler(; epsilon=0.001) = new(epsilon)
end

function fit!(model::JStandardScaler, X, y=nothing)
    model.mean_ = mean(X, dims=1)
    model.real_std_ = std(X, dims=1)
    # Clamp each standard deviation from below at `epsilon`.
    model.std_ = map(model.real_std_) do x
        x > model.epsilon && return x
        return model.epsilon
    end
    return model
end

function transform(model::JStandardScaler, X)
    return @. (X - model.mean_) / model.std_
end

function inverse_transform(model::JStandardScaler, X)
    return @. X * model.std_ + model.mean_
end

n = Int(10e6)  # note: 10e6 is 10^7 rows
X = rand(Int, n, 12)

# Sanity check: the round trip should recover X for both scalers.
julia_scaler = JStandardScaler()
fit!(julia_scaler, X)
X_ = transform(julia_scaler, X)
X__ = inverse_transform(julia_scaler, X_)
@assert isapprox(X, X__)

python_scaler = StandardScaler()
fit!(python_scaler, X)
X_ = transform(python_scaler, X)
X__ = inverse_transform(python_scaler, X_)
@assert isapprox(X, X__)

# Benchmarks. X_ is local to each block, so it is not interpolated with `$`
# (interpolating it would pick up the global X_ defined above instead).
julia_scaler = JStandardScaler()
@btime begin
    fit!($julia_scaler, $X)
    X_ = transform($julia_scaler, $X)
    X__ = inverse_transform($julia_scaler, X_)
end

python_scaler = StandardScaler()
@btime begin
    fit!($python_scaler, $X)
    X_ = transform($python_scaler, $X)
    X__ = inverse_transform($python_scaler, X_)
end
```

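
The missing keyword arguments could be added along the following lines. This is only a rough sketch, not part of ScikitLearn.jl: the type `KwStandardScaler` and the functions `fit_scaler!`, `scale_features` and `unscale_features` are hypothetical names chosen for illustration, and the `with_mean` / `with_std` keywords are modelled on scikit-learn's `StandardScaler`. It depends only on `Statistics` so that it runs on its own; to plug it into ScikitLearn.jl one would presumably extend `ScikitLearnBase.fit!`, `transform` and `inverse_transform` instead, as in the code above.

```julia
using Statistics

# Hypothetical type for illustration; the keyword names mirror
# scikit-learn's StandardScaler(with_mean=..., with_std=...).
mutable struct KwStandardScaler
    with_mean::Bool
    with_std::Bool
    mean_::Matrix{Float64}   # 1 × n_features, filled in by fit_scaler!
    scale_::Matrix{Float64}  # 1 × n_features, filled in by fit_scaler!
    KwStandardScaler(; with_mean=true, with_std=true) =
        new(with_mean, with_std, zeros(1, 0), ones(1, 0))
end

function fit_scaler!(model::KwStandardScaler, X::AbstractMatrix)
    p = size(X, 2)
    model.mean_ = model.with_mean ? mean(X, dims=1) : zeros(1, p)
    if model.with_std
        # corrected=false divides by n (population standard deviation);
        # Julia's default divides by n - 1.
        s = std(X, dims=1, corrected=false)
        # Constant columns get a scale of 1 so nothing is divided by zero.
        model.scale_ = map(x -> x > 0 ? x : 1.0, s)
    else
        model.scale_ = ones(1, p)
    end
    return model
end

scale_features(model::KwStandardScaler, X) = (X .- model.mean_) ./ model.scale_
unscale_features(model::KwStandardScaler, Z) = Z .* model.scale_ .+ model.mean_

# Usage: the second column is constant and is simply centred to zero.
X = [1.0 10.0; 2.0 10.0; 3.0 10.0]
scaler = fit_scaler!(KwStandardScaler(), X)
Z = scale_features(scaler, X)
@assert isapprox(unscale_features(scaler, Z), X)
```

One design point worth noting: the scaler in the original code uses Julia's default `std`, which divides by n - 1, whereas scikit-learn's `StandardScaler` divides by n. The two therefore produce slightly different scaled values, although both pass the round-trip check.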
cstjean commented 4 years ago

> Is there a reason why Scalers (and the question could be extended to a lot of other tools) are not currently in ScikitLearn.jl?

Just lack of time! If you would like to contribute them, that would be a very nice PR.

Meanwhile, as happy as I am to see interest in ScikitLearn.jl... Have you checked out MLJ.jl? It is very actively developed. Unless someone steps up to push it further, ScikitLearn.jl will continue its life as a "gateway package", easing Python users into a new ecosystem.

leonardtschora commented 4 years ago

> If you would like to contribute them, that would be a very nice PR.

I don't know yet if I will have the time to make a nicer Scaler object. For now, I just need the very basic one.

I checked out MLJ, but it seems like just another wrapper/interface for third-party machine learning packages. If it is more efficient performance-wise I might switch to it, but for now I prefer using ScikitLearn's algorithms.

Thanks a lot for your help. I might ask new questions soon.