cstjean / ScikitLearn.jl

Julia implementation of the scikit-learn API https://cstjean.github.io/ScikitLearn.jl/dev/

ScikitLearn declared inside a module causes segmentation error #50

Closed ppalmes closed 2 weeks ago

ppalmes commented 5 years ago

Tested in Julia 0.7, Julia 1.0.3, and Julia 1.1.

To recreate the problem, create package A:

    pkg> generate A
    bash> cd A
    pkg> activate .
    pkg> add ScikitLearn
    julia> edit("src/A.jl")

Contents of src/A.jl:

    module A
    using ScikitLearn
    @sk_import linear_model: LogisticRegression

    function testme()
        model = LogisticRegression()
    end

    end

Then:

    julia> using A
    julia> A.testme()    # -> causes segmentation error

However, if you use include instead, it works:

    julia> include("src/A.jl")
    julia> A.testme()    # -> works

ppalmes commented 5 years ago

The purpose of this is to create a wrapper that provides a common API over caret (through RCall) and scikit-learn (through PyCall).

cstjean commented 5 years ago

Thank you for the report. I've hit several segmentation faults myself in Julia 1.1, but this is the first occurrence with ScikitLearn. It might be worth reducing it as much as possible and reporting it to JuliaLang. That said, it's not surprising that @sk_import doesn't work inside a module, although it really should be documented and warned against. The proper way is documented in PyCall. Run @macroexpand @sk_import ... and it should be clear.

ppalmes commented 5 years ago

Thanks for the reply. I resolved it by adding: __precompile__(false) in the main module. It seems that precompiling is the issue and this also happens with PyCall.
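A minimal sketch of that workaround, assuming the package layout from the original report (module A in src/A.jl) and that ScikitLearn and a Python scikit-learn installation are available. The call is placed at the top of the file, the position the Julia manual typically suggests; it turns off precompilation for this module, so the Python object created by @sk_import is never written to a cache file.

```julia
# src/A.jl
# Opt this module out of precompilation, so the global created by
# @sk_import is rebuilt on every load instead of being cached.
__precompile__(false)

module A

using ScikitLearn
@sk_import linear_model: LogisticRegression

# Same function as in the original report.
function testme()
    model = LogisticRegression()
end

end
```

The trade-off is slower loading, since the module is recompiled in every session.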

ppalmes commented 5 years ago

I think the bug appeared when precompilation was made the default, which was changed in this PR: https://github.com/JuliaLang/julia/issues/26282

cstjean commented 5 years ago

Yeah. When precompiling, Julia stores the values of all global variables. @sk_import or @pyimport creates a global that holds a pointer to the loaded Python module. Storing that pointer in the precompiled cache file is a very bad idea: when your module is reloaded, it reads back the old pointer address, which now points to arbitrary memory, hence the segfault. That's why you need the __init__ trick if you want precompilation.
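For the module from the original report, the __init__ trick could be sketched as follows. This follows the pattern described in the PyCall README (PyNULL plus copy! inside __init__), and it assumes PyCall and a Python scikit-learn installation are available. The name linear_model for the global is a choice made here for illustration, not something the thread prescribes.

```julia
module A

using PyCall

# A null PyObject is safe to store in the precompile cache,
# because it does not point at any live Python object.
const linear_model = PyNULL()

# __init__ runs every time the module is loaded (not when it is
# precompiled), so the Python module is imported fresh each session
# and the placeholder is filled in with a valid pointer.
function __init__()
    copy!(linear_model, pyimport("sklearn.linear_model"))
end

# Look the class up through the module object at call time.
function testme()
    model = linear_model.LogisticRegression()
end

end
```

With this layout, `using A` followed by `A.testme()` should no longer depend on a pointer saved from an earlier session.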

Red-Portal commented 4 years ago

For anyone who's still wondering, here's a snippet that worked for me. Note that the loading is done by PyCall, but the rest is done using the ScikitLearn.jl API.

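using ScikitLearn
using PyCall

# Null placeholder; safe to store in the precompile cache.
const mixture = PyNULL()

# __init__ runs each time the module is loaded, so the Python module
# is imported at load time rather than at precompile time.
function __init__()
    copy!(mixture, pyimport("sklearn.mixture"))
end

# Usage (`samples` is your data; `samples[:,:]` passes it as a 2-D array)
    gmm_config = mixture.BayesianGaussianMixture(n_components=10,
                                                 max_iter=1000,
                                                 weight_concentration_prior=1.0)
    model = fit!(gmm_config, samples[:,:])
    # `[:,1,1]` takes the first entry along any trailing dimensions,
    # giving one value per mixture component.
    w = model.weights_[:,1,1]
    μ = model.means_[:,1,1]
    σ = sqrt.(model.covariances_[:,1,1])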