cstjean / ScikitLearn.jl

Julia implementation of the scikit-learn API https://cstjean.github.io/ScikitLearn.jl/dev/

GridSearchCV does not return training score #66

Open CBongiova opened 4 years ago

CBongiova commented 4 years ago

Hi,

I am trying to get the training score back from GridSearchCV. Looking at the scikit-learn documentation, I saw that I should normally be able to pass the argument "return_train_score=true". However, when I try it in Julia, I get a MethodError.

grid_search = GridSearchCV(clf, param_grid,return_train_score=true)

Does anyone know how to retrieve train scores correctly? Thanks!

cstjean commented 4 years ago

Could you please post the error message? Some minimal code that demonstrates the issue would be appreciated as well!

CBongiova commented 4 years ago

@cstjean Thanks for replying

The code is actually pretty straightforward (I took an example from the scikitlearn website):

```julia
using Printf, Statistics  # needed for @printf and std

############# Grid search
clf = RandomForestClassifier(n_estimators=100, class_weight="balanced_subsample", criterion="entropy", max_depth=30)

# Utility function to report best scores
function report(grid_scores, n_top=3)
    top_scores = sort(grid_scores, by=x->x.mean_validation_score, rev=true)[1:n_top]
    for (i, score) in enumerate(top_scores)
        println("Model with rank:$i")
        @printf("Mean validation score: %.3f (std: %.3f)\n",
                score.mean_validation_score, std(score.cv_validation_scores))
        println("Parameters: $(score.parameters)")
        println("")
    end
end

# use a full grid over all parameters
param_grid = Dict("max_features" => [1, 6, 12],
                  "min_samples_leaf" => [1, 5, 10],
                  "min_impurity_decrease" => [0, 0.1, 0.3],
                  "min_samples_split" => [2, 5, 10])

# run grid search
grid_search = GridSearchCV(clf, param_grid, return_train_score=true)

start = @elapsed begin
    fit!(grid_search, features_new, labels_new)
end
println("GridSearchCV took $start seconds")

report(grid_search.grid_scores_)
```

I guess you can use the dataset from https://scikitlearnjl.readthedocs.io/en/latest/quickstart/ for testing.

The error message is:

MethodError: no method matching GridSearchCV(; estimator=PyObject RandomForestClassifier(bootstrap=True, class_weight='balanced_subsample',
                   criterion='entropy', max_depth=30, max_features='auto',
                   max_leaf_nodes=None, min_impurity_decrease=0.0,
                   min_impurity_split=None, min_samples_leaf=1,
                   min_samples_split=2, min_weight_fraction_leaf=0.0,
                   n_estimators=100, n_jobs=None, oob_score=False,
                   random_state=None, verbose=0, warm_start=False), param_grid=Dict{String,Array{T,1} where T}("min_samples_split" => [2, 5, 10],"min_impurity_decrease" => [0.0, 0.1, 0.3],"min_samples_leaf" => [1, 5, 10],"max_features" => [1, 6, 12]), return_train_score=true)

Closest candidates are:
  GridSearchCV(; estimator, param_grid, scoring, loss_func, score_func, fit_params, n_jobs, iid, refit, cv, verbose, error_score, scorer_, best_params_, best_score_, grid_scores_, best_estimator_) at /Users/admin/.juliapro/JuliaPro_v1.2.0-1/packages/Parameters/l76EM/src/Parameters.jl:466 got unsupported keyword argument "return_train_score"
  GridSearchCV(!Matched::GridSearchCV; kws...) at /Users/admin/.juliapro/JuliaPro_v1.2.0-1/packages/Parameters/l76EM/src/Parameters.jl:528
  GridSearchCV(!Matched::GridSearchCV, !Matched::AbstractDict) at /Users/admin/.juliapro/JuliaPro_v1.2.0-1/packages/Parameters/l76EM/src/Parameters.jl:531 got unsupported keyword arguments "estimator", "param_grid", "return_train_score"
  ...

kwerr(::NamedTuple{(:estimator, :param_grid, :return_train_score),Tuple{PyObject,Dict{String,Array{T,1} where T},Bool}}, ::Type) at error.jl:125
(::getfield(Core, Symbol("#kw#Type")))(::NamedTuple{(:estimator, :param_grid, :return_train_score),Tuple{PyObject,Dict{String,Array{T,1} where T},Bool}}, ::Type{GridSearchCV}) at none:0
#GridSearchCV#110(::Base.Iterators.Pairs{Symbol,Bool,Tuple{Symbol},NamedTuple{(:return_train_score,),Tuple{Bool}}}, ::Type{GridSearchCV}, ::PyObject, ::Dict{String,Array{T,1} where T}) at grid_search.jl:545
(::getfield(Core, Symbol("#kw#Type")))(::NamedTuple{(:return_train_score,),Tuple{Bool}}, ::Type{GridSearchCV}, ::PyObject, ::Dict{String,Array{T,1} where T}) at none:0
top-level scope at Train_ML.jl:480
include_string(::Module, ::String, ::String) at sys.dylib:?
include_string(::Module, ::String, ::String, ::Int64) at eval.jl:30
(::getfield(Atom, Symbol("##127#132")){String,Int64,String,Bool})() at eval.jl:94
withpath(::getfield(Atom, Symbol("##127#132")){String,Int64,String,Bool}, ::String) at utils.jl:30
withpath at eval.jl:47 [inlined]
#126 at eval.jl:93 [inlined]
with_logstate(::getfield(Atom, Symbol("##126#131")){String,Int64,String,Bool}, ::Base.CoreLogging.LogState) at logging.jl:395
with_logger at logging.jl:491 [inlined]
#125 at eval.jl:92 [inlined]
hideprompt(::getfield(Atom, Symbol("##125#130")){String,Int64,String,Bool}) at repl.jl:85
macro expansion at eval.jl:91 [inlined]
macro expansion at dynamic.jl:24 [inlined]
(::getfield(Atom, Symbol("##124#129")))(::Dict{String,Any}) at eval.jl:86
handlemsg(::Dict{String,Any}, ::Dict{String,Any}) at comm.jl:164
(::getfield(Atom, Symbol("##19#21")){Array{Any,1}})() at task.jl:268
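The key part of this trace is `got unsupported keyword argument "return_train_score"`: the keyword constructor only accepts the field names it lists, and anything else raises a MethodError. The following is a minimal sketch that reproduces the same kind of failure without ScikitLearn.jl. `ToySearch` is a hypothetical stand-in type, and `Base.@kwdef` is used here in place of the Parameters.jl macro that appears in the trace, so the details of the real constructor may differ.

```julia
# Hypothetical stand-in for a search object; NOT ScikitLearn.jl's actual type.
Base.@kwdef struct ToySearch
    estimator
    param_grid
    refit::Bool = true
end

# Positional convenience constructor that forwards extra keywords
# to the keyword constructor generated by @kwdef.
ToySearch(estimator, param_grid; kws...) =
    ToySearch(; estimator=estimator, param_grid=param_grid, kws...)

# A keyword that matches a field is accepted.
ok = ToySearch("clf", Dict("a" => [1, 2]); refit=false)

# A keyword that is not a field is rejected with a MethodError.
err = try
    ToySearch("clf", Dict("a" => [1, 2]); return_train_score=true)
catch e
    e
end

println(typeof(err))
```

In other words, the error says nothing about the value passed; the keyword name itself is simply not one the constructor knows about.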

cstjean commented 4 years ago

Thank you for the bug report. My best guess is that return_train_score is a "new" parameter. ScikitLearn.jl unfortunately lags behind the Python scikit-learn by a few years. I don't have time to look into it at the moment, but if you would like to make a pull request implementing it, it would be appreciated!
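As long as the keyword is not available, the quantity it would report can be computed by hand: for every parameter setting and every fold, fit on the training indices and score on both the training and the held-out indices. The sketch below illustrates that loop in plain Julia under loud assumptions: the model is a toy ridge regression rather than a random forest, and `fit_ridge`, `r2`, `kfold_indices`, and `grid_search_with_train_scores` are hypothetical helper names, not part of ScikitLearn.jl.

```julia
using LinearAlgebra, Statistics, Random

# Toy estimator (assumption): ridge regression, w = (X'X + λI) \ X'y
fit_ridge(X, y, λ) = (X' * X + λ * I) \ (X' * y)

# R²-style score of the weights w on the data (X, y)
function r2(X, y, w)
    resid = y .- X * w
    return 1 - sum(abs2, resid) / sum(abs2, y .- mean(y))
end

# Split 1:n into k interleaved folds; returns (train, test) index pairs
function kfold_indices(n, k)
    folds = [collect(i:k:n) for i in 1:k]
    return [(setdiff(1:n, test), test) for test in folds]
end

# For each λ, record the mean training score and mean validation score
function grid_search_with_train_scores(X, y, λs; k=3)
    results = NamedTuple[]
    for λ in λs
        train_scores = Float64[]
        test_scores = Float64[]
        for (train, test) in kfold_indices(size(X, 1), k)
            w = fit_ridge(X[train, :], y[train], λ)
            push!(train_scores, r2(X[train, :], y[train], w))
            push!(test_scores, r2(X[test, :], y[test], w))
        end
        push!(results, (λ=λ,
                        mean_train_score=mean(train_scores),
                        mean_validation_score=mean(test_scores)))
    end
    return results
end

# Synthetic data, for illustration only
rng = MersenneTwister(0)
X = randn(rng, 60, 3)
y = X * [1.0, -2.0, 0.5] .+ 0.1 .* randn(rng, 60)

results = grid_search_with_train_scores(X, y, [0.0, 1.0, 100.0])
for r in results
    println(r)
end
```

The same loop structure should carry over to a real estimator by swapping the fit and score steps for the estimator's own, at the cost of fitting each model once more than a built-in option would need.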

Rohp001 commented 4 years ago

You should probably try GridSearchCV(return_train_score=True), with a capital T. Since it takes a boolean input, that may be why the lower-case 't' is not working.

CBongiova commented 4 years ago

@Rohp001 I don't think this is the issue; the capital "T" is Python syntax. Julia uses lower-case "true" for the boolean value.
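A quick check of that point, as a small sketch in plain Julia: `true` is the Bool literal, while `True` is an ordinary identifier that is not defined unless some code defines it. When the estimator is a Python object wrapped through PyCall, as the `PyObject` in the trace suggests, Julia Bool values are converted to Python booleans automatically, so the lower-case spelling is the one to use on the Julia side.

```julia
flag = true                      # Julia's Bool literal is lower-case
println(typeof(flag))            # Bool

# `True` is just an identifier; check whether anything has defined it here
true_is_defined = isdefined(@__MODULE__, :True)
println(true_is_defined)
```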

Rohp001 commented 4 years ago

Oh, sorry, I didn't see which language you were using. My bad!