GridSearch with pipelines of dataframes

mratsim commented 7 years ago

Hello again Cédric,

Following your help with transformers, I am now trying to use GridSearchCV to optimize the hyperparameters of a RandomForest.

I have a pipeline with many transformers that works well for cross-validation and for actual prediction. However, I get a type error when I use it in a GridSearchCV; it looks like an extra argument of type ScikitLearn.Skcore.ParameterGrid is being passed in my setup:

pipe = Pipelines.Pipeline([ # This is working fine for cross validation, fitting and predicting
    ("extract_deck", PP_DeckTransformer()),
    ... # A list of 15 transformers
    ("featurize", mapper), # This is a DataFrameMapper to convert to Array
    ("forest", RandomForestClassifier(ntrees=200)) # Hyperparam: nsubfeatures, partialsampling, maxdepth
])

X_train = train
Y_train = convert(Array, train[:Survived])

# #Cross Validation - check model accuracy -- This is working fine
# crossval = round(cross_val_score(pipe, X_train, Y_train, cv =10), 2)
# print("\n",crossval,"\n")
# print(mean(crossval))

# GridSearch
grid = Dict(:ntrees => 10:30:240,
            :nsubfeatures => 0:1:13,
            :partialsampling => 0.2:0.1:1.0,
            :maxdepth => -1:2:13
)

gridsearch = GridSearchCV(pipe, grid)
fit!(gridsearch, X_train, Y_train)
println("Best hyper-parameters: $(gridsearch.best_params_)")

The error I get is:

ERROR: LoadError: MethodError: no method matching _fit!(::ScikitLearn.Skcore.GridSearchCV, ::DataFrames.DataFrame, ::Array{Int64,1}, ::ScikitLearn.Skcore.ParameterGrid)
Closest candidates are:
  _fit!(::ScikitLearn.Skcore.BaseSearchCV, !Matched::AbstractArray{T,N}, ::Any, ::Any) at /Users/<user>/.julia/v0.5/ScikitLearn/src/grid_search.jl:254
 in fit!(::ScikitLearn.Skcore.GridSearchCV, ::DataFrames.DataFrame, ::Array{Int64,1}) at /Users/<user>/.julia/v0.5/ScikitLearn/src/grid_search.jl:526
 in include_from_node1(::String) at ./loading.jl:488
 in include_from_node1(::String) at /usr/local/Cellar/julia/0.5.0/lib/julia/sys.dylib:?
 in process_options(::Base.JLOptions) at ./client.jl:262
 in _start() at ./client.jl:318
 in _start() at /usr/local/Cellar/julia/0.5.0/lib/julia/sys.dylib:?
while loading /Users/<path>/Kaggle-001-Julia-MagicalForest.jl, in expression starting on line 538

So _fit! is being called as _fit!(::ScikitLearn.Skcore.GridSearchCV, ::DataFrames.DataFrame, ::Array{Int64,1}, ::ScikitLearn.Skcore.ParameterGrid), but the closest method expects an AbstractArray rather than a DataFrame as its second argument. I expected the DataFrame to have been converted to an Array by the DataFrameMapper by that point.

If needed, the full code is here: https://github.com/mratsim/MachineLearning_Kaggle/blob/9c07a64a981a6512e021ae01623212a278fd05d1/Kaggle%20-%20001%20-%20Titanic%20Survivors/Kaggle-001-Julia-MagicalForest.jl#L530

cstjean commented 7 years ago

Hi, thank you for filing an issue about this. That's definitely a bug. I don't think DataFrames have ever been tested as input to grid search. I just removed the AbstractArray type annotation. Could you please try it out again? (Pkg.checkout("ScikitLearn"))

I'll have more time to look into it tomorrow.
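
For readers following along, the dispatch failure in the stack trace can be reproduced in isolation. The sketch below is only an illustration and is not the ScikitLearn.jl source: `FakeTable`, `strict_fit` and `relaxed_fit` are hypothetical names, and the code is written for Julia 1.x rather than the 0.5 release used in this thread. It shows why a method whose argument is annotated `::AbstractArray` rejects a table type that is not an array subtype, and why dropping the annotation lets the call through.

```julia
# Hypothetical stand-in for a DataFrame: a table type that is NOT an AbstractArray.
struct FakeTable
    columns::Dict{Symbol,Vector{Float64}}
end

# Restrictive signature, similar in spirit to the `_fit!` candidate in the error:
# the data argument must be an AbstractArray.
strict_fit(X::AbstractArray, y, grid) = "fitted on $(size(X, 1)) rows"

# Relaxed signature: X is left untyped, so any container is accepted and the
# conversion to an array is left to later steps.
relaxed_fit(X, y, grid) = "accepted $(typeof(X))"

table = FakeTable(Dict(:a => randn(5), :b => randn(5)))
y = rand(0:1, 5)
grid = [Dict(:n => 10)]

# A plain matrix matches the strict signature.
println(strict_fit(randn(5, 2), y, grid))

# The table does not match, so Julia raises a MethodError at dispatch time.
strict_failed = try
    strict_fit(table, y, grid)
    false
catch err
    err isa MethodError
end
println("strict_fit rejected the table: ", strict_failed)

# With the annotation removed, the same call is dispatched normally.
println(relaxed_fit(table, y, grid))
```

Removing the annotation only gets the call past dispatch; whether the rest of the grid search can handle a non-array input is a separate question.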

cstjean commented 7 years ago

Pull requests are welcome.

cstjean commented 7 years ago

It looks like this isn't possible with scikit-learn in Python either. See https://github.com/paulgb/sklearn-pandas/issues/61. Some proposed solutions are in https://github.com/paulgb/sklearn-pandas/pull/62 and https://github.com/paulgb/sklearn-pandas/pull/64.

The primary challenge is to implement get_params/set_params for DataFrameMapper. Here's the code I used to test it:

using DataFrames: DataFrame
using ScikitLearn
using ScikitLearn.GridSearch: GridSearchCV
@sk_import ensemble: RandomForestClassifier
@sk_import preprocessing: StandardScaler

X_train = DataFrame(Any[randn(100), randn(100)], [:a, :b])
Y_train = rand(0:1, 100)

mapper = DataFrameMapper([([:a, :b], StandardScaler())])
pipe = Pipelines.Pipeline([ 
    ("featurize", mapper), 
    ("forest", RandomForestClassifier(n_estimators=200))
    ])

# GridSearch
grid = Dict(:forest__n_estimators => 10:30:240)

gridsearch = GridSearchCV(pipe, grid)
fit!(gridsearch, X_train, Y_train)
println("Best hyper-parameters: $(gridsearch.best_params_)")
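
The grid key `:forest__n_estimators` above follows the scikit-learn convention of addressing a pipeline step's parameter as `step__param`. As a rough sketch of what that routing involves (this is not how ScikitLearn.jl implements it; `split_param` and `route_params` are hypothetical helpers, written for Julia 1.x), a grid point can be grouped by step like this:

```julia
# Split a grid key such as :forest__n_estimators into (step name, parameter name).
# Keys without "__" are treated here as parameters of the pipeline object itself.
function split_param(key::Symbol)
    parts = split(String(key), "__"; limit=2)
    if length(parts) == 2
        return (String(parts[1]), Symbol(parts[2]))
    else
        return (nothing, Symbol(parts[1]))
    end
end

# Group one flat grid point (a Dict of key => value) by pipeline step.
function route_params(params::Dict{Symbol})
    routed = Dict{Union{String,Nothing},Dict{Symbol,Any}}()
    for (key, value) in params
        step, name = split_param(key)
        get!(routed, step, Dict{Symbol,Any}())[name] = value
    end
    return routed
end

# Usage: one prefixed key and one unprefixed key.
routed = route_params(Dict(:forest__n_estimators => 10, :ntrees => 200))
println(routed)
```

Under this convention, each step named in a key has to expose its parameters for reading and writing, which is presumably why get_params/set_params support on DataFrameMapper matters. It also suggests that an unprefixed key such as `:ntrees` would be read as a parameter of the pipeline rather than of the "forest" step, though that is an inference from the convention and not something tested here.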

cstjean / ScikitLearn.jl

GridSearch with pipelines of dataframes #24