cstjean / ScikitLearn.jl

Julia implementation of the scikit-learn API https://cstjean.github.io/ScikitLearn.jl/dev/

PyError with OneHotEncoder (Julia 0.6.0 on Windows10) #32

Open ValdarT opened 7 years ago

ValdarT commented 7 years ago

I'm getting a PyError with the following code:

using DataFrames
using ScikitLearn
@sk_import preprocessing: OneHotEncoder

df = DataFrame(A = 1:4, B = ["M", "F", "F", "M"])

mapper = DataFrameMapper([([:B], OneHotEncoder())]);

fit_transform!(mapper, df)
ERROR: PyError (ccall(@pysym(:PyObject_Call), PyPtr, (PyPtr, PyPtr, PyPtr), o, arg, C_NULL)) <type 'exceptions.ValueError'>
ValueError('could not convert string to float: M',)
  File "C:\Users\...\.julia\v0.6\Conda\deps\usr\lib\site-packages\sklearn\preprocessing\data.py", line 1844, in fit
    self.fit_transform(X)
  File "C:\Users\...\.julia\v0.6\Conda\deps\usr\lib\site-packages\sklearn\preprocessing\data.py", line 1902, in fit_transform
    self.categorical_features, copy=True)
  File "C:\Users\...\.julia\v0.6\Conda\deps\usr\lib\site-packages\sklearn\preprocessing\data.py", line 1697, in _transform_selected
    X = check_array(X, accept_sparse='csc', copy=copy, dtype=FLOAT_DTYPES)
  File "C:\Users\...\.julia\v0.6\Conda\deps\usr\lib\site-packages\sklearn\utils\validation.py", line 382, in check_array
    array = np.array(array, dtype=dtype, order=order, copy=copy)

It seems specific to OneHotEncoder. For example, LabelBinarizer works fine like this:

mapper = DataFrameMapper([(:B, LabelBinarizer())]);

I'm on Windows 10 using Julia 0.6.0. Package versions:

- Conda                         0.5.3
- DataArrays                    0.5.3
- DataFrames                    0.10.0
- PyCall                        1.14.0
- ScikitLearn                   0.3.0
- ScikitLearnBase               0.3.0

I let ScikitLearn.jl automatically handle the installation of Python dependencies. The installed versions are:

python                    2.7.13
numpy                     1.13.0
scikit-learn              0.18.2
cstjean commented 7 years ago

It's probably a bug, but have you checked whether the equivalent code works in Python?

You can use ScikitLearn.Preprocessing.DictEncoder() until this gets fixed. The semantics are a bit different, but for single-column input matrices it should be the same:

DictEncoder()

For every unique row in the input matrix, associate a 0/1 binary column in the output matrix. This is similar to OneHotEncoder, but considers the entire row as a single value for one-hot-encoding. It works with any hashable datatype.

It is common to use it inside a DataFrameMapper, with a particular subset of the columns.
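The row-as-a-single-value semantics described above can be sketched in plain Python (an illustrative sketch of the idea only, not the actual Julia implementation):

```python
def dict_encode(rows):
    """One-hot encode whole rows: every unique row value gets its own
    0/1 column, in order of first appearance. Works with any hashable
    value (strings, ints, tuples, ...)."""
    index = {}  # row value -> column index
    for r in rows:
        if r not in index:
            index[r] = len(index)
    out = [[0] * len(index) for _ in rows]
    for i, r in enumerate(rows):
        out[i][index[r]] = 1
    return out

# The :B column from the example above: "M" -> column 0, "F" -> column 1.
print(dict_encode(["M", "F", "F", "M"]))  # [[1, 0], [0, 1], [0, 1], [1, 0]]
```

Because rows are only used as dictionary keys, any hashable type works, which is the key difference from the float-only OneHotEncoder path that fails above.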

cstjean commented 7 years ago

Thank you for the detailed bug report!

ValdarT commented 7 years ago

Sorry, my mistake. Turns out OneHotEncoder only accepts integer values. Rather unexpected in my opinion, but it's clearly stated in the docs. At least I'm not the only one: https://github.com/pandas-dev/sklearn-pandas/issues/63. : )
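For reference, the integer-only restriction means string categories have to be mapped to integer codes before one-hot encoding (this is essentially what LabelEncoder does on the Python side). A plain-Python sketch of that step, for illustration only:

```python
def integer_encode(values):
    """Map each distinct value to an integer code (in sorted order),
    producing the kind of integer input that old OneHotEncoder
    versions require."""
    codes = {v: i for i, v in enumerate(sorted(set(values)))}
    return [codes[v] for v in values]

print(integer_encode(["M", "F", "F", "M"]))  # F -> 0, M -> 1: [1, 0, 0, 1]
```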

However, I still get an 'invalid Array dimensions' error with this code:

using DataFrames
using ScikitLearn
@sk_import preprocessing: OneHotEncoder

df = DataFrame(A = 1:4, B = ["M", "F", "F", "M"])

mapper = DataFrameMapper([([:A], OneHotEncoder())]);

fit_transform!(mapper, df)
invalid Array dimensions

Stacktrace:
 [1] Array{Float64,N} where N(::Tuple{Int64}) at .\boot.jl:317
 [2] py2array(::Type{T} where T, ::PyCall.PyObject) at C:\Users\...\.julia\v0.6\PyCall\src\conversions.jl:381
 [3] convert(::Type{Array{Float64,2}}, ::PyCall.PyObject) at C:\Users\...\.julia\v0.6\PyCall\src\numpy.jl:480
 [4] transform(::ScikitLearn.DataFrameMapper, ::DataFrames.DataFrame) at C:\Users\...\.julia\v0.6\ScikitLearn\src\dataframes.jl:150
 [5] #fit_transform!#16(::Array{Any,1}, ::Function, ::ScikitLearn.DataFrameMapper, ::DataFrames.DataFrame, ::Void) at C:\Users\...\.julia\v0.6\ScikitLearnBase\src\ScikitLearnBase.jl:152
 [6] fit_transform!(::ScikitLearn.DataFrameMapper, ::DataFrames.DataFrame) at C:\Users\...\.julia\v0.6\ScikitLearnBase\src\ScikitLearnBase.jl:152

although the equivalent code works fine in Python:

import pandas as pd
from sklearn_pandas import DataFrameMapper
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({'A': [1,2,3,4], 'B': ["M", "F", "F", "M"]})
mapper = DataFrameMapper([(['A'], OneHotEncoder())])

mapper.fit_transform(df)
ValdarT commented 7 years ago

Fortunately, a change allowing OneHotEncoder to accept strings is in the works: https://github.com/scikit-learn/scikit-learn/issues/4920

cstjean commented 7 years ago

Figured it out: OneHotEncoder returns a sparse matrix by default, which PyCall doesn't know how to convert (https://github.com/JuliaPy/PyCall.jl/issues/204). It would have to be fixed there, or at the very least there should be a clearer error message on that end.

Fortunately, you can solve the problem with OneHotEncoder(sparse=false).
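The difference between the two output modes can be sketched in plain Python (illustrative only, not the sklearn implementation): sparse output stores just the nonzero coordinates, which is the kind of object PyCall cannot convert automatically, while dense output is an ordinary array.

```python
def one_hot(codes, n_categories, sparse=True):
    """One-hot encode integer codes. sparse=True returns only the nonzero
    entries as a {(row, col): 1} dict; sparse=False materializes the full
    dense matrix as a list of rows."""
    if sparse:
        return {(i, c): 1 for i, c in enumerate(codes)}
    return [[1 if j == c else 0 for j in range(n_categories)] for c in codes]

print(one_hot([0, 1, 1, 0], 2))                # {(0, 0): 1, (1, 1): 1, (2, 1): 1, (3, 0): 1}
print(one_hot([0, 1, 1, 0], 2, sparse=False))  # [[1, 0], [0, 1], [0, 1], [1, 0]]
```

Asking for dense output up front, as OneHotEncoder(sparse=false) does, sidesteps the conversion problem entirely.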

Turns out OneHotEncoder only accepts integer values

Use DictEncoder! It's pure Julia, so it'll be way faster than OneHotEncoder, and it will work with any hashable type (almost anything).

ValdarT commented 7 years ago

Thank you!

Use DictEncoder!

Will do.

Hopefully we can soon replace all the preprocessing steps with pure Julia implementations. The work at JuliaML seems to be getting there step by step.