EducationalTestingService / skll

SciKit-Learn Laboratory (SKLL) makes it easy to run machine learning experiments.
http://skll.readthedocs.org
Other
551 stars 67 forks source link

SVR kernel string issues with dense data in python 2.7 #87

Closed mheilman closed 6 years ago

mheilman commented 10 years ago

If feature scaling is set to both when using SVR (causing dense matrices to be used), then SVR complains (see below).

I'm guessing this started with the 0.17.1 bug fix release that aimed to fix a similar issue for python 3.

(python 2 unicode support strikes again!)

Traceback (most recent call last):
  File "/opt/python/2.7/lib/python2.7/site-packages/gridmap/job.py", line 196, in execute
    self.ret = self.function(*self.args, **self.kwlist)
  File "/opt/python/2.7/lib/python2.7/site-packages/skll/experiments.py", line 656, in _classify_featureset
    grid_jobs=grid_search_jobs)
  File "/opt/python/2.7/lib/python2.7/site-packages/skll/learner.py", line 1116, in cross_validate
    shuffle=False))
  File "/opt/python/2.7/lib/python2.7/site-packages/skll/learner.py", line 809, in train
    grid_searcher.fit(xtrain, classes)
  File "/opt/python/2.7/lib/python2.7/site-packages/sklearn/grid_search.py", line 707, in fit
    return self._fit(X, y, ParameterGrid(self.param_grid))
  File "/opt/python/2.7/lib/python2.7/site-packages/sklearn/grid_search.py", line 493, in _fit
    for parameters in parameter_iterable
  File "/opt/python/2.7/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 517, in __call__
    self.dispatch(function, args, kwargs)
  File "/opt/python/2.7/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 312, in dispatch
    job = ImmediateApply(func, args, kwargs)
  File "/opt/python/2.7/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 136, in __init__
    self.results = func(*args, **kwargs)
  File "/opt/python/2.7/lib/python2.7/site-packages/sklearn/grid_search.py", line 306, in fit_grid_point
    clf.fit(X_train, y_train, **fit_params)
  File "/opt/python/2.7/lib/python2.7/site-packages/skll/learner.py", line 224, in fit
    orig_fit(self, X, y=y)
  File "/opt/python/2.7/lib/python2.7/site-packages/sklearn/svm/base.py", line 178, in fit
    fit(X, y, sample_weight, solver_type, kernel, random_seed=seed)
  File "/opt/python/2.7/lib/python2.7/site-packages/sklearn/svm/base.py", line 233, in _dense_fit
    max_iter=self.max_iter, random_seed=random_seed)
TypeError: Argument 'kernel' has incorrect type (expected str, got unicode)
done
dan-blanchard commented 10 years ago

It's really strange that you're only seeing that with scaling set to both. From looking at the code, it seems like that should happen all the time, because this line incorrectly says decode instead of encode.

mheilman commented 10 years ago

It looks like sklearn uses different fit methods for SVR (and maybe other SVM models) depending on whether the data is sparse or dense. I think one may expect a string while the other expects bytes or something.

You added that decode for python 3, I think. Are you sure it will work for all combinations of {python2, python3} and {feature_scaling=True, feature_scaling=False}?

We probably should add a unit test to cover feature_scaling with regression.

dan-blanchard commented 10 years ago

The encode line only gets run for Python 2.7, so it won't break anything for Python 3. It shouldn't have worked at all for Python 2 previously, but I guess, like you said, one fit method can handle unicode or bytes, whereas the other can only handle bytes.

jenniferzhu commented 6 years ago

This issue is still happening to python2 if I install anaconda3 and then create a python 2.7 environment in anaconda3.

desilinguist commented 6 years ago

Sorry, can you please be a little more clear? If you can provide the SKLL version you are using and the config file for the exact experiment you are running, that would be very useful since I am unable to replicate the issue.

JoshuaMeyers commented 6 years ago

I've had a similar issue. For me it lies with the predict_proba function and the ability of the SVC class to handle mixed types (unicodes and strings) for the kernel parameter. The bound on this is because the libSVM package only handles strings. As stated above this only matters in the case of dense matrices for some reason...

Here is the error recreated:

from sklearn import svm
clf = svm.SVC(kernel=u'rbf', probability=True) # note the unicode object
clf.fit([[1,2],[2,1],[1,2]], [1,0,1])
clf.predict([[2,1]]) # works fine
clf.predict_proba([[2,1]]) 

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-98-513295ea1832> in <module>()
      3 clf.fit([[1,2],[2,1],[1,2]], [1,0,1])
      4 clf.predict([[2,1]])
----> 5 clf.predict_proba([[2,1]])

/anaconda2/lib/python2.7/site-packages/sklearn/svm/base.pyc in _predict_proba(self, X)
    598         pred_proba = (self._sparse_predict_proba
    599                       if self._sparse else self._dense_predict_proba)
--> 600         return pred_proba(X)
    601 
    602     @property

/anaconda2/lib/python2.7/site-packages/sklearn/svm/base.pyc in _dense_predict_proba(self, X)
    646                 if six.PY3:
    647                     # In python3 ensure kernel is utf8 unicode to prevent a TypeError
--> 648                     if isinstance(kernel, bytes):
    649                         kernel = str(kernel, 'utf8')
    650 

TypeError: Argument 'kernel' has incorrect type (expected str, got unicode)

This error does not occur when the kernel parameter is transformed to a string.

This is explicitly fixed in the predict function and as this behaviour is inconsistent with other parameters in sci-kit learn, I have raised a merge request for this

In the short term, kernel=str(u'rbf') should do the trick!

desilinguist commented 6 years ago

Thanks, @JoshuaMeyers for raising a merge request for this on the scikit-learn side which is where it should really be fixed! SKLL will inherit the fix when we do the next release.

And thanks for the workaround as well!

desilinguist commented 6 years ago

We will implement the temporary workaround from @JoshuaMeyers in the next .1 release probably in early 2018.

JoshuaMeyers commented 6 years ago

No prob! For reference the PR is here https://github.com/scikit-learn/scikit-learn/pull/10338/commits/45cd060adf19296438112d1857fa19029d699bf3

desilinguist commented 6 years ago

@JoshuaMeyers Do you have an easy way that I can replicate this in SKLL? I am having trouble replicating it because SKLL almost always uses sparse feature arrays.

desilinguist commented 6 years ago

I just tried to replicate this issue both via the API and via a config file but couldn't. In both cases, I was able to generate predictions with probabilities just fine. I am also setting feature_scaling to both to force dense arrays being passed to scikit-learn.

API code:

from skll import FeatureSet, Learner

l = Learner('SVC', model_kwargs={'kernel': 'rbf'}, feature_scaling='both', probability=True)
train_fs = FeatureSet('train', ids=[1, 2, 3], features=[{'a': 1, 'b': 2}, {'a': 2, 'b': 1}, {'a': 1, 'b': 2}], labels=[1, 0, 1])
score = l.train(train_fs, grid_search=False)
test_fs = FeatureSet('test', ids=[1], features=[{'a': 2, 'b': 1}])
predictions = l.predict(test_fs)
print(predictions)

Config file:

[General]
experiment_name = test_kernel
task = predict

[Input]
train_file = train.csv
test_file = test.csv
learners = ["SVC"]
fixed_parameters = [{'kernel': 'rbf'}]
feature_scaling = both
label_col = y
id_col = id

[Tuning]
grid_search = false

[Output]
probability = true
log = output
models = output
predictions = output

The config file uses the following two feature files:

(a) train.csv

id,a,b,y
1,1,2,1
2,2,1,0
3,1,2,1

(b) test.csv

id,a,b
1,2,1

@JoshuaMeyers @jenniferzhu if you can please let me know how to reproduce this error in SKLL, please do. Otherwise I will close this issue.

JoshuaMeyers commented 6 years ago

Leave feature_scaling='both' and change the argument for kernel to a unicode object: fixed_parameters = [{'kernel': u'rbf'}]

desilinguist commented 6 years ago

But why would anyone do that deliberately when it works just fine with ‘rbf’? On Wed, Dec 20, 2017 at 5:15 AM JoshuaMeyers notifications@github.com wrote:

Leave feature_scaling='both' and change the argument for kernel to a unicode object: fixed_parameters = [{'kernel': u'rbf'}]

— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub https://github.com/EducationalTestingService/skll/issues/87#issuecomment-353022806, or mute the thread https://github.com/notifications/unsubscribe-auth/AAJ3kE8X3HrMnbFyTcsRNqRYcb12ixnLks5tCN4ugaJpZM4BOu3a .

desilinguist commented 6 years ago

I am going to close this issue since I don't see why one would unnecessarily use a unicode string when a plain old string works just fine as my tests above show.