Closed mheilman closed 6 years ago
It's really strange that you're only seeing that with scaling set to `both`. From looking at the code, it seems like that should happen all the time, because this line incorrectly says `decode` instead of `encode`.
It looks like sklearn uses different fit methods for SVR (and maybe other SVM models) depending on whether the data is sparse or dense. I think one may expect a string while the other expects bytes or something.
You added that `decode` for Python 3, I think. Are you sure it will work for all combinations of {Python 2, Python 3} and {`feature_scaling=True`, `feature_scaling=False`}?
We probably should add a unit test to cover `feature_scaling` with regression.
The `encode` line only gets run for Python 2.7, so it won't break anything for Python 3. It shouldn't have worked at all for Python 2 previously, but I guess, like you said, one fit method can handle `unicode` or `bytes`, whereas the other can only handle `bytes`.
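To make the `unicode`/`bytes` distinction concrete, here is a minimal, stdlib-only Python 3 sketch of the kind of guard that has to sit in front of the libsvm call (the helper name `normalize_kernel` is hypothetical, not part of scikit-learn or SKLL):

```python
def normalize_kernel(kernel):
    """Hypothetical helper: coerce a kernel name to the native str type,
    mirroring the isinstance(kernel, bytes) guard seen in the traceback."""
    if isinstance(kernel, bytes):
        # On Python 3, bytes must be decoded to text before use
        return kernel.decode('utf-8')
    return kernel

print(normalize_kernel('rbf'))   # already text, passed through unchanged
print(normalize_kernel(b'rbf'))  # decoded from bytes to str
```

On Python 2 the symmetric problem appears: the dense code path wants a byte string, so a `unicode` kernel name has to be encoded rather than decoded, which is why the fix is interpreter-specific.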
This issue still happens on Python 2 if I install Anaconda 3 and then create a Python 2.7 environment inside it.
Sorry, can you please be a little more clear? If you can provide the SKLL version you are using and the config file for the exact experiment you are running, that would be very useful since I am unable to replicate the issue.
I've had a similar issue. For me it lies with the `predict_proba` function and the inability of the `SVC` class to handle mixed types (unicode and byte strings) for the `kernel` parameter. The restriction exists because the libSVM wrapper only handles plain strings. As stated above, this only matters in the case of dense matrices for some reason...
Here is the error recreated:
```python
from sklearn import svm

clf = svm.SVC(kernel=u'rbf', probability=True)  # note the unicode object
clf.fit([[1, 2], [2, 1], [1, 2]], [1, 0, 1])
clf.predict([[2, 1]])        # works fine
clf.predict_proba([[2, 1]])  # raises the TypeError below
```
```
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-98-513295ea1832> in <module>()
      3 clf.fit([[1,2],[2,1],[1,2]], [1,0,1])
      4 clf.predict([[2,1]])
----> 5 clf.predict_proba([[2,1]])

/anaconda2/lib/python2.7/site-packages/sklearn/svm/base.pyc in _predict_proba(self, X)
    598         pred_proba = (self._sparse_predict_proba
    599                       if self._sparse else self._dense_predict_proba)
--> 600         return pred_proba(X)
    601
    602     @property

/anaconda2/lib/python2.7/site-packages/sklearn/svm/base.pyc in _dense_predict_proba(self, X)
    646         if six.PY3:
    647             # In python3 ensure kernel is utf8 unicode to prevent a TypeError
--> 648             if isinstance(kernel, bytes):
    649                 kernel = str(kernel, 'utf8')
    650

TypeError: Argument 'kernel' has incorrect type (expected str, got unicode)
```
This error does not occur when the `kernel` parameter is transformed to a string.
This is explicitly fixed in the `predict` function, and since this behaviour is inconsistent with other parameters in scikit-learn, I have raised a merge request for it.
In the short term, `kernel=str(u'rbf')` should do the trick!
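For what it's worth, the reason `str(u'rbf')` works on both interpreters: on Python 3, calling `str` on a `str` is an identity conversion, while on Python 2 it coerces an ASCII-only `unicode` object down to a byte string, which is the type the libsvm wrapper expects. A quick sanity check (Python 3 semantics shown):

```python
kernel = str(u'rbf')
# On Python 3 this is a no-op: u'rbf' is already a str.
# On Python 2 the same call returns the byte string 'rbf',
# provided the unicode object contains only ASCII characters.
assert kernel == 'rbf'
assert isinstance(kernel, str)
print(kernel)
```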
Thanks, @JoshuaMeyers, for raising a merge request for this on the scikit-learn side, which is where it should really be fixed! SKLL will inherit the fix when we do the next release.
And thanks for the workaround as well!
We will implement the temporary workaround from @JoshuaMeyers in the next .1 release probably in early 2018.
No prob! For reference the PR is here https://github.com/scikit-learn/scikit-learn/pull/10338/commits/45cd060adf19296438112d1857fa19029d699bf3
@JoshuaMeyers Do you have an easy way that I can replicate this in SKLL? I am having trouble replicating it because SKLL almost always uses sparse feature arrays.
I just tried to replicate this issue both via the API and via a config file but couldn't. In both cases, I was able to generate predictions with probabilities just fine. I am also setting `feature_scaling` to `both` to force dense arrays to be passed to scikit-learn.
API code:
```python
from skll import FeatureSet, Learner

l = Learner('SVC', model_kwargs={'kernel': 'rbf'}, feature_scaling='both', probability=True)
train_fs = FeatureSet('train', ids=[1, 2, 3],
                      features=[{'a': 1, 'b': 2}, {'a': 2, 'b': 1}, {'a': 1, 'b': 2}],
                      labels=[1, 0, 1])
score = l.train(train_fs, grid_search=False)
test_fs = FeatureSet('test', ids=[1], features=[{'a': 2, 'b': 1}])
predictions = l.predict(test_fs)
print(predictions)
```
Config file:
```
[General]
experiment_name = test_kernel
task = predict

[Input]
train_file = train.csv
test_file = test.csv
learners = ["SVC"]
fixed_parameters = [{'kernel': 'rbf'}]
feature_scaling = both
label_col = y
id_col = id

[Tuning]
grid_search = false

[Output]
probability = true
log = output
models = output
predictions = output
```
The config file uses the following two feature files:
(a) `train.csv`:

```
id,a,b,y
1,1,2,1
2,2,1,0
3,1,2,1
```

(b) `test.csv`:

```
id,a,b
1,2,1
```
@JoshuaMeyers @jenniferzhu if you can let me know how to reproduce this error in SKLL, please do. Otherwise I will close this issue.
Leave `feature_scaling='both'` and change the argument for `kernel` to a unicode object: `fixed_parameters = [{'kernel': u'rbf'}]`
But why would anyone do that deliberately when it works just fine with 'rbf'?
I am going to close this issue since I don't see why one would unnecessarily use a unicode string when a plain old string works just fine, as my tests above show.
If feature scaling is set to `both` when using SVR (causing dense matrices to be used), then SVR complains (see below).
I'm guessing this started with the 0.17.1 bugfix release that aimed to fix a similar issue for Python 3.
(Python 2 unicode support strikes again!)
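As background on why feature scaling forces dense matrices: mean-centering turns every stored zero into a nonzero value, so the scaled array can no longer be represented sparsely. A toy pure-Python illustration (no scikit-learn involved):

```python
col = [0, 0, 3, 0, 5]               # a mostly-zero feature column
mean = sum(col) / len(col)          # 1.6
centered = [x - mean for x in col]  # every former zero becomes -1.6
nonzeros = sum(1 for x in centered if x != 0)
print(nonzeros)                     # all 5 entries are now nonzero
```

This is why `feature_scaling=both` (which includes centering) is what triggers the dense code path, and with it the dense-only `kernel` type bug.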