EducationalTestingService / skll

SciKit-Learn Laboratory (SKLL) makes it easy to run machine learning experiments.
http://skll.readthedocs.org

`Learner._check_input_formatting()` does not work for dense featuresets #656

Closed desilinguist closed 3 years ago

desilinguist commented 3 years ago

This method is called by `Learner._train_setup()`. It checks that regression labels are not strings and that feature values (for both classification and regression) are not strings. However, it raises a `NotImplementedError` if the featureset is read in as dense rather than sparse. Here's a minimal test case:

>>> from skll import Learner
>>> from skll.data import NDJReader
>>> fs1 = NDJReader.for_path("examples/boston/train/example_boston_features.jsonlines", sparse=False).read()
>>> l1 = Learner('LinearRegression')
>>> fs2 = NDJReader.for_path("examples/iris/train/example_iris_features.jsonlines", sparse=False).read()
>>> l2 = Learner('LogisticRegression')
>>> l1.train(fs1, grid_search=False)
...
~/work/skll/skll/learner/__init__.py in _check_input_formatting(self, examples)
    664         # make sure that feature values are not strings
    665         # we need to check this for both sparse and dense arrays
--> 666         for val in examples.features.data:
    667             if isinstance(val, str):
    668                 raise TypeError("You have feature values that are strings.  "

NotImplementedError: multi-dimensional sub-views are not implemented

>>> l2.train(fs2, grid_search=False)
....
~/work/skll/skll/learner/__init__.py in _check_input_formatting(self, examples)
    664         # make sure that feature values are not strings
    665         # we need to check this for both sparse and dense arrays
--> 666         for val in examples.features.data:
    667             if isinstance(val, str):
    668                 raise TypeError("You have feature values that are strings.  "

NotImplementedError: multi-dimensional sub-views are not implemented

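The solution is to explicitly reshape the dense feature array into a 1-dimensional array and iterate over that, rather than iterating over its `.data` attribute. For a dense NumPy array, `.data` is a multi-dimensional memoryview, which cannot be iterated over directly.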
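A minimal sketch of that idea, assuming the features are either a dense NumPy array or a sparse-style object whose `.data` is already 1-dimensional. The helper name `has_string_values` is hypothetical and this is not the actual SKLL patch; it only illustrates why flattening first avoids the error.

```python
import numpy as np


def has_string_values(features):
    """Return True if any feature value is a string (hypothetical helper)."""
    if isinstance(features, np.ndarray):
        # Dense case: `.data` is a multi-dimensional memoryview, so
        # flatten the array itself and iterate over the 1-D result.
        values = features.reshape(-1)
    else:
        # Assumed sparse case: `.data` is a 1-D sequence of stored values.
        values = features.data
    return any(isinstance(val, str) for val in values)


dense = np.array([[1.0, 2.0], [3.0, 4.0]])

# Iterating over the raw buffer of a 2-D array reproduces the failure.
try:
    for val in dense.data:
        pass
except NotImplementedError as exc:
    print("dense .data is not iterable:", exc)

# Iterating over the flattened array works for dense input.
print(has_string_values(dense))
print(has_string_values(np.array([[1.0, "a"]], dtype=object)))
```

With this shape, the same string check can serve both dense and sparse featuresets, since each branch ends up with a 1-dimensional sequence of values.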