christopher-beckham / weka-pyscript

WEKA classifier to execute arbitrary Python scripts
GNU General Public License v3.0

Problem evaluating classifier: Index: x, Size: x #7

Open chkoar opened 8 years ago

chkoar commented 8 years ago

On some datasets, e.g. anneal, I get this message:

2016-04-25 20:33:49 weka.gui.explorer.ClassifierPanel$18 run
INFO: Started weka.classifiers.pyscript.PyScriptClassifier
2016-04-25 20:33:49 weka.gui.explorer.ClassifierPanel$18 run
INFO: Command: weka.classifiers.pyscript.PyScriptClassifier -batch 100 -cmd python -script C:\Temp\sl.py -binarize -impute
java.lang.IndexOutOfBoundsException: Index: 5, Size: 5
    java.util.ArrayList.rangeCheck(Unknown Source)
    java.util.ArrayList.get(Unknown Source)
    org.boon.core.value.ValueList.get(ValueList.java:51)
    weka.classifiers.pyscript.PyScriptClassifier.distributionsForInstances(PyScriptClassifier.java:436)
    weka.gui.explorer.ClassifierPanel$18.run(ClassifierPanel.java:1450)

    at java.util.ArrayList.rangeCheck(Unknown Source)
    at java.util.ArrayList.get(Unknown Source)
    at org.boon.core.value.ValueList.get(ValueList.java:51)
    at weka.classifiers.pyscript.PyScriptClassifier.distributionsForInstances(PyScriptClassifier.java:436)
    at weka.gui.explorer.ClassifierPanel$18.run(ClassifierPanel.java:1450)
java.lang.IndexOutOfBoundsException: Index: 5, Size: 5
    java.util.ArrayList.rangeCheck(Unknown Source)
    java.util.ArrayList.get(Unknown Source)
    org.boon.core.value.ValueList.get(ValueList.java:51)
    weka.classifiers.pyscript.PyScriptClassifier.distributionsForInstances(PyScriptClassifier.java:436)
    weka.gui.explorer.ClassifierPanel$18.run(ClassifierPanel.java:1450)

    at java.util.ArrayList.rangeCheck(Unknown Source)
    at java.util.ArrayList.get(Unknown Source)
    at org.boon.core.value.ValueList.get(ValueList.java:51)
    at weka.classifiers.pyscript.PyScriptClassifier.distributionsForInstances(PyScriptClassifier.java:436)
    at weka.gui.explorer.ClassifierPanel$18.run(ClassifierPanel.java:1450)

The script I used is the following.

from __future__ import print_function
from sklearn.ensemble import RandomForestClassifier
from wekapyscript import ArffToArgs, uses

def train(args):
    X_train = args["X_train"]
    y_train = args["y_train"].flatten()
    rf = RandomForestClassifier(n_estimators=10, random_state=0)
    rf = rf.fit(X_train, y_train)
    return rf

def describe(args, model):
    return "RandomForestClassifier with %i trees" % model.n_estimators

def test(args, model):
    X_test = args["X_test"]
    return model.predict_proba(X_test).tolist()

christopher-beckham commented 8 years ago

Hi,

The problem seems to be that anneal.arff has a class with 0 instances. When the random forest classifier in scikit-learn is trained, it thinks that there are actually 5 classes instead of 6. Perhaps I should post a warning that classes with no instances in them might be a problem. What other datasets do you get this issue with? Do those datasets also have classes with no instances?
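
A quick way to check a dataset for this condition is to compare the labels present in the training data with the declared number of classes. The sketch below is only illustrative: the helper name `empty_classes` is hypothetical, and it assumes the labels arrive as class indices 0..num_classes-1 and that the declared class count is known to the caller.

```python
def empty_classes(y_train, num_classes):
    """Return the class indices that have no training instances.

    Assumption: labels are numeric class indices in 0..num_classes-1
    (possibly stored as floats), and num_classes is the count declared
    in the ARFF header.
    """
    seen = {int(label) for label in y_train}
    return [c for c in range(num_classes) if c not in seen]


# Six declared classes, but index 3 never occurs in the labels.
print(empty_classes([0, 1, 1, 2, 4, 5], 6))  # prints [3]
```

A non-empty result means the trained model will see fewer classes than WEKA expects, which is presumably what leads to the `Index: 5, Size: 5` exception above.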

chkoar commented 8 years ago

Hello Christopher,

Sorry for my delayed reply. I am getting the same message for a few other datasets. For instance, the arrhythmia, audiology and autos datasets raise the same error.

christopher-beckham commented 8 years ago

Hi,

arrhythmia.arff and autos.arff have classes with zero instances assigned to them, so you get this error (see my explanation above). audiology.arff has at least one instance for every class, so that error does not occur. However, because the dataset has missing values (which are represented in Python as NaNs), you get an error like this:

java.lang.Exception: An error happened while executing the train() function:
Traceback (most recent call last):
  File "/Users/cjb60/Desktop/weka/packages/wekaPython/resources/py/pyServer.py", line 291, in execute_script
    exec (script, _global_env)
  File "<string>", line 1, in <module>
  File "/Users/cjb60/github/weka-pyscript/wekapyscript/wekapyscript.py", line 34, in wrapped_f
    return f(*args)
  File "scikit-rf.py", line 11, in train
    rf = rf.fit(X_train, y_train)
  File "//anaconda/lib/python2.7/site-packages/sklearn/ensemble/forest.py", line 195, in fit
    X = check_array(X, dtype=DTYPE, accept_sparse="csc")
  File "//anaconda/lib/python2.7/site-packages/sklearn/utils/validation.py", line 352, in check_array
    _assert_all_finite(array)
  File "//anaconda/lib/python2.7/site-packages/sklearn/utils/validation.py", line 52, in _assert_all_finite
    " or a value too large for %r." % X.dtype)
ValueError: Input contains NaN, infinity or a value too large for dtype('float32')

Is this the error you got as well? You can fix this by dealing with the missing values, either by setting the impute option to True for PyScriptClassifier (which does mean imputation), or by processing the dataset beforehand using something like FilteredClassifier.
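
For reference, mean imputation replaces each missing value with the mean of the observed values in the same column. The sketch below only illustrates the idea and is not the code PyScriptClassifier runs; the function name `impute_column_means` and the 0.0 fallback for a column with no observed values are assumptions made for this example.

```python
import math


def impute_column_means(rows):
    """Replace NaN entries with the mean of the non-NaN values in their column.

    rows: a non-empty list of equal-length lists of floats.
    Assumption: a column with no observed values is filled with 0.0.
    """
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        present = [row[j] for row in rows if not math.isnan(row[j])]
        means.append(sum(present) / len(present) if present else 0.0)
    return [
        [means[j] if math.isnan(value) else value for j, value in enumerate(row)]
        for row in rows
    ]


nan = float("nan")
print(impute_column_means([[1.0, nan], [3.0, 4.0]]))  # prints [[1.0, 4.0], [3.0, 4.0]]
```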

chkoar commented 8 years ago

Hey, I had already used imputation in the PyScriptClassifier. These are the datasets that I had problems with, mainly because they have classes with zero instances assigned.

screenshot_1

christopher-beckham commented 8 years ago

Which of those datasets have instances assigned to all classes but still give you an error? I'm curious if there are any other errors you picked up (apart from the NaN one related to missing values).

Your best bet is to process the ARFF so that there are no classes with zero instances assigned to them. When the ARFF is converted to Numpy format (which is what happens when it passes the data to the Python script), the Numpy data structure has no knowledge of the classes with zero instances -- after all, it's just a data matrix, and doesn't have a bunch of meta-data like the ARFF format does. That means that if there are actually 10 classes in the dataset but only 5 are "visible", Scikit-Learn is going to think there are only 5 classes. I can see how this can raise issues... e.g. what if the test set has classes that the training set doesn't (though that's a bizarre case IMO). Maybe there is a way to pre-specify to Scikit-Learn that you actually know the number of classes.

I found a parameter called class_weight, maybe that would be the way to do it:

http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
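
Another possible workaround, independent of class_weight, is to widen the probability matrix inside test() so that it has one column per declared class. scikit-learn classifiers expose the labels they saw during fit() as `model.classes_`, and the columns of `predict_proba` follow that order. The sketch below is an assumption-laden illustration: `pad_probabilities` is a hypothetical helper, it assumes the labels are class indices 0..num_classes-1, and it assumes the declared class count is available to the script (the `args["num_classes"]` key in the usage comment is a guess, not confirmed here).

```python
def pad_probabilities(proba, seen_classes, num_classes):
    """Expand per-row probabilities to one column per declared class.

    proba: rows of probabilities, one column per class seen in training.
    seen_classes: class index for each column of proba (e.g. model.classes_).
    num_classes: number of classes declared in the dataset header.
    Classes that never appeared in training get probability 0.0.
    """
    padded = []
    for row in proba:
        full = [0.0] * num_classes
        for p, c in zip(row, seen_classes):
            full[int(c)] = float(p)
        padded.append(full)
    return padded


# Two classes seen in training (indices 0 and 2) out of three declared.
print(pad_probabilities([[0.25, 0.75]], [0, 2], 3))  # prints [[0.25, 0.0, 0.75]]

# Hypothetical use inside test(); the "num_classes" key is an assumption:
#     return pad_probabilities(model.predict_proba(X_test),
#                              model.classes_, args["num_classes"])
```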

chkoar commented 8 years ago

Datasets that have instances assigned to all classes but still give me an error: audiology, flags, postoperative.patient.data, shuttle.landing.control, solar.flare1, solar.flare2.

When the ARFF is converted to Numpy format (which is what happens when it passes the data to the Python script), the Numpy data structure has no knowledge of the classes with zero instances -- after all, it's just a data matrix, and doesn't have a bunch of meta-data like the ARFF format does

Sure.

christopher-beckham commented 8 years ago

What error are you getting for flags.arff? And what script are you using with it? Could you let me know what error you get for each of those datasets?