chkoar opened 8 years ago
Hi,
The problem seems to be that anneal.arff has a class with 0 instances. When the random forest classifier in Scikit is trained, it thinks that there are actually 5 classes instead of 6. Perhaps I should post a warning that classes with no instances in them might be a problem. What other datasets do you get this issue with? Do those other datasets have classes with no instances in them?
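To make the failure mode concrete, here's a minimal sketch (made-up toy data, not the actual anneal.arff) of what Scikit-Learn sees when one declared class never occurs:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy data: imagine the ARFF header declares 6 class values (0..5),
# but no instance is ever labelled 5.
X = np.random.rand(10, 4)
y = np.array([0, 1, 2, 3, 4] * 2)

rf = RandomForestClassifier(n_estimators=10).fit(X, y)

# scikit-learn infers the label set from y alone, so the empty
# sixth class does not exist as far as the model is concerned:
print(rf.classes_)  # [0 1 2 3 4] -- five classes, not six
```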
Hello Christopher,
Sorry for my delayed reply. I am getting the same message for a few other datasets. For instance, the arrhythmia, audiology and autos datasets raise the same error.
Hi,
arrhythmia.arff and autos.arff have classes with zero instances assigned to them, so you get this error (see my explanation above). audiology.arff has at least one instance for every class, so there is no error there. However, because the dataset has missing values (which are represented in Python as NaNs), you get an error like this:
```
java.lang.Exception: An error happened while executing the train() function:
Traceback (most recent call last):
  File "/Users/cjb60/Desktop/weka/packages/wekaPython/resources/py/pyServer.py", line 291, in execute_script
    exec (script, _global_env)
  File "<string>", line 1, in <module>
  File "/Users/cjb60/github/weka-pyscript/wekapyscript/wekapyscript.py", line 34, in wrapped_f
    return f(*args)
  File "scikit-rf.py", line 11, in train
    rf = rf.fit(X_train, y_train)
  File "//anaconda/lib/python2.7/site-packages/sklearn/ensemble/forest.py", line 195, in fit
    X = check_array(X, dtype=DTYPE, accept_sparse="csc")
  File "//anaconda/lib/python2.7/site-packages/sklearn/utils/validation.py", line 352, in check_array
    _assert_all_finite(array)
  File "//anaconda/lib/python2.7/site-packages/sklearn/utils/validation.py", line 52, in _assert_all_finite
    " or a value too large for %r." % X.dtype)
ValueError: Input contains NaN, infinity or a value too large for dtype('float32')
```
Is this the error you got as well? You can fix this by dealing with the missing values, either by setting the impute option to True for PyScriptClassifier (which does mean imputation), or by processing the dataset beforehand using something like FilteredClassifier.
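For reference, the mean imputation is roughly equivalent to doing something like this on the Python side yourself (SimpleImputer is the current scikit-learn name; older versions expose it as sklearn.preprocessing.Imputer):

```python
import numpy as np
from sklearn.impute import SimpleImputer  # sklearn.preprocessing.Imputer in older versions
from sklearn.ensemble import RandomForestClassifier

# Toy matrix with a NaN, which rf.fit() would otherwise reject.
X_train = np.array([[1.0, 2.0],
                    [np.nan, 3.0],
                    [7.0, 6.0]])
y_train = np.array([0, 1, 0])

# Replace each NaN with its column mean before fitting -- roughly
# what PyScriptClassifier's impute=True option does for you.
X_train = SimpleImputer(strategy="mean").fit_transform(X_train)
rf = RandomForestClassifier(n_estimators=10).fit(X_train, y_train)
```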
Hey, I had already used the imputation in the PyScriptClassifier. These are the datasets that I had problems with, mainly because they have classes with zero instances assigned.
Which of those datasets have instances assigned to all classes but still give you an error? I'm curious if there are any other errors you picked up (apart from the NaN one related to missing values).
Your best bet is to process the ARFF so that there are no classes with zero instances assigned to them. When the ARFF is converted to Numpy format (which is what happens when it passes the data to the Python script), the Numpy data structure has no knowledge of the classes with zero instances -- after all, it's just a data matrix, and doesn't have a bunch of meta-data like the ARFF format does. That means that if there are actually 10 classes in the dataset but only 5 are "visible", Scikit-Learn is going to think there are only 5 classes. I can see how this can raise issues... e.g. what if the test set has classes that the training set doesn't (though that's a bizarre case IMO). Maybe there is a way to pre-specify to Scikit-Learn that you actually know the number of classes.
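As a rough sketch of what that pre-processing could look like in Python (this uses the third-party liac-arff package, assumes the class attribute is the last one as is typical for these datasets, and the output filename is just an example):

```python
import arff  # pip install liac-arff

dataset = arff.load(open("anneal.arff"))

# Keep only the class values that actually occur in the data,
# so the header no longer declares zero-instance classes.
seen = set(row[-1] for row in dataset["data"])
name, declared = dataset["attributes"][-1]
dataset["attributes"][-1] = (name, [v for v in declared if v in seen])

arff.dump(dataset, open("anneal-fixed.arff", "w"))
```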
I found a parameter called class_weight, maybe that would be the way to do it: http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
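It would be used like this, though from what I can tell class_weight is keyed by the labels actually seen in y, so it reweights existing classes rather than declaring extra ones:

```python
from sklearn.ensemble import RandomForestClassifier

# class_weight maps class label -> weight; scikit-learn validates
# these labels against the ones present in y, so it weights existing
# classes rather than adding new ones.
rf = RandomForestClassifier(n_estimators=10,
                            class_weight={0: 1.0, 1: 2.0, 2: 1.0})
```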
Datasets that have instances assigned to all classes but still give me an error: audiology, flags, postoperative.patient.data, shuttle.landing.control, solar.flare1, solar.flare2.
> When the ARFF is converted to Numpy format (which is what happens when it passes the data to the Python script), the Numpy data structure has no knowledge of the classes with zero instances -- after all, it's just a data matrix, and doesn't have a bunch of meta-data like the ARFF format does
Sure.
What error are you getting for flags.arff? And what script are you using with it? Could you let me know what error you get for each of those datasets?
On some datasets, e.g. anneal, I am getting this message.
The script I used is the following.