Fail to run M4GP on some datasets

hengzhe-zhang commented 4 years ago

M4GP crashes when I run the following code, and no error message is printed. Classifying the "iris" dataset works fine, but the same operation fails on the "soybean" dataset. What should I do?

import os
from pathlib import Path

from pmlb import fetch_data
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

from ellyn import ellyn

# iris = load_iris()
# x = iris.data
# y = iris.target

dataset = 'soybean'
local_dir = os.path.join(Path.home(), "pmlb_dataset")
data = fetch_data(dataset, return_X_y=False, local_cache_dir=local_dir)
x = data.drop('target', axis=1).values
y = data['target'].values
print(x.shape)

X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
e = ellyn(g=5, popsize=5, classification=True)
e.fit(X_train, y_train)
print(e.predict(X_test))
print(accuracy_score(y_test, e.predict(X_test)))

lacava commented 4 years ago

Thanks for reporting this. Could you add the argument verbosity=2 to the ellyn object and paste the output of running this code here?

hengzhe-zhang commented 4 years ago

params
==========
verbosity : 2
classification : True
popsize : 5
g : 5
scoring_function : <function accuracy_score at 0x7fae670b61e0>
random_state : 0
selection : tournament
best_estimator_ : []
hof : []
return_pop : False
class_m4gp : True
sel : 1
_______________________________________________________________________________ 
                                    ellenGP                                     
_______________________________________________________________________________ 
Results Path: /tmp/pycharm_project_606
parameter name: ellenGP
data file: d
Settings: 
Evolutionary Method: Standard Tournament
ERCs on
Total Population Size: 5
Maximum Generations: 5
Number of log points: 0 (0 means log all points)
fitness type: MSE
verbosity: 2
Number of threads: 64

hengzhe-zhang commented 4 years ago

Any solution? This strange bug is preventing me from using M4GP in my application.

lacava commented 4 years ago

Sorry for the delay, I'll try to work on it at the end of this week.

lacava commented 4 years ago

Hi @zhenlingcn, the problem is that soybean has 17 classes, but class 13 is missing, so it is encoded as 18 classes. This causes a segfault in ellyn: all class labels have to be contiguous and present. Here's a solution:

import os
from pathlib import Path

import numpy as np
from pmlb import fetch_data
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

from ellyn import ellyn

# iris = load_iris()
# x = iris.data
# y = iris.target

dataset = 'soybean'
# local_dir = os.path.join(Path.home(), "pmlb_dataset")
data = fetch_data(dataset, return_X_y=False)  # , local_cache_dir=local_dir)
x = data.drop('target', axis=1).values
y = data['target'].values
print('y unique:', np.unique(y))
print('x:', x)

# Re-encode the labels as contiguous integers 0..n_classes-1;
# a gap in the class labels makes ellyn segfault.
le = LabelEncoder()
yle = le.fit_transform(y)
print('yle unique:', np.unique(yle))

X_train, X_test, y_train, y_test = train_test_split(
    x, yle, test_size=0.2, random_state=0)
e = ellyn(g=5, popsize=5, classification=True, verbosity=2)
e.fit(X_train, y_train)
# Predict once and reuse the result for printing and scoring.
y_pred = e.predict(X_test)
print(y_pred)
print(accuracy_score(y_test, y_pred))
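
The contiguity requirement can also be checked before calling fit. The following is a minimal, dependency-free sketch of that check. The helper names `missing_labels` and `encode_contiguous` are hypothetical and are not part of ellyn or scikit-learn, and the toy label list only stands in for the real target column. The remapping mirrors what `LabelEncoder.fit_transform` does: the sorted unique labels are mapped to 0..k-1.

```python
def missing_labels(y):
    """Return the integer labels absent from the range 0..max(y)."""
    present = set(int(v) for v in y)
    return sorted(set(range(max(present) + 1)) - present)


def encode_contiguous(y):
    """Map labels to 0..k-1 in sorted order; return (encoded, classes)."""
    classes = sorted(set(y))
    index = {label: i for i, label in enumerate(classes)}
    return [index[label] for label in y], classes


# Toy labels with a gap at 2 (an assumption for illustration only).
y = [0, 1, 3, 4, 1, 3]
print('missing:', missing_labels(y))

encoded, classes = encode_contiguous(y)
print('encoded:', encoded)

# classes[i] recovers the original label for an encoded prediction i.
decoded = [classes[i] for i in encoded]
print('round trip ok:', decoded == y)
```

If `missing_labels` returns a non-empty list, the labels should be re-encoded before they are passed to fit. Predictions made on the encoded labels can be mapped back with `classes[i]`, or with `le.inverse_transform` when `LabelEncoder` is used.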

lacava commented 4 years ago

@zhenlingcn let me know if this fixes your issue and I'll mark this as closed.

hengzhe-zhang commented 4 years ago

The issue is solved, thank you very much!

cavalab / ellyn

Fail to run M4GP on some datasets #6