cavalab / ellyn

python-wrapped version of ellen, a linear genetic programming system for symbolic regression and classification.
http://cavalab.org/ellyn

The poor performance of M4GP for classification #7

Closed hengzhe-zhang closed 4 years ago

hengzhe-zhang commented 4 years ago

What is the best practice for using M4GP on classification problems? I trained a classifier on the "soybean" dataset and found that it performs much worse than KNN. How can I tune the parameters to make it perform better? Code:

import numpy as np
from pmlb import fetch_data
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import LabelEncoder

from ellyn import ellyn

dataset = 'soybean'
data = fetch_data(dataset, return_X_y=False)
x = data.drop('target', axis=1).values
y = data['target'].values
print('y unique:', np.unique(y))
print('x:', x)

le = LabelEncoder()
yle = le.fit_transform(y)
print('yle unique:', np.unique(yle))

X_train, X_test, y_train, y_test = train_test_split(x, yle,
                                                    test_size=0.2, random_state=0)
e = ellyn(g=200, popsize=50, classification=True, verbosity=2,
          selection='lexicase')
e.fit(X_train, y_train)
y_pred = e.predict(X_test)  # predict once and reuse the result
print(y_pred)
print('M4GP', accuracy_score(y_test, y_pred))

e = KNeighborsClassifier()
e.fit(X_train, y_train)
print('KNN', accuracy_score(y_test, e.predict(X_test)))

Result:

M4GP 0.17777777777777778
KNN 0.762962962962963
lacava commented 4 years ago

What is the best practice for using M4GP on classification problems?

Like any other classification method, the best practice is to tune the hyperparameters in the loop with cross validation. See Table 2 here: http://www.williamlacava.com/pubs/evobio_m4gp_lacava.pdf for the settings we used for hyperparameter tuning, or Table 2 here: http://www.williamlacava.com/pubs/Multiclass_GP_journal_preprint.pdf for the settings we used without hyperparameter tuning.
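A minimal sketch of that in-the-loop tuning with scikit-learn's GridSearchCV, shown here with KNeighborsClassifier on synthetic data as a stand-in estimator. Since ellyn exposes the same fit/predict interface, the same pattern applies with a grid over its own hyperparameters (e.g. popsize, g, selection); the grid values below are illustrative, not the paper's settings.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# synthetic multiclass data as a stand-in for a real dataset like soybean
X, y = make_classification(n_samples=300, n_features=10, n_informative=6,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=0)

# tune hyperparameters with 5-fold cross-validation on the training split only,
# so the held-out test set is never touched during model selection
param_grid = {'n_neighbors': [1, 3, 5, 7], 'weights': ['uniform', 'distance']}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X_train, y_train)

print('best params:', search.best_params_)
print('test accuracy:', search.score(X_test, y_test))
```

The key point is that the winning hyperparameters are chosen from cross-validation scores on the training split, and only the final refit model is evaluated on the test split.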

I trained a classifier on the "soybean" dataset and found that it performs much worse than KNN.

Ok

How can I tune the parameters to make it perform better?

You may be able to, or you may not. No one algorithm is going to be best on all datasets. In fact, we show a few datasets in our original conference paper where KNN outperforms M4GP; on other datasets, M4GP performs better than KNN.

I noticed when debugging issue #6 that M4GP finds a perfect model on the training data in the initial population. This makes soybean an unusual dataset. If you need a model for this soybean dataset and KNN works the best, use KNN.

hengzhe-zhang commented 4 years ago

Thanks for your reply. I tried another dataset, "vowel", which was used in the paper (http://www.williamlacava.com/pubs/Multiclass_GP_journal_preprint.pdf). However, the result is frustrating: it is much worse than KNN. There are many parameters in the function, so I do not know which parameters I should tune to align with those used in the paper. Code:

import os
from pathlib import Path

import numpy as np
from pmlb import fetch_data
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import LabelEncoder

from ellyn import ellyn

dataset = 'vowel'
local_dir = os.path.join(Path.home(), "pmlb_dataset")

data = fetch_data(dataset, return_X_y=False, local_cache_dir=local_dir)
x = data.drop('target', axis=1).values
y = data['target'].values

print('y unique:', np.unique(y))
print('x:', x)

le = LabelEncoder()
yle = le.fit_transform(y)
print('yle unique:', np.unique(yle))

X_train, X_test, y_train, y_test = train_test_split(x, yle,
                                                    test_size=0.2, random_state=0)

e = ellyn(g=200, popsize=500, classification=True, verbosity=0,
          selection='afp')
e.fit(X_train, y_train)
y_pred = e.predict(X_test)  # predict once and reuse the result
print(y_pred)
print('M4GP', accuracy_score(y_test, y_pred))

e = KNeighborsClassifier()
e.fit(X_train, y_train)
print('KNN', accuracy_score(y_test, e.predict(X_test)))

e = RandomForestClassifier()
e.fit(X_train, y_train)
print('RF', accuracy_score(y_test, e.predict(X_test)))

Result:

M4GP 0.30303030303030304
KNN 0.8939393939393939
RF 0.9696969696969697
lacava commented 4 years ago

hm, that looks like a bug. let me dig into it more.

lacava commented 4 years ago

hi @zhenlingcn, I figured out what's going on. the short fix to your issue is to specify a fit_type in ellyn. this can be fit_type = 'MAE', fit_type = 'F1' (F1 score), or fit_type = 'F1W' (weighted F1). just add this to the ellyn() initialization and that should fix your issue.
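Applied to the vowel script above, the short fix is just the extra keyword argument in the initialization (a sketch; the other settings are unchanged from the script, and the fit_type options are those the maintainer lists):

```python
from ellyn import ellyn

# specify fit_type explicitly so fitness is assigned during evolution;
# options per the maintainer: 'MAE', 'F1' (F1 score), or 'F1W' (weighted F1)
e = ellyn(g=200, popsize=500, classification=True, verbosity=0,
          selection='afp', fit_type='F1')
```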

the long fix is to have better default behavior. If this parameter isn't specified, a sensible default should be set; instead it defaults to not assigning fitness, which makes the run end after 0 generations. i will be issuing a PR to fix this momentarily.

hengzhe-zhang commented 4 years ago

Well! After manually specifying the fit type, M4GP achieves a reasonable result. Thanks a lot!