EpistasisLab / tpot

A Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming.
http://epistasislab.github.io/tpot/
GNU Lesser General Public License v3.0

How to use a custom loss function for regression #591

Closed · lesshaste closed this issue 7 years ago

lesshaste commented 7 years ago

I am experimenting with TPOT for regression with a custom loss function. As a toy experiment, I test how well it can estimate the permanent of a matrix under particular circumstances: my code samples many submatrices of a larger matrix and trains on them. Unlike a normal regression problem, at test time I want to optimize the correlation coefficient between the predicted and true values, which is where the custom loss function comes in.

Context of the issue

I have specified a simple loss function based on scipy.stats.pearsonr, modified only so that the value lies between 0 and 1 and so that it is a minimization problem.

Process to reproduce the issue

If you run the following code you will see:

Generation 1 - Current best internal CV score: 1.0
Generation 2 - Current best internal CV score: 1.0
Generation 3 - Current best internal CV score: 1.0

and so on. A score of 1.0 is the worst possible loss; in other words, it apparently does as badly as it could possibly do. You get something better if you replace the custom loss function with any of the standard built-in ones, so some optimization is possible for this problem.
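For reference, a built-in scorer can be passed by name. A minimal sketch; the scoring string 'neg_mean_squared_error' is a standard scikit-learn scorer name used here as an assumed example, not one named in the issue:

tpot = TPOTRegressor(scoring='neg_mean_squared_error', generations=10,
                     population_size=40, verbosity=2, n_jobs=4)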

In the code, the value 4000 should be increased to 40,000 or larger, but I have kept it small so that the example doesn't take too long to run.

import sys, math
import numpy as np
from scipy.stats import ortho_group
from scipy.stats import pearsonr
from tpot import TPOTRegressor
from sklearn.model_selection import train_test_split

def npperm(M):
    # Permanent of a square matrix via a Glynn-type formula with
    # Gray-code updates: O(2**(n-1) * n) instead of the naive O(n! * n).
    n = M.shape[0]
    d = np.ones(n)        # current +/-1 sign vector (delta)
    j = 0
    s = 1                 # sign of the current term
    f = np.arange(n)      # Gray-code bookkeeping: next index to flip
    v = M.sum(axis=0)     # delta-weighted column sums (delta all +1)
    p = np.prod(v)
    while (j < n-1):
        v -= 2*d[j]*M[j]  # flipping delta_j updates the column sums incrementally
        d[j] = -d[j]
        s = -s
        prod = np.prod(v)
        p += s*prod
        f[0] = 0
        f[j] = f[j+1]
        f[j+1] = j+1
        j = f[0]
    return p/2**(n-1)
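
# Illustrative sanity check (an editorial addition, not from the issue):
# the permanent of the 2x2 all-ones matrix is 2; of any identity matrix, 1.
assert abs(npperm(np.ones((2, 2))) - 2.0) < 1e-12
assert abs(npperm(np.eye(3)) - 1.0) < 1e-12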

# Define loss function to be between 0 and 1 where 0 is the best and 1 is the
# worst for optimisation.
def correlation_coefficient(y_true, y_pred):
    pearson_r, _ = pearsonr(y_pred, y_true)
    return 1-pearson_r**2

dimension = 8

print("Making the input data using seed 7", file=sys.stderr)
np.random.seed(7)
U = ortho_group.rvs(dimension**2)
U = U[:, :dimension]
# U is the first `dimension` columns of a random orthogonal matrix,
# i.e. a (dimension**2) x dimension matrix with orthonormal columns.
X = []
Y = []
print(U)
for i in range(4000):
    # Sample `dimension` rows of U (with replacement) and sort them into
    # a canonical order so that the row ordering doesn't matter.
    I = np.random.choice(dimension**2, size=dimension)
    A = U[I][np.lexsort(np.rot90(U[I]))]
    X.append(A.ravel())
    Y.append(math.log(npperm(A)**2, 2))  # target: log2 of the squared permanent

X = np.array(X)
Y = np.array(Y)

# Split into training and testing
X_train, X_test, y_train, y_test = train_test_split(X, Y,
                                                    train_size=0.75, test_size=0.25)

# Create model
tpot = TPOTRegressor(scoring=correlation_coefficient, generations=10,
                     population_size=40, verbosity=2, n_jobs=4)
tpot.fit(X_train, y_train)

1. User creates a TPOT instance with the custom scoring function
2. User calls TPOT's fit() function with the training data
3. The internal CV score stays at 1.0 for every generation

Expected result

I expect the internal CV score to come down from 1.0.

Current result

The internal CV score is stuck at 1.0.

Possible fix

I feel I must be using a custom loss function incorrectly. How should I have done it?

weixuanfu commented 7 years ago

We are working on a new API for scoring functions in TPOT, related to issue #579. For now, could you please try putting 'loss' or 'error' into the scoring function's name? That makes TPOT set greater_is_better=False in the make_scorer call (from this line).
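A minimal sketch of this workaround, assuming the name-based convention described above (the renamed function correlation_coefficient_loss is hypothetical):

from scipy.stats import pearsonr
from tpot import TPOTRegressor

def correlation_coefficient_loss(y_true, y_pred):
    # Same loss as in the report; "loss" in the name tells TPOT to treat
    # it as a quantity to minimize (greater_is_better=False).
    pearson_r, _ = pearsonr(y_pred, y_true)
    return 1 - pearson_r**2

tpot = TPOTRegressor(scoring=correlation_coefficient_loss, generations=10,
                     population_size=40, verbosity=2, n_jobs=4)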

rhiever commented 7 years ago

I think the issue here is the following:

Define loss function to be between 0 and 1 where 0 is the best and 1 is the worst for optimisation.

Currently, TPOT assumes that any custom scoring function is to be maximized (i.e., 1 is best and 0 is worst) unless it has 'loss' or 'error' in its name. Thus, I would keep everything the same but change return 1-pearson_r**2 to return pearson_r**2.
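In code, the suggested one-line change (sketch):

def correlation_coefficient(y_true, y_pred):
    pearson_r, _ = pearsonr(y_pred, y_true)
    return pearson_r**2  # maximized by TPOT: 1 is best, 0 is worst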

As @weixuanfu mentioned, we have some scoring API changes in the works, but this issue can be resolved in the latest release without any TPOT code changes.

lesshaste commented 7 years ago

Thanks so much! That fixes it indeed (I just changed the scoring function to return pearson_r**2). My next challenge is to get the score above 0.15.