HunterMcGushion / hyperparameter_hunter

Easy hyperparameter optimization and automatic result saving across machine learning algorithms and libraries
MIT License

logloss issue with multiclass task #197

Open · chwang1991 opened this issue 5 years ago

chwang1991 commented 5 years ago

Hi HunterMcGushion,

I am working on a multiclass classification task and I want to use sklearn.metrics.log_loss as the experiment metric, but I've run into a problem:

from hyperparameter_hunter import Environment, CVExperiment
from sklearn import metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold

env = Environment(
    results_path=HPHPATH,
    train_dataset=df, 
    target_column='Quality_label',
    metrics=['log_loss'],
#     metrics=dict(logloss=lambda y_true, y_pred: metrics.log_loss(y_true, y_pred, labels=[0,1,2,3])),
    cv_type=StratifiedKFold,
    cv_params=dict(n_splits=6, shuffle=True),
#     global_random_seed=seed
)

experiment = CVExperiment(
    model_initializer=RandomForestClassifier,
    model_init_params=dict(
        n_estimators=20
    )
)

As you can see, the target has 4 labels, 0 to 3. When I run the code above, it raises a ValueError:

ValueError: y_true and y_pred contain different number of classes 4, 2.

If I set labels for the log_loss metric, metrics=dict(logloss=lambda y_true, y_pred: metrics.log_loss(y_true, y_pred, labels=[0,1,2,3])), it throws another error:

ValueError: The number of classes in labels is different from that in y_pred.

I checked the examples and previous issues like #90, and I wonder: have you tested log_loss on a multiclass task?
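
For reference, both errors can be reproduced with plain sklearn, outside of hyperparameter_hunter, by passing 1-d label-encoded predictions where probabilities are expected. A minimal sketch with dummy arrays (matching the sklearn of that era; newer versions may complain about y_pred values outside [0, 1] first):

import numpy as np
from sklearn.metrics import log_loss

y_true = np.array([0, 1, 2, 3, 0, 1, 2, 3])
y_pred = np.array([0, 1, 2, 3, 0, 1, 2, 3])  # 1-d label-encoded predictions, not probabilities

# log_loss treats a 1-d y_pred as binary probabilities of the positive class, hence:
# ValueError: y_true and y_pred contain different number of classes 4, 2.
log_loss(y_true, y_pred)

# Passing labels doesn't help, because y_pred is still expanded to only 2 columns:
# ValueError: The number of classes in labels is different from that in y_pred.
log_loss(y_true, y_pred, labels=[0, 1, 2, 3])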

HunterMcGushion commented 5 years ago

Thanks for opening this! Sorry I don't have a more helpful update, but I just wanted to say that I'm looking into this, and I do think there's a bug here. I'm making some regression tests using the Iris dataset, with a single label-encoded target column, so I'll probably ask you to try to reproduce some of my results when I'm further along in the bug hunt.

In the meantime, have you tried adjusting the do_predict_proba kwarg of your Environment? Are you expecting log_loss to be called with a single column of label-encoded predictions, or four columns of class probabilities? Because I believe the former won't work, as log_loss automatically assumes a 1-dimensional y_pred to be binary...
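
To make the distinction concrete, here's a minimal sketch with plain sklearn (no hyperparameter_hunter involved):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import log_loss

x, y = make_classification(n_samples=100, n_classes=4, n_informative=10)
clf = RandomForestClassifier(n_estimators=10).fit(x, y)

proba = clf.predict_proba(x)  # shape (100, 4): one probability column per class
log_loss(y, proba)            # works: multiclass log_loss wants class probabilities

labels = clf.predict(x)       # shape (100,): label-encoded class predictions
log_loss(y, labels)           # raises a ValueError: a 1-d y_pred is assumed to be binary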

Like I said, I need to investigate some more, but I'd really appreciate you commenting any of your findings here!

Edit: Thanks for looking for related issues, as well!

chwang1991 commented 5 years ago

Thanks for your quick reply!

Sure, I tried setting do_predict_proba=True, but it didn't help. It seems to refuse multi-column predictions for some reason.

I have to say log_loss is a bit special, because it expects a y_pred of shape (n_samples, n_classes), while the other examples you tested before, I guess, forced y_pred to be a single column.

Here is the code I used to test:

import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics
from hyperparameter_hunter import Environment, CVExperiment, BayesianOptPro

# make a toy dataset
x, y = make_classification(n_samples=1000, n_classes=4, n_informative=10)
train_df = pd.DataFrame(x, columns=range(x.shape[1]))
train_df["y"] = y

'''
TEST 1
metrics=["log_loss"]
do_predict_proba=False

ValueError: y_true and y_pred contain different number of classes 4, 2. 
Please provide the true labels explicitly through the labels argument. Classes found in y_true: [0 1 2 3]
'''
env1 = Environment(
    train_dataset=train_df,
    results_path="HyperparameterHunterAssets",
    target_column="y",
    metrics=["log_loss"],
    do_predict_proba=False,
    cv_type="StratifiedKFold",
    cv_params=dict(n_splits=5, shuffle=True, random_state=32),
    verbose=1,
)

'''
TEST 2
metrics=dict(logloss=lambda y_true, y_pred: metrics.log_loss(y_true, y_pred, labels=[0,1,2,3]))
do_predict_proba=False

ValueError: The number of classes in labels is different from that in y_pred. 
Classes found in labels: [0 1 2 3]
'''
env2 = Environment(
    train_dataset=train_df,
    results_path="HyperparameterHunterAssets",
    target_column="y",
    metrics=dict(logloss=lambda y_true, y_pred: metrics.log_loss(y_true, y_pred, labels=[0,1,2,3])),
    do_predict_proba=False,
    cv_type="StratifiedKFold",
    cv_params=dict(n_splits=5, shuffle=True, random_state=32),
    verbose=1,
)

'''
TEST 3
metrics=["log_loss"]
do_predict_proba=True

ValueError: Wrong number of items passed 4, placement implies 1
'''
env3 = Environment(
    train_dataset=train_df,
    results_path="HyperparameterHunterAssets",
    target_column="y",
    metrics=["log_loss"],
    do_predict_proba=True,
    cv_type="StratifiedKFold",
    cv_params=dict(n_splits=5, shuffle=True, random_state=32),
    verbose=1,
)

'''
TEST 4
metrics=dict(logloss=lambda y_true, y_pred: metrics.log_loss(y_true, y_pred, labels=[0,1,2,3]))
do_predict_proba=True

ValueError: Wrong number of items passed 4, placement implies 1
'''
env4 = Environment(
    train_dataset=train_df,
    results_path="HyperparameterHunterAssets",
    target_column="y",
    metrics=dict(logloss=lambda y_true, y_pred: metrics.log_loss(y_true, y_pred, labels=[0,1,2,3])),
    do_predict_proba=True,
    cv_type="StratifiedKFold",
    cv_params=dict(n_splits=5, shuffle=True, random_state=32),
    verbose=1,
)

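# NOTE: the four TESTs above were each run on their own; when several Environments
# exist, CVExperiment should pick up the most recently created (active) one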
experiment = CVExperiment(
    model_initializer=RandomForestClassifier,
    model_init_params=dict(n_estimators=10),
)
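
A side note on TESTs 3 and 4: "Wrong number of items passed 4, placement implies 1" looks like a pandas error from trying to store the four predict_proba columns in a single prediction column. If that's where it originates, it can be sketched like this (the exact message varies by pandas version):

import numpy as np
import pandas as pd

preds = pd.DataFrame(index=range(5), columns=["y"])
proba = np.random.rand(5, 4)  # (n_samples, n_classes) predict_proba output

# ValueError: Wrong number of items passed 4, placement implies 1
preds["y"] = proba
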
HunterMcGushion commented 5 years ago

@chwang1991,

Thanks for posting your sample code! It's very helpful! Sorry for the delay, but I've been busy with other things lately. I'm looking at this issue again today, and I have to agree with you: log_loss does seem rather weird. Although I may just be thinking that because I haven't experimented much with other metrics.

Do you know of any other metrics that behave similarly or might cause other problems? Also, do you think that another Environment kwarg might be necessary to clear up behavior in situations like this? do_predict_proba seems like half of the solution... But I'm thinking we need one kwarg to declare how predictions should be passed to metrics, then a second to declare how predictions should be saved in a situation like this. I'd love to hear your thoughts!
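
Purely to sketch that idea (nothing below exists yet; both commented kwarg names are hypothetical):

env = Environment(
    train_dataset=train_df,
    results_path="HyperparameterHunterAssets",
    target_column="y",
    metrics=["log_loss"],
    do_predict_proba=True,        # existing kwarg: call predict_proba instead of predict
    # metrics_use_proba=True,     # hypothetical: pass all proba columns to metrics
    # save_predictions="proba",   # hypothetical: save the full (n_samples, n_classes) proba
)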

chwang1991 commented 5 years ago

@HunterMcGushion

Sorry for my late reply, I've been a bit busy these days...

I think there are only two metrics that accept multi-column predicted probabilities: log_loss and hinge_loss. I think do_predict_proba is enough, as you already indicated in the documentation:

If True, it will call models.Model.model.predict_proba(), and the values in all columns will be used as the actual prediction values

I know that in most cases there is no need to take probabilities into consideration, and log_loss is mostly applied as a loss function rather than as a metric, but in my recent case I have to evaluate how much "confidence" the model has in its results so I can improve it.
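
For what it's worth, hinge_loss takes its multi-column input from decision_function rather than predict_proba, but the shape requirement is the same. A quick sketch:

from sklearn.datasets import make_classification
from sklearn.metrics import hinge_loss
from sklearn.svm import LinearSVC

x, y = make_classification(n_samples=200, n_classes=4, n_informative=10)
svc = LinearSVC().fit(x, y)

decision = svc.decision_function(x)  # shape (200, 4): one decision value per class
print(hinge_loss(y, decision, labels=[0, 1, 2, 3]))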