ClimbsRocks / auto_ml

[UNMAINTAINED] Automated machine learning for analytics & production
http://auto-ml.readthedocs.io
MIT License

Multi-label classification with feature learning throws an error #217

Open calz1 opened 7 years ago

calz1 commented 7 years ago

I was trying out auto_ml for image recognition using the sklearn digits dataset, which is basically 64 pixel values per hand-drawn digit image. This code works fine and produces a trained model I can score.

import pandas as pd
from auto_ml import Predictor
from sklearn.model_selection import train_test_split
from sklearn import datasets

# Import the digits dataset
digits = datasets.load_digits()

# To apply a classifier on this data, we need to flatten each 8x8 image
# into a row, turning the data into a (samples, features) matrix:
n_samples = len(digits.images)
data = digits.images.reshape((n_samples, -1))

# Build a DataFrame of pixel features and attach the target column
df = pd.DataFrame(data)
df['target'] = digits.target

column_descriptions = {
    'target': 'output'
}

# Hold out 80% for testing, then split the remaining rows in half,
# reserving one half (df_fl) for feature learning
df_train, df_test = train_test_split(df, test_size=0.8, random_state=42)
df_train, df_fl = train_test_split(df_train, test_size=0.5, random_state=42)

ml_predictor = Predictor(type_of_estimator='classifier', column_descriptions=column_descriptions)

ml_predictor.train(df_train, model_names='XGBClassifier')

but if you change the last line to:

ml_predictor.train(df_train, model_names='XGBClassifier', feature_learning=True, fl_data=df_fl)

then I get the error below. Not sure if it's relevant, but the feature-learning epochs didn't seem to go well either: accuracy dropped to only 0.1 by epoch 2.

Found 0 null or infinity values in the y values. We will ignore these, and report the score on the rest of the dataset
Warning: We have found some values in the predicted probabilities that fall outside the range {0, 1}
This is likely the result of a model being trained on too little data, or with a bad set of hyperparameters. If you get this warning while doing a hyperparameter search, for instance, you can probably safely ignore it
We will cap those values at 0 or 1 for the purposes of scoring, but you should be careful to have similar safeguards in place in prod if you use this model

ValueError                                Traceback (most recent call last)
/usr/lib/python3.6/site-packages/auto_ml/utils_scoring.py in score(self, estimator, X, y, advanced_scoring)
    302         try:
--> 303             score = self.scoring_func(y, predictions)
    304         except ValueError as e:

/usr/lib64/python3.6/site-packages/sklearn/metrics/classification.py in accuracy_score(y_true, y_pred, normalize, sample_weight)
    171     # Compute accuracy for each possible representation
--> 172     y_type, y_true, y_pred = _check_targets(y_true, y_pred)
    173     if y_type.startswith('multilabel'):

/usr/lib64/python3.6/site-packages/sklearn/metrics/classification.py in _check_targets(y_true, y_pred)
     81         raise ValueError("Can't handle mix of {0} and {1}"
---> 82                          "".format(type_true, type_pred))
     83

ValueError: Can't handle mix of multiclass and continuous-multioutput

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
/usr/lib/python3.6/site-packages/auto_ml/utils_scoring.py in score(self, estimator, X, y, advanced_scoring)
    314         try:
--> 315             score = self.scoring_func(y, predictions)
    316         except ValueError:

/usr/lib64/python3.6/site-packages/sklearn/metrics/classification.py in accuracy_score(y_true, y_pred, normalize, sample_weight)
    171     # Compute accuracy for each possible representation
--> 172     y_type, y_true, y_pred = _check_targets(y_true, y_pred)
    173     if y_type.startswith('multilabel'):

/usr/lib64/python3.6/site-packages/sklearn/metrics/classification.py in _check_targets(y_true, y_pred)
     81         raise ValueError("Can't handle mix of {0} and {1}"
---> 82                          "".format(type_true, type_pred))
     83

ValueError: Can't handle mix of multiclass and continuous-multioutput

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
<ipython-input> in <module>()
      1 # Score on test data
----> 2 ml_predictor.score(df_test, df_test.target)

/usr/lib/python3.6/site-packages/auto_ml/predictor.py in score(self, X_test, y_test, advanced_scoring, verbose)
   1070             return self._scorer.score(y_test, predictions)
   1071         elif advanced_scoring:
-> 1072             score, probas = self._scorer.score(self.trained_pipeline, X_test, y_test, advanced_scoring=advanced_scoring)
   1073             utils_scoring.advanced_scoring_classifiers(probas, y_test, name=self.name)
   1074             return score

/usr/lib/python3.6/site-packages/auto_ml/utils_scoring.py in score(self, estimator, X, y, advanced_scoring)
    316         except ValueError:
    317             # Sometimes, particularly for a badly fit model using either too little data, or a really bad set of hyperparameters during a grid search, we can predict probas that are > 1 or < 0. We'll cap those here, while warning the user about them, because they're unlikely to occur in a model that's properly trained with enough data and reasonable params
--> 318             predictions = self.clean_probas(predictions)
    319             score = self.scoring_func(y, predictions)
    320

/usr/lib/python3.6/site-packages/auto_ml/utils_scoring.py in clean_probas(self, probas)
    272         print('We will cap those values at 0 or 1 for the purposes of scoring, but you should be careful to have similar safeguards in place in prod if you use this model')
    273         if not isinstance(probas[0], list):
--> 274             probas = [min(max(pred, 0), 1) for pred in probas]
    275             return probas
    276         else:

/usr/lib/python3.6/site-packages/auto_ml/utils_scoring.py in <listcomp>(.0)
    272         print('We will cap those values at 0 or 1 for the purposes of scoring, but you should be careful to have similar safeguards in place in prod if you use this model')
    273         if not isinstance(probas[0], list):
--> 274             probas = [min(max(pred, 0), 1) for pred in probas]
    275             return probas
    276         else:

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
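
For reference, both failures can be reproduced in isolation: sklearn's accuracy_score rejects a 2D per-class probability matrix as "continuous-multioutput", and Python's builtin max() raises the "truth value is ambiguous" error when handed a numpy row instead of a scalar, which seems to be what happens inside clean_probas. A minimal sketch (the shapes here are just assumptions matching the 10-class digits dataset, not taken from auto_ml internals):

import numpy as np
from sklearn.metrics import accuracy_score

# Multiclass labels, like the digits target column
y_true = np.array([3, 1, 4, 1, 5])

# What predict_proba returns for a 10-class problem: one row of
# per-class probabilities per sample, i.e. "continuous-multioutput"
probas = np.random.rand(5, 10)

try:
    accuracy_score(y_true, probas)
except ValueError as e:
    print(e)  # Can't handle mix of multiclass and continuous-multioutput

# clean_probas then tries min(max(pred, 0), 1) on each element; for a
# multiclass model each "pred" is a whole numpy row, and max() comparing
# an array against a scalar triggers an ambiguous boolean check
row = probas[0]
try:
    max(row, 0)
except ValueError as e:
    print(e)  # The truth value of an array with more than one element is ambiguous...
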
ClimbsRocks commented 7 years ago

Thanks for the great bug report! Very comprehensive info, which is quite useful.

Unfortunately, multiclass classification just isn't supported by feature_learning yet. It might be as simple as modifying make_deep_learning_classifier() to take a num_output_classes param (defaulting to 2 for binary classification), roughly along the lines of the sketch below. It wouldn't be too hard to find the number of classes earlier in the process and feed it in. I have a feeling that properly supporting multilabel classification for deep learning might be slightly more involved than that, but it might be that simple for feature_learning.
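
To make the idea concrete, here's a rough sketch of what that change could look like. This is not auto_ml's actual make_deep_learning_classifier() signature, just a hypothetical Keras model builder showing how a num_output_classes param would switch the final layer and loss between binary and multiclass:

from keras.models import Sequential
from keras.layers import Dense

def make_deep_learning_classifier(num_input_features, num_output_classes=2):
    # Hidden layers stay the same regardless of the number of classes
    model = Sequential()
    model.add(Dense(64, activation='relu', input_dim=num_input_features))
    model.add(Dense(64, activation='relu'))

    if num_output_classes == 2:
        # Binary classification: a single sigmoid output
        model.add(Dense(1, activation='sigmoid'))
        model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    else:
        # Multiclass: one softmax output per class, with one-hot-encoded targets
        model.add(Dense(num_output_classes, activation='softmax'))
        model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

    return model

# The class count could be computed earlier in the pipeline, e.g.:
# num_output_classes = df_train['target'].nunique()
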

I'm first focusing on .predict_uncertainty() and some better analytics (see https://github.com/ClimbsRocks/auto_ml/issues/218 for an idea I have to get linear-model-style interpretation from much more accurate tree-based models). But if you want to take a whack at this, I'd love to see how it goes, and will happily provide support.

ClimbsRocks commented 7 years ago

@calz1 I also just updated the README to note this. It was a silent failure point before, so thanks for flagging it and giving me the chance to make the docs clearer.