dssg / eights

Data Science template with focus on prewritten workflows

simple error, corner case #35

Closed johnsanterre closed 8 years ago

johnsanterre commented 9 years ago

>>> import numpy as np; import eights as e; from sklearn.ensemble import RandomForestClassifier
>>> exp = e.operate.simple_clf(np.array([[1,1,1],[1,2,3]]), np.array([1,1,1]),RandomForestClassifier())
>>> exp.make_report()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "eights/perambulate/perambulate.py", line 162, in make_report
    sub_rep.add_summary_graph_roc_auc()
  File "eights/communicate/communicate.py", line 470, in add_summary_graph_roc_auc
    self.add_summary_graph('roc_auc')
  File "eights/communicate/communicate.py", line 459, in add_summary_graph
    maxy = max(y)
ValueError: max() arg is an empty sequence

zar1 commented 9 years ago

Your labels column has 3 rows but M only has 2 rows. Please make sure that isn't the problem.
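
A check along these lines would surface the mismatch before any classifier runs. This is only a sketch: `check_rows_match` is a hypothetical helper, not something in eights, and it assumes both inputs support `len()` (true for lists and NumPy arrays).

```python
def check_rows_match(M, labels):
    """Raise a readable error if M and labels disagree on row count.

    Hypothetical helper, not part of eights; works on anything with len().
    """
    n_rows, n_labels = len(M), len(labels)
    if n_rows != n_labels:
        raise ValueError(
            "M has {} rows but labels has {} entries".format(n_rows, n_labels))
    return n_rows


# The arrays from the report, written as plain lists.
M = [[1, 1, 1], [1, 2, 3]]
try:
    check_rows_match(M, [1, 1, 1])
except ValueError as err:
    message = str(err)
```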

johnsanterre commented 9 years ago

Failing gracefully is Pythonic.

Same error regardless.

johnsanterre commented 9 years ago

Not a corner case: it also fails on a reasonably sized matrix.

zar1 commented 9 years ago

Also, please try {RandomForestClassifier: {}}, rather than RandomForestClassifier()

johnsanterre commented 9 years ago

>>> exp = e.operate.simple_clf(np.array([[1,1,1],[1,2,3]]), np.array([1,1]),{RandomForestClassifier: {}})
>>> exp.make_report()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "eights/perambulate/perambulate.py", line 162, in make_report
    sub_rep.add_summary_graph_roc_auc()
  File "eights/communicate/communicate.py", line 470, in add_summary_graph_roc_auc
    self.add_summary_graph('roc_auc')
  File "eights/communicate/communicate.py", line 450, in add_summary_graph
    trial, score in getattr(self.__exp, measure)().iteritems()]
  File "eights/perambulate/perambulate.py", line 136, in roc_auc
    return {trial: trial.roc_auc() for trial in self.trials}
  File "eights/perambulate/perambulate.py", line 136, in <dictcomp>
    return {trial: trial.roc_auc() for trial in self.trials}
  File "eights/perambulate/perambulate_helper.py", line 590, in roc_auc
    return self.median_run().roc_auc()
  File "eights/perambulate/perambulate_helper.py", line 415, in roc_auc
    return roc_auc_score(self.__test_y(), self.__pred_proba())
  File "eights/perambulate/perambulate_helper.py", line 330, in pred_proba
    return self.clf.predict_proba(self.test_M())[:,1]
IndexError: index 1 is out of bounds for axis 1 with size 1

zar1 commented 9 years ago

I think there are a few issues here related to stratification. In the test case, only one category is present, but the line that's throwing the error expects two. When we change it to have 2 categories (below), we run into a similar issue. That one is probably because the KFold cross-validation is selecting subsets of the data that, again, only have one label.

>>> exp = e.operate.simple_clf(np.array([[1,1,1],[1,2,3]]), np.array([1,0]),{RandomForestClassifier: {}})
>>> exp.make_report()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/zar1/dssg/eights/eights/perambulate/perambulate.py", line 162, in make_report
    sub_rep.add_summary_graph_roc_auc()
  File "/Users/zar1/dssg/eights/eights/communicate/communicate.py", line 470, in add_summary_graph_roc_auc
    self.add_summary_graph('roc_auc')
  File "/Users/zar1/dssg/eights/eights/communicate/communicate.py", line 450, in add_summary_graph
    trial, score in getattr(self.__exp, measure)().iteritems()]
  File "/Users/zar1/dssg/eights/eights/perambulate/perambulate.py", line 136, in roc_auc
    return {trial: trial.roc_auc() for trial in self.trials}
  File "/Users/zar1/dssg/eights/eights/perambulate/perambulate.py", line 136, in <dictcomp>
    return {trial: trial.roc_auc() for trial in self.trials}
  File "/Users/zar1/dssg/eights/eights/perambulate/perambulate_helper.py", line 590, in roc_auc
    return self.median_run().roc_auc()
  File "/Users/zar1/dssg/eights/eights/perambulate/perambulate_helper.py", line 415, in roc_auc
    return roc_auc_score(self.__test_y(), self.__pred_proba())
  File "/Library/Python/2.7/site-packages/sklearn/metrics/metrics.py", line 593, in roc_auc_score
    sample_weight=sample_weight)
  File "/Library/Python/2.7/site-packages/sklearn/metrics/metrics.py", line 473, in _average_binary_score
    return binary_metric(y_true, y_score, sample_weight=sample_weight)
  File "/Library/Python/2.7/site-packages/sklearn/metrics/metrics.py", line 584, in _binary_roc_auc_score
    raise ValueError("Only one class present in y_true. ROC AUC score "
ValueError: Only one class present in y_true. ROC AUC score is not defined in that case.
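
For illustration, a guard like the one below would turn the single-class case into an explicit "undefined" result rather than an exception deep in the report code. It is a sketch under stated assumptions: `roc_auc_or_none` is a hypothetical name, and the AUC here is a plain pairwise computation written out for self-containment, not sklearn's implementation.

```python
def roc_auc_or_none(y_true, y_score):
    """Return ROC AUC, or None unless y_true holds exactly two classes.

    Hypothetical helper for illustration. AUC is computed as the fraction of
    (positive, negative) pairs ranked correctly, counting ties as half.
    """
    classes = sorted(set(y_true))
    if len(classes) != 2:
        return None  # undefined: with one class, sklearn raises ValueError
    neg_label, pos_label = classes
    pos = [s for y, s in zip(y_true, y_score) if y == pos_label]
    neg = [s for y, s in zip(y_true, y_score) if y == neg_label]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


one_class = roc_auc_or_none([1, 1], [0.9, 0.8])
two_class = roc_auc_or_none([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

A caller such as the report code could then skip the ROC AUC graph whenever the result is `None`.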

johnsanterre commented 9 years ago

In the case of two labels:

>>> exp = e.operate.simple_clf(np.array([[1,1,1],[1,2,3]]), np.array([1,2]),{RandomForestClassifier: {}})
>>> exp.make_report()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "eights/perambulate/perambulate.py", line 162, in make_report
    sub_rep.add_summary_graph_roc_auc()
  File "eights/communicate/communicate.py", line 470, in add_summary_graph_roc_auc
    self.add_summary_graph('roc_auc')
  File "eights/communicate/communicate.py", line 450, in add_summary_graph
    trial, score in getattr(self.__exp, measure)().iteritems()]
  File "eights/perambulate/perambulate.py", line 136, in roc_auc
    return {trial: trial.roc_auc() for trial in self.trials}
  File "eights/perambulate/perambulate.py", line 136, in <dictcomp>
    return {trial: trial.roc_auc() for trial in self.trials}
  File "eights/perambulate/perambulate_helper.py", line 590, in roc_auc
    return self.median_run().roc_auc()
  File "eights/perambulate/perambulate_helper.py", line 415, in roc_auc
    return roc_auc_score(self.__test_y(), self.__pred_proba())
  File "/usr/local/lib/python2.7/site-packages/sklearn/metrics/metrics.py", line 593, in roc_auc_score
    sample_weight=sample_weight)
  File "/usr/local/lib/python2.7/site-packages/sklearn/metrics/metrics.py", line 473, in _average_binary_score
    return binary_metric(y_true, y_score, sample_weight=sample_weight)
  File "/usr/local/lib/python2.7/site-packages/sklearn/metrics/metrics.py", line 584, in _binary_roc_auc_score
    raise ValueError("Only one class present in y_true. ROC AUC score "
ValueError: Only one class present in y_true. ROC AUC score is not defined in that case.

zar1 commented 9 years ago

After further investigation, the cross-validation algo used for simple_clf is eights.perambulate.perambulate_helper.NoCV, which reserves no test set. Because it reserves no test set, anything that requires a test set doesn't work. It should probably use sklearn's cross-validation to return a single fold that has both a train and a test set in it.
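
To make the "single fold with both a train and a test set" idea concrete, here is a rough pure-Python sketch. It is not the NoCV replacement itself and does not use sklearn; `single_stratified_fold`, `test_fraction`, and `seed` are hypothetical names chosen for illustration. It splits per class so that both sides see every label that has at least two samples.

```python
import random


def single_stratified_fold(labels, test_fraction=0.25, seed=0):
    """Yield one (train_indices, test_indices) pair, stratified by label.

    Hypothetical stand-in for a CV object that reserves a test set; each
    class with at least two samples contributes to both sides.
    """
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    train, test = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Aim for at least one test sample per class, never the whole class.
        n_test = max(1, int(round(len(indices) * test_fraction)))
        n_test = min(n_test, len(indices) - 1)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    yield sorted(train), sorted(test)


labels = [0, 0, 0, 0, 1, 1, 1, 1]
folds = list(single_stratified_fold(labels))
train, test = folds[0]
```

With the two-row examples from this thread no such split can work, since each class has a single sample and ends up entirely in the training side.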

zar1 commented 9 years ago

There are two things we need to do to resolve this:

  1. perambulate.Experiment should throw friendly errors when we're trying to run evaluations without having a test set
  2. operate.simple_* should either use real CVs by default or should have documentation stating that you can't use them for evaluation.
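
As a sketch of what a friendly error for the first item might look like (the names `NoTestSetError` and `require_test_set` are hypothetical and not taken from eights or diogenes):

```python
class NoTestSetError(ValueError):
    """Raised when an evaluation is requested but no test rows were held out."""


def require_test_set(test_indices, measure):
    """Fail early with a readable message instead of a deep traceback.

    Hypothetical guard; `measure` is the metric name, e.g. 'roc_auc'.
    """
    if len(test_indices) == 0:
        raise NoTestSetError(
            "Cannot compute {!r}: this experiment reserved no test set. "
            "Use a cross-validation scheme that holds out test rows."
            .format(measure))
    return test_indices


try:
    require_test_set([], 'roc_auc')
except NoTestSetError as err:
    message = str(err)
```

Subclassing `ValueError` is a design choice here so that existing `except ValueError` handlers would still catch it.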

zar1 commented 8 years ago

These are solved by diogenes b0e2f8c