alegonz / baikal

A graph-based functional API for building complex scikit-learn pipelines.
https://baikal.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Example/Help with Dealing with MultiLabel + Stacked Classifier Case #49

Closed DMTSource closed 3 years ago

DMTSource commented 3 years ago

The docs show an example and explanation of working with stacked classifiers and how to use attr_dict + predict_proba to avoid overfitting. I am attempting to implement this usage of predict_proba in my model for a multilabel classification case, but I am facing some challenges with this upgrade.

In the docs the issue is fixed with the _drop_firstcol lambda, which handles single-class inference with predict_proba. I have created a similar lambda that does the same task for each class (see the example predict_proba output below).

multilabel_proba_reduced = Lambda(lambda prpr: np.array([prpr[i][:, 1:].flatten() for i in range(n_classes)]).T)

This returns an array of shape (n_samples, n_classes) from the predict_proba output.

After I got the model to work in training, the prediction step is now throwing errors. I think I'm close, but my solution, similar to the doc example's 'drop_first_col' lambda, is hidden inside the overridden 'fit_predict' function. I am guessing this is why things break at predict time, as the operation is not in the graph outside of fitting. When I attempted to fix this with a Lambda as in the example, I ran into many issues trying to get things right for the training step, and went in circles.

To illustrate the primary difference: predict_proba returns two probabilities for each class. A y sample takes the form:

y_test[0] == [1 1 0]
y_test[1] == [0 1 0]

Then the output from predict_proba takes on the form:

[
    array([[0.46147748, 0.53852252],
           [0.46147748, 0.53852252],
           [0.52721207, 0.47278793]]),

    array([[0.55917461, 0.44082539],
           [0.44082539, 0.55917461],
           [0.50852903, 0.49147097]])
]
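As a rough sketch of the reduction described above, the snippet below copies the two printed arrays into a list (one array per label, with columns assumed to be P(label=0) and P(label=1)) and applies the lambda's logic with plain NumPy. The variable names are made up for illustration, and `np.column_stack` is shown only as an equivalent way to get the same (n_samples, n_labels) array.

```python
import numpy as np

# Values copied from the predict_proba output printed above:
# one (n_samples, 2) array per label.
proba = [
    np.array([[0.46147748, 0.53852252],
              [0.46147748, 0.53852252],
              [0.52721207, 0.47278793]]),
    np.array([[0.55917461, 0.44082539],
              [0.44082539, 0.55917461],
              [0.50852903, 0.49147097]]),
]
n_labels = len(proba)

# Each row sums to 1, so dropping the first column loses no information.
rows_sum_to_one = all(np.allclose(p.sum(axis=1), 1.0) for p in proba)

# Reduction as written in the issue: keep column 1, flatten, then transpose.
reduced_flatten = np.array(
    [proba[i][:, 1:].flatten() for i in range(n_labels)]
).T

# Equivalent formulation: stack the kept columns side by side.
reduced_stack = np.column_stack([p[:, 1:] for p in proba])

print(reduced_flatten.shape)  # (n_samples, n_labels)
```

Both forms keep one probability per label and sample, which is the shape a second-level classifier can take as ordinary features.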

Describe the solution you'd like

I have a script that runs, but it crashes at the predict step after training; the error is shared in a comment below the code. Any suggestions for getting around my confusion in trying to tie together the first- and second-level classifiers would be very much appreciated!

PLEASE SEE THE FULL EXAMPLE CODE HERE https://gist.github.com/DMTSource/368b09e2c7f780f1355606f6e716d197

Terminal error here: https://gist.github.com/DMTSource/368b09e2c7f780f1355606f6e716d197#gistcomment-3677803

alegonz commented 3 years ago

There is a lot going on in this script and it will take me a bit to look at it in detail. At first glance I don't even understand why the fit step is succeeding in the first place. I have my hands a bit full at the moment; I'll get back to you, possibly over the weekend.

DMTSource commented 3 years ago

Sorry about the script; it does appear to be a bit of a mess. I will try to clean it up and rephrase some things to hopefully save you some time:

In the above script I had to throw in a bunch of column stacks to get the outputs of RandomForestClassifier to work, due to errors with the multilabel output. But as you touched on, my magically getting fit to work in this way was not a real success.

Please ignore the first script. Here is the same process/attempt, but applied as closely as I could to the "Stacked classifiers (standard protocol)" example so it's easier to follow what I am trying to do, i.e. work with multilabel outputs via the predict_proba route. https://gist.github.com/DMTSource/26b6a386a6ba54f23d0ae0a9d22ddbfa

I think my big issue/mistake (as mentioned before) is that instead of using a Lambda function in the graph, I sort out the multilabel extraction (leaving out 1 of the 2 values that sum to 1) inside the 'fit_predict' operation. This probably means that when it's time to predict (where the crash occurs), there is no operation in the graph to perform this reduction, and we see a shape error of some kind. But I am having trouble trying to use Lambda in this manner, as the first layer of classifiers then complains about its outputs being the wrong shape due to the multilabel output.

alegonz commented 3 years ago

Thank you for simplifying the script!

Yes, you are right. The output of fit_predict (a single array) and the output of predict (a list of arrays) are not consistent. During fit, fit_predict (with the custom stacking of multi-output probabilities) is used, so ColumnStack does not complain because it only receives a single array. But during predict, predict_proba is used instead, and it returns a list of several arrays, which conflicts with the expected number of outputs (just one).
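The mismatch can be illustrated with a toy stand-in. This is only a sketch: `ToyMultilabelClassifier` and `count_outputs` are hypothetical names invented here, not baikal or scikit-learn code, and the counting rule is an assumption made for illustration rather than how baikal actually checks outputs.

```python
import numpy as np


class ToyMultilabelClassifier:
    """Hypothetical stand-in that mimics the two code paths in the thread."""

    n_labels = 2

    def predict_proba(self, X):
        # Predict-time path: one (n_samples, 2) array per label.
        n_samples = len(X)
        return [np.full((n_samples, 2), 0.5) for _ in range(self.n_labels)]

    def fit_predict(self, X, y):
        # Fit-time path with custom stacking: a single 2-D array.
        return np.column_stack([p[:, 1:] for p in self.predict_proba(X)])


def count_outputs(result):
    """Treat a list or tuple as several outputs, anything else as one."""
    return len(result) if isinstance(result, (list, tuple)) else 1


X = np.zeros((4, 3))
y = np.zeros((4, 2), dtype=int)
clf = ToyMultilabelClassifier()

fit_time = clf.fit_predict(X, y)     # what downstream steps see during fit
predict_time = clf.predict_proba(X)  # what they see during predict

print(count_outputs(fit_time), count_outputs(predict_time))
```

The fit-time path yields one array while the predict-time path yields two, so a graph declared with a single output only lines up with the first.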

I guess there are other ways if you play around somehow with Lambdas or ColumnStacks, but I think the easiest way is to override the API of RandomForestClassifier:

import numpy as np

import sklearn.ensemble
import sklearn.linear_model
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_predict, train_test_split

from baikal import Input, Model, make_step
from baikal.plot import plot_model
from baikal.steps import ColumnStack

def stack_multioutput_proba(mop):
    return np.column_stack([c[:, 1:] for c in mop])  # or :-1 if you prefer to drop the last

def predict_proba_stacked(self, X):
    # NOTE: plain super() does not work
    mop = super(RandomForestClassifier, self).predict_proba(X)
    return stack_multioutput_proba(mop)

def fit_predict(self, X, y):
    self.fit(X, y)
    cvp = cross_val_predict(self, X, y, method="predict_proba")  # note that this is NOT predict_proba_stacked
    return stack_multioutput_proba(cvp)

attr_dict = {"fit_predict": fit_predict, "predict_proba_stacked": predict_proba_stacked}
RandomForestClassifier = make_step(sklearn.ensemble.RandomForestClassifier, attr_dict)
ExtraTreesClassifier = make_step(sklearn.ensemble.ExtraTreesClassifier)

# ------- Random Multilabel dataset
np.random.set_state(np.random.RandomState(0).get_state())
X = np.random.random((1000, 50)) # feature array
y_p = np.random.randint(0, 2, (1000, 2))  # multilabel array, e.g. sample [1 1] means both classes detected
X_train, X_test, y_train, y_test = train_test_split(
    X, y_p, test_size=0.2, random_state=0
)

# ------- Build model
x = Input()
y_t = Input()
y_p1 = RandomForestClassifier(random_state=0)(x, y_t, compute_func="predict_proba_stacked")
y_p2 = RandomForestClassifier(random_state=0)(x, y_t, compute_func="predict_proba_stacked")

stacked_features = ColumnStack()([y_p1, y_p2])
y_p = ExtraTreesClassifier(random_state=0)(stacked_features, y_t)

model = Model(x, y_p, y_t)
plot_model(model, filename="stacked_classifiers_standard.png", dpi=96)

# ------- Train model
model.fit(X_train, y_train)

# ------- Evaluate model
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)

print("F1 score on train data:", f1_score(y_train, y_train_pred, average=None))
print("F1 score on test data:", f1_score(y_test, y_test_pred, average=None))

Essentially, override the original API of RandomForestClassifier with something that is more easily handled by your application. In this case, note that instead of overriding predict_proba I created another method, predict_proba_stacked, that stacks the outputs. This is because cross_val_predict (a function native to scikit-learn) expects the original predict_proba, which gives a list of outputs.

Also note that since I'm using super perhaps it would be easier and more readable to just use the sub-classing style (inheriting from Step and the classifier class) instead of using make_step.
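The remark about plain super() can be reproduced without baikal. The sketch below assumes the functions in attr_dict end up attached to a dynamically created class, which the thread does not spell out; `Base`, `Patched`, and `Subclassed` are hypothetical toy classes used only to show the Python behaviour.

```python
class Base:
    def predict_proba(self, X):
        # Toy output: one [P(0), P(1)] pair per sample.
        return [[1 - x, x] for x in X]


def with_plain_super(self, X):
    # Zero-argument super() relies on a __class__ cell that only exists
    # for functions defined inside a class body.
    return super().predict_proba(X)


def with_explicit_super(self, X):
    # Naming the class explicitly works for functions attached later.
    return super(Patched, self).predict_proba(X)


# Attach the module-level functions to a class created at runtime.
Patched = type(
    "Patched",
    (Base,),
    {"plain": with_plain_super, "explicit": with_explicit_super},
)

try:
    Patched().plain([0.75])
    plain_error = None
except RuntimeError as exc:
    plain_error = str(exc)

explicit_result = Patched().explicit([0.75])


class Subclassed(Base):
    def predict_proba_stacked(self, X):
        # Inside a class body, plain super() works as usual.
        return [row[1] for row in super().predict_proba(X)]


subclassed_result = Subclassed().predict_proba_stacked([0.75])
print(plain_error)
```

With the sub-classing style the method lives in the class body, so the explicit class reference is no longer needed.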

alegonz commented 3 years ago

By the way, note that steps accept an n_outputs argument that is meant precisely for these cases. That argument allows you to specify the number of outputs you expect from the step (2 in this example). I haven't tried it, but if you specify n_outputs=2, you should be able to do it without overriding predict_proba and without stacking the outputs within fit_predict. fit_predict could just return the list of arrays, just like predict_proba, and the column stacking could then be done with Lambda and ColumnStack steps.

alegonz commented 3 years ago

If either of the above solutions works and helps you achieve what you want, it would be nice to add a new example of this use case :)

alegonz commented 3 years ago

For completeness here is the same model implemented using n_outputs and Lambda steps. You can confirm that it produces the same results as the script above. On second thought, this style seems simpler.

import numpy as np

import sklearn.ensemble
import sklearn.linear_model
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_predict, train_test_split

from baikal import Input, Model, make_step
from baikal.plot import plot_model
from baikal.steps import ColumnStack, Lambda

def fit_predict(self, X, y):
    self.fit(X, y)
    return cross_val_predict(self, X, y, method="predict_proba")

attr_dict = {"fit_predict": fit_predict}
RandomForestClassifier = make_step(sklearn.ensemble.RandomForestClassifier, attr_dict)
ExtraTreesClassifier = make_step(sklearn.ensemble.ExtraTreesClassifier)

# ------- Random Multilabel dataset
np.random.set_state(np.random.RandomState(0).get_state())

n_outputs = 2
X = np.random.random((1000, 50))  # feature array
y_p = np.random.randint(0, n_outputs, (1000, n_outputs))  # multilabel array, e.g. sample [1 1] means both classes detected
X_train, X_test, y_train, y_test = train_test_split(
    X, y_p, test_size=0.2, random_state=0
)

# ------- Build model
# The model is built similarly to the naive case. The difference is that during fit
# baikal will detect and use the fit_predict method above.
x = Input()
y_t = Input()
y_p1 = RandomForestClassifier(random_state=0, n_outputs=n_outputs)(x, y_t, compute_func="predict_proba")
y_p2 = RandomForestClassifier(random_state=0, n_outputs=n_outputs)(x, y_t, compute_func="predict_proba")

stack_multioutput_proba = Lambda(lambda mop: np.column_stack([c[:, 1:] for c in mop]))
y_p1 = stack_multioutput_proba(y_p1)
y_p2 = stack_multioutput_proba(y_p2)

stacked_features = ColumnStack()([y_p1, y_p2])
y_p = ExtraTreesClassifier(random_state=0)(stacked_features, y_t)

model = Model(x, y_p, y_t)
plot_model(model, filename="stacked_classifiers_standard_2.png", dpi=96)

# ------- Train model
model.fit(X_train, y_train)

# ------- Evaluate model
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)

print("F1 score on train data:", f1_score(y_train, y_train_pred, average=None))
print("F1 score on test data:", f1_score(y_test, y_test_pred, average=None))

alegonz commented 3 years ago

The solutions above should solve the issue so I'll close this. If the issue is not solved feel free to reopen.