alteryx / evalml

EvalML is an AutoML library written in python.
https://evalml.alteryx.com
BSD 3-Clause "New" or "Revised" License

Handle component graph case of different input lengths into Estimator #2823

Open christopherbunn opened 3 years ago

christopherbunn commented 3 years ago

During development of the stacked ensembler component (#1930), we realized that when ensembling pipelines, the final ensembler could receive inputs of different lengths from its input components.

One example is fitting an ensembling pipeline composed of two pipelines: one with an Oversampler and one without. The Oversampler adds rows that are not present in the other pipeline, so the metalearner's input_x has NaN values for the missing rows. This, in turn, raises an error in the metalearner.
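To make the failure mode concrete, here is a minimal pandas-only sketch of how two branch outputs of different lengths end up with NaNs when combined by index. It is not evalml's actual code: the values are made up, and the assumption that the outputs are joined column-wise on their index is mine.

```python
import pandas as pd

# Output of a branch without an Oversampler: one row per original sample.
en_out = pd.Series([0.2, 0.7, 0.4], index=[0, 1, 2], name="EN.x")

# Output of a branch with an Oversampler: two extra, resampled rows (made-up values).
rf_out = pd.Series([0.1, 0.8, 0.3, 0.9, 0.6], index=[0, 1, 2, 3, 4], name="RF.x")

# Combining column-wise aligns on the index, so the shorter branch
# gets NaN for every row that only exists in the oversampled branch.
input_x = pd.concat([rf_out, en_out], axis=1)
nan_rows = input_x[input_x.isna().any(axis=1)].index.tolist()

print(input_x)
print("Rows with NaN:", nan_rows)
```

The rows listed in `nan_rows` are exactly the ones the oversampled branch added, which is what the metalearner then chokes on.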

The currently suggested solution is to detect when there are NaNs in the input_x of an Estimator component and raise an error saying that the component graph is misaligned.
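A rough sketch of what such a check could look like, using plain pandas. The helper name `check_estimator_input` and the error wording are hypothetical, not part of evalml; where this would hook into the component graph is left open.

```python
import pandas as pd


def check_estimator_input(input_x: pd.DataFrame, component_name: str) -> None:
    """Hypothetical helper: raise if any estimator input column contains NaN."""
    # Collect the names of all columns that have at least one NaN.
    nan_cols = input_x.columns[input_x.isna().any()].tolist()
    if nan_cols:
        raise ValueError(
            f"Input to estimator '{component_name}' contains NaN values in "
            f"columns {nan_cols}; its parent components may have produced "
            "outputs of different lengths."
        )
```

One caveat with this approach: a plain NaN check cannot tell a length mismatch apart from NaNs that a parent component legitimately passed through, so the message can only suggest the likely cause.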

cc: @angela97lin

tyler3991 commented 3 years ago

@christopherbunn, hey, we are in refinement right now talking about this. Could you put some repro steps and/or repro code?

christopherbunn commented 3 years ago

@tyler3991, here's a repro example:

import evalml
from evalml import AutoMLSearch
from sklearn import datasets
import pytest

# Imbalanced binary dataset; `weights` is documented as array-like (one entry per class), not a dict
X, y = datasets.make_classification(n_samples=100, n_features=20, weights=[0.1, 0.9], random_state=0)

X_train, X_holdout, y_train, y_holdout = evalml.preprocessing.split_data(X, y, problem_type='binary', test_size=.2)

cg = {
    'OS': ['Oversampler', 'X', 'y'],
    'RF': ['Random Forest Classifier', 'OS.x', 'OS.y'],
    'EN': ['Elastic Net Classifier', 'X', 'y'],
    'EN_2': ['Elastic Net Classifier', 'RF.x', 'EN.x', 'OS.y'],
}

pl = evalml.pipelines.BinaryClassificationPipeline(cg, parameters={'OS': {'sampling_ratio': 0.5}})

# Show the class imbalance in the training split
print(y_train.value_counts())

# Raw string so that \( and \) reach the regex engine without invalid-escape warnings
with pytest.raises(ValueError, match=r"Input contains NaN, infinity or a value too large for dtype\('float64'\)."):
    pl.fit(X_train, y_train)

Here, RF is fed by an Oversampler (OS) that adds rows to the data. When RF's output is combined with the output of EN, the input frame created for the EN_2 estimator has NaN values filled in for the EN column on the rows that exist only in the oversampled data.