Multioutput regression with cv, incorrect predict shape?

oasidorshin commented 3 years ago

Describe the bug

After fitting a multioutput regression with cv, the shape of the predictions is constant and equal to (number_of_targets, number_of_cv_folds), regardless of how many samples are passed to predict().

I'm not sure whether this is intended. Either way, I think this behavior should be explained more thoroughly in the manual.

To Reproduce

Please see attached notebook with code and output. issue_cv.zip

Expected behavior

One dimension of the predict() output should equal the number of samples.

eddiebergman commented 3 years ago

Sorry for the delay, I can confirm this happens with the following code:

import numpy as np
import autosklearn.regression
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

if __name__ == "__main__":
    X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, n_targets=3)

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

    automl_cv = autosklearn.regression.AutoSklearnRegressor(
        time_left_for_this_task=60,  # In seconds
        disable_evaluator_output=False,
        resampling_strategy='cv',
        resampling_strategy_arguments={'folds': 5},
        n_jobs=2,
        memory_limit=3072,
    )
    automl_cv.fit(X_train, y_train)
    predictions = automl_cv.predict(X_test)

    print(y_test.shape) # (250, 3)
    print(predictions.shape) # (3, 5)

I will look into this!

eddiebergman commented 3 years ago

After some more digging, this turns out to be related to how we use sklearn's VotingRegressor, which stores the cross-validation models and averages their predictions. VotingRegressor does not actually support multi-output regression, as seen in the error log produced below and in its fit method, which checks that y is 1d. This is specified in the fit documentation.

However, we fit the models beforehand and then manually set the estimators_ attribute, which skips this check.

As VotingRegressor does not seem intended for multi-output regression, the two solutions I see for auto-sklearn are:

1. Write our own ensemble class for the cross-validation models, which does these predictions for all the different task types.
2. Manually do the averaging only if the task is multi-output regression, which raises the question of why we use the VotingRegressor at all.

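import traceback

import numpy as np
from sklearn.datasets import make_regression
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import VotingRegressor
from sklearn.model_selection import train_test_split

# Testing multioutput regression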
X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, n_targets=3)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Fit beforehand and manually set estimators_, which skips the 1d check in fit
models = [DummyRegressor().fit(X_train, y_train) for _ in range(5)]
vr = VotingRegressor(estimators=None)
vr.estimators_ = models

# Raw model outputs are there
print(vr.transform(X_test).shape) # shape (3, 250, 5)

# VotingRegressor averages on wrong dimension for us
print(vr.predict(X_test).shape) # shape (3, 5)
# def predict(...):
#   return np.average(self._predict(X), axis=1)

# Manual averaging solution
print(np.average(vr.transform(X_test), axis=2).T.shape)

# Using it as intended causes error
models = [DummyRegressor() for _ in range(5)]
vr = VotingRegressor(estimators=models)
try:
    vr.fit(X_train, y_train)
except ValueError:
    traceback.print_exc()

# python test_voting_regressor.py
(3, 250, 5)
(3, 5)
(250, 3)
Traceback (most recent call last):
  File "test_voting_regressor.py", line 33, in <module>
    vr.fit(X_train, y_train)
  File "/home/skantify/code/asklearn/issue_1169/auto-sklearn/.venv/lib/python3.8/site-packages/sklearn/ensemble/_voting.py", line 484, in fit
    y = column_or_1d(y, warn=True)
  File "/home/skantify/code/asklearn/issue_1169/auto-sklearn/.venv/lib/python3.8/site-packages/sklearn/utils/validation.py", line 63, in inner_f
    return f(*args, **kwargs)
  File "/home/skantify/code/asklearn/issue_1169/auto-sklearn/.venv/lib/python3.8/site-packages/sklearn/utils/validation.py", line 921, in column_or_1d
    raise ValueError(
ValueError: y should be a 1d array, got an array of shape (750, 3) instead.
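
For reference, the first of the two solutions listed above (our own ensemble class) could look roughly like the sketch below. This is only an illustration and not what auto-sklearn actually implements: FoldAveragingRegressor and ConstantModel are hypothetical names made up for this example, and ConstantModel is just a stand-in for an already-fitted regressor so the snippet runs with numpy alone. The idea is to stack the per-fold predictions on a new leading axis and average over that axis only, so the sample and target dimensions are left untouched.

import numpy as np


class FoldAveragingRegressor:
    """Hypothetical holder for already-fitted cross-validation models."""

    def __init__(self, fitted_models):
        # Assumption: every model is already fitted and exposes predict(X)
        self.estimators_ = list(fitted_models)

    def predict(self, X):
        # Shape is (n_folds, n_samples) for single-output models and
        # (n_folds, n_samples, n_targets) for multi-output models
        fold_predictions = np.stack(
            [model.predict(X) for model in self.estimators_], axis=0
        )
        # Average over the fold axis only
        return fold_predictions.mean(axis=0)


class ConstantModel:
    """Stand-in for a fitted multi-output regressor (illustration only)."""

    def __init__(self, value, n_targets):
        self.value = float(value)
        self.n_targets = n_targets

    def predict(self, X):
        # Always predicts the same constant for every sample and target
        return np.full((len(X), self.n_targets), self.value)


X_test = np.zeros((250, 10))
models = [ConstantModel(value=v, n_targets=3) for v in range(5)]
ensemble = FoldAveragingRegressor(models)
predictions = ensemble.predict(X_test)
print(predictions.shape)  # (250, 3)

Because the fold axis is always the first one after stacking, the same predict works for 1d and multi-output targets without a special case.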

eddiebergman commented 3 years ago

Hi @oasidorshin,

The issue has been fixed in PR #1217 and we now test for it and other related situations. This should be in the development branch next week and hopefully in a release in the following week :)

oasidorshin commented 3 years ago

Sounds good, thanks a lot!
