Closed oasidorshin closed 3 years ago
Sorry for the delay, I can confirm this happens with the following code:
import numpy as np
import autosklearn.regression
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
if __name__ == "__main__":
    X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, n_targets=3)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

    automl_cv = autosklearn.regression.AutoSklearnRegressor(
        time_left_for_this_task=60,  # in seconds
        disable_evaluator_output=False,
        resampling_strategy='cv',
        resampling_strategy_arguments={'folds': 5},
        n_jobs=2,
        memory_limit=3072,
    )
    automl_cv.fit(X_train, y_train)
    predictions = automl_cv.predict(X_test)
    print(y_test.shape)       # (250, 3)
    print(predictions.shape)  # (3, 5)
I will look into this!
After some more digging, this turns out to be related to how we use sklearn's VotingRegressor, which we use to store the cross-validation models and get their averaged predictions. It does not actually support multioutput regression, as seen in the error log produced below and in its fit method, which checks that y is 1d. This is also specified in the fit documentation.
However, we fit the models beforehand and then manually set the estimators_ attribute, which skips this check.
As this does not seem intended for multioutput regression, the two solutions I see for auto-sklearn are:
# Testing multioutput regression
import traceback

import numpy as np
from sklearn.datasets import make_regression
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import VotingRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, n_targets=3)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
# Fit beforehand and manually set estimators_
models = [DummyRegressor().fit(X_train, y_train) for _ in range(5)]
vr = VotingRegressor(estimators=None)
vr.estimators_ = models
# Raw model outputs are there
print(vr.transform(X_test).shape) # shape (3, 250, 5)
# VotingRegressor averages on wrong dimension for us
print(vr.predict(X_test).shape) # shape (3, 5)
# def predict(...):
# return np.average(self._predict(X), axis=1)
# Manual averaging solution: average over the estimator axis, then transpose
print(np.average(vr.transform(X_test), axis=2).T.shape)  # shape (250, 3)
# Using it as intended causes error
models = [DummyRegressor() for _ in range(5)]
vr = VotingRegressor(estimators=models)
try:
    vr.fit(X_train, y_train)
except Exception:
    traceback.print_exc()
# python test_voting_regressor.py
(3, 250, 5)
(3, 5)
(250, 3)
Traceback (most recent call last):
  File "test_voting_regressor.py", line 33, in <module>
    vr.fit(X_train, y_train)
  File "/home/skantify/code/asklearn/issue_1169/auto-sklearn/.venv/lib/python3.8/site-packages/sklearn/ensemble/_voting.py", line 484, in fit
    y = column_or_1d(y, warn=True)
  File "/home/skantify/code/asklearn/issue_1169/auto-sklearn/.venv/lib/python3.8/site-packages/sklearn/utils/validation.py", line 63, in inner_f
    return f(*args, **kwargs)
  File "/home/skantify/code/asklearn/issue_1169/auto-sklearn/.venv/lib/python3.8/site-packages/sklearn/utils/validation.py", line 921, in column_or_1d
    raise ValueError(
ValueError: y should be a 1d array, got an array of shape (750, 3) instead.
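To make the manual-averaging idea concrete, here is a sketch of a VotingRegressor subclass whose predict averages over the estimator axis so multioutput shapes survive. MultiOutputVotingRegressor is a hypothetical name of my own, not part of sklearn or auto-sklearn, and this is not necessarily the fix that was actually merged:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import VotingRegressor
from sklearn.model_selection import train_test_split

class MultiOutputVotingRegressor(VotingRegressor):
    """Hypothetical subclass: average per-estimator predictions over the
    estimator axis, keeping the (n_samples, n_targets) layout intact."""
    def predict(self, X):
        # Stack raw predictions: shape (n_estimators, n_samples, n_targets)
        outputs = np.asarray([est.predict(X) for est in self.estimators_])
        # Average over the estimator axis, not the sample axis
        return np.average(outputs, axis=0)

X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, n_targets=3)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Same manual-estimators_ pattern as above, but predict now keeps the shape
vr = MultiOutputVotingRegressor(estimators=None)
vr.estimators_ = [DummyRegressor().fit(X_train, y_train) for _ in range(5)]
print(vr.predict(X_test).shape)  # (250, 3)
```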
Hi @oasidorshin,
The issue has been fixed in PR #1217 and we now test for it and other related situations. This should be in the development branch next week and hopefully in a release in the following week :)
Sounds good, thanks a lot!
Describe the bug
After fitting a multioutput regression with cv, the shape of the predictions is constant, equal to (number_of_targets, number_of_cv_folds), regardless of the shape of the prediction samples.
I'm not sure whether this is intended; in any case, I think this behavior should be explained more thoroughly in the manual.
To Reproduce
Please see attached notebook with code and output. issue_cv.zip
Expected behavior
The first dimension of the predict() output equals the number of input samples.
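The expected shape contract can be sketched with a plain sklearn regressor that handles multioutput targets; MultiOutputRegressor around LinearRegression is just an illustrative choice, not what auto-sklearn uses internally:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputRegressor

X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, n_targets=3)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# A well-behaved multioutput regressor returns one row per test sample
model = MultiOutputRegressor(LinearRegression()).fit(X_train, y_train)
predictions = model.predict(X_test)
print(predictions.shape)  # (250, 3) -- matches y_test.shape
```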