alegonz / baikal

A graph-based functional API for building complex scikit-learn pipelines.
https://baikal.readthedocs.io
BSD 3-Clause "New" or "Revised" License
592 stars 30 forks

GridSearchCV with multiple inputs issue #50

Closed DMTSource closed 1 year ago

DMTSource commented 3 years ago

I am trying to switch a working model, based on readme_long_example, to fit through GridSearchCV. When I call fit, GridSearchCV does not accept my multiple inputs and raises an error. Before this change I was able to fit and predict with the model directly.

For example: gscv.fit([X1_train, X2_train, X3_train], Y1_train, **fit_params)

# Console output showing shape of x's/y and the error that appears now when trying to use GridSearchCV.
Preparing baikal model...
X1 Shape: (14206, 478)
X2 Shape: (14206, 14)
X3 Shape: (14206, 508)
Y1 Shape: (14206, 2)

Training baikal model...
Traceback (most recent call last):
  File "train.py", line 130, in <module>
    main()
  File "train.py", line 112, in main
    mode='classification')
  File "<edited>dir/model.py", line 246, in train
    gscv.fit([X1_train, X2_train, X3_train], Y1_train, **fit_params)
  File "/home/anaconda3/envs/baikal/lib/python3.7/site-packages/sklearn/utils/validation.py", line 63, in inner_f
    return f(*args, **kwargs)
  File "/home/anaconda3/envs/baikal/lib/python3.7/site-packages/sklearn/model_selection/_search.py", line 759, in fit
    X, y, groups = indexable(X, y, groups)
  File "/home/anaconda3/envs/baikal/lib/python3.7/site-packages/sklearn/utils/validation.py", line 299, in indexable
    check_consistent_length(*result)
  File "/home/anaconda3/envs/baikal/lib/python3.7/site-packages/sklearn/utils/validation.py", line 263, in check_consistent_length
    " samples: %r" % [int(l) for l in lengths])
ValueError: Found input variables with inconsistent numbers of samples: [3, 14206]

How to reproduce it?
I have modified readme_long_example to show that the same behavior occurs there as well. It raises the same kind of error, which suggests the input shape is the cause: ValueError: Found input variables with inconsistent numbers of samples: [2, 426]

The runnable, modified example can be found here: https://gist.github.com/DMTSource/2b38b473270a50e71025dd6cb1c03521

What versions are you using?
baikal==0.4.2
scikit-learn==0.24.1
Python 3.7.6 (anaconda env)

alegonz commented 3 years ago

Hi Derek,

Thank you for providing the reproducing examples. I think the issue here is that GridSearchCV, as implemented by sklearn, is meant for single inputs. I realize that, other than the rather obscure comment in SKLearnWrapper quoted below, it is not obvious that you cannot pass multiple inputs or multiple outputs when using SKLearnWrapper + GridSearchCV. I'll improve the docs to make this more obvious.

class SKLearnWrapper:
    """Wrapper utility class that allows models to used in scikit-learn's
    ``GridSearchCV`` API. It follows the style of Keras' own wrapper.

    A future release of **baikal** plans to remove this class and instead
    include a custom ``GridSearchCV`` API, based on the original scikit-learn
    implementation, that can handle baikal models natively.
    """
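
To make the `[3, 14206]` in the traceback concrete, here is a minimal sketch of what the length check appears to be comparing. It uses only NumPy, a reduced sample count (`n_samples` is a stand-in chosen here, not a value from the report), and it is an illustration of the mismatch rather than scikit-learn's actual validation code.

```python
import numpy as np

n_samples = 100  # stand-in for the 14206 rows in the report

# Three feature blocks with the column counts from the report.
X1 = np.zeros((n_samples, 478))
X2 = np.zeros((n_samples, 14))
X3 = np.zeros((n_samples, 508))
Y1 = np.zeros((n_samples, 2))

inputs = [X1, X2, X3]

# A plain Python list is measured by its own length: the number of arrays
# it holds, not the number of rows inside each array.
lengths = [len(inputs), len(Y1)]
print(lengths)  # [3, 100]
```

With the full data the second number would be 14206, which matches the pair reported in the ValueError.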

In the meantime, I think you can work around it by merging the multiple inputs (and multiple outputs, if any) into a single array before feeding them into the model, then splitting them inside the model with Split and Stack-ing the outputs.
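
The data side of that workaround might look like the sketch below. It uses only NumPy and random stand-in data with the column counts from the report; the variable names are illustrative. It shows how to merge the three inputs column-wise and how to compute the column boundaries that recover them. Whether baikal's Split step takes boundaries in exactly this form is an assumption to check against the baikal documentation.

```python
import numpy as np

n_samples = 100  # stand-in for the 14206 rows in the report
rng = np.random.default_rng(0)

# Random stand-ins for the three inputs, using the reported column counts.
X1 = rng.normal(size=(n_samples, 478))
X2 = rng.normal(size=(n_samples, 14))
X3 = rng.normal(size=(n_samples, 508))
blocks = [X1, X2, X3]

# Merge column-wise so GridSearchCV sees one array with n_samples rows.
X_merged = np.hstack(blocks)
print(X_merged.shape)  # (100, 1000)

# Column boundaries between the blocks: cumulative widths minus the last.
widths = [block.shape[1] for block in blocks]
split_points = np.cumsum(widths)[:-1]
print(split_points.tolist())  # [478, 492]

# Splitting at those boundaries gives back the original three blocks.
recovered = np.split(X_merged, split_points, axis=1)
print([r.shape for r in recovered])  # [(100, 478), (100, 14), (100, 508)]
```

The merged array can then be passed as the single X to gscv.fit, with the split happening inside the model so each branch still receives its own columns.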

DMTSource commented 3 years ago

Thank you for the quick response! I will give the workaround a try as that sounds like a simple/great solution!

alegonz commented 1 year ago

Closing due to inactivity. Feel free to reopen if you need further help.