flatironinstitute / nemos

NEural MOdelS, a statistical modeling framework for neuroscience.
https://nemos.readthedocs.io/
MIT License

scikit-learn dependency #40

Open BalzaniEdoardo opened 1 year ago

BalzaniEdoardo commented 1 year ago

We might consider adding scikit-learn as a dependency and inheriting from `sklearn.base.BaseEstimator` directly, instead of implementing our own base class.

The reasoning behind this is the following: 1) less code to maintain; 2) guaranteed compatibility with metadata routing, which is still experimental in scikit-learn but could be potentially useful and hard to implement/maintain on our own.

If we don't care about routing, implementing and maintaining our own `get_params`/`set_params` is light work, so I would also be fine with keeping scikit-learn compatibility while avoiding the dependency.
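To make the trade-off concrete, here is a minimal sketch of the get/set machinery we would maintain ourselves. It mirrors scikit-learn's convention of introspecting `__init__` for hyperparameter names; the class names are hypothetical, not the nemos API.

```python
import inspect


class Base:
    """Minimal sketch of a self-maintained get/set base class (illustrative only)."""

    @classmethod
    def _get_param_names(cls):
        # Introspect __init__ to discover hyperparameter names, mirroring
        # the convention scikit-learn uses in BaseEstimator.
        sig = inspect.signature(cls.__init__)
        return sorted(
            p.name
            for p in sig.parameters.values()
            if p.name != "self" and p.kind != p.VAR_KEYWORD
        )

    def get_params(self):
        return {name: getattr(self, name) for name in self._get_param_names()}

    def set_params(self, **params):
        valid = set(self._get_param_names())
        for key, value in params.items():
            if key not in valid:
                raise ValueError(f"Invalid parameter {key!r} for {type(self).__name__}")
            setattr(self, key, value)
        return self


class ToyGLM(Base):
    # Hypothetical estimator: hyperparameters are set as attributes in __init__.
    def __init__(self, alpha=1.0, solver="lbfgs"):
        self.alpha = alpha
        self.solver = solver
```

This is essentially all that `sklearn.model_selection` tools require from an estimator's parameter interface, which is the sense in which maintaining it ourselves is "light work".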

ahwillia commented 1 year ago

+1 to add sklearn as a dependency.

BalzaniEdoardo commented 1 year ago

Some additional thoughts: `score` in scikit-learn has to return a scalar; this may create problems if we want to regularize the model (basically mandatory with real data) and fit the population jointly. Here is why:

1) the amount of regularization needed for a neuron with shallow tuning will be very different from that needed for a strongly tuned neuron;

2) cross-validation should learn a regularization hyperparameter for each neuron. This can be done in parallel by grid-search cross-validation: fit jointly and return a score for each neuron instead of a single scalar, then for each neuron pick the regularizer with the best CV score. This works because the neurons are assumed independent conditional on the past spike history, so we don't need to grid-search over all combinations of the parameters;

3) the `score` method in point (2) violates the scikit-learn API and is not compatible with the `sklearn.model_selection` module, so we would need to code the cross-validation from scratch.
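The factorized selection in point (2) can be sketched in a few lines: given a matrix of per-neuron CV scores (which is exactly what a scalar-returning `score` cannot provide), picking the best regularizer per neuron is a single argmax. The grid values and scores below are made-up illustration data.

```python
import numpy as np


def select_regularizer_per_neuron(scores):
    """Pick the best regularizer independently for each neuron.

    scores : array of shape (n_regularizers, n_neurons) of CV scores.
    Returns the index of the best regularizer for each neuron. Factorizing
    the search this way is valid because neurons are assumed conditionally
    independent, so no joint grid over all combinations is needed.
    """
    return np.argmax(scores, axis=0)


grid = np.array([0.01, 0.1, 1.0])  # hypothetical regularization strengths
cv_scores = np.array(
    [
        [0.2, 0.9],  # one row of per-neuron scores for each strength
        [0.5, 0.7],
        [0.8, 0.1],
    ]
)
best = grid[select_regularizer_per_neuron(cv_scores)]
# neuron 0 gets strength 1.0, neuron 1 gets strength 0.01
```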

In my experience, calling scikit-learn's `GridSearchCV` one neuron at a time on the GPU was still far more efficient than fitting the data with scikit-learn's `PoissonRegressor` or `scipy.optimize.minimize`.

The difference is massive: with V1 recordings of about 10 minutes, I was able to fit a 5-fold CV Ridge-GLM over the whole population (about 70 neurons) in 2 hours on the GPU; after 2 days, `PoissonRegressor` hadn't finished yet.

Fitting jointly will likely be even faster, but the trade-off is compatibility with scikit-learn model selection and with any pipeline involving a call to `score`.

One way to go could be to implement a `fit_population` and a `score_population`, as well as a K-fold grid-search CV that relies on those methods, as a fast option, possibly fitting neurons in batches (the full population may still be too large to fit on any GPU), while keeping `fit` and `score` as they are for anyone who wants to rely on scikit-learn pipelines fitting one neuron at a time.
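A rough sketch of that fast path, assuming the hypothetical `fit_population`/`score_population` methods described above (these names are proposals, not the nemos API): a hand-rolled K-fold grid search that accumulates a per-neuron score matrix instead of a single scalar.

```python
import numpy as np


def kfold_grid_search_population(model, X, y, grid, n_splits=5):
    """K-fold grid search over regularization strengths for a joint population fit.

    Assumes `model` exposes hypothetical set_params / fit_population /
    score_population methods, where score_population returns one score per
    neuron (shape (n_neurons,)) rather than a scalar.

    Returns mean CV scores of shape (n_regularizers, n_neurons).
    """
    n_samples = X.shape[0]
    folds = np.array_split(np.arange(n_samples), n_splits)
    scores = np.zeros((len(grid), y.shape[1]))
    for i, strength in enumerate(grid):
        model.set_params(regularizer_strength=strength)
        for test_idx in folds:
            train_idx = np.setdiff1d(np.arange(n_samples), test_idx)
            model.fit_population(X[train_idx], y[train_idx])
            scores[i] += model.score_population(X[test_idx], y[test_idx])
    return scores / n_splits
```

The per-neuron best regularizer then falls out of an argmax over the first axis, and batching over neurons could be added inside `fit_population` without changing this outer loop.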

BalzaniEdoardo commented 10 months ago

I am less sure we need the scikit-learn dependency: `BaseEstimator` is not a generic base class; it has a lot of methods for validating the input, and other methods that are scikit-learn-specific, and we make no use of them yet. Overall, for the 3 methods that we need, we would end up with machinery that we either don't use or that is not aligned with our code architecture.

Scikit-learn BaseEstimator Methods

Here is a list of methods from the scikit-learn `BaseEstimator` class, along with a brief description of each, noting whether we need them in `nemos.Base` [Needed], whether they may be useful [Maybe Useful], or whether we do not want them [Unwanted]:

Note on Tag in Scikit-learn

In scikit-learn, a tag is a simple way to provide meta-information about an estimator. Tags let developers specify certain properties of the estimator, such as whether it requires the target data `y` during fitting, supports multi-output, or works with sparse matrices. Tags are used internally by scikit-learn to handle different estimators in a generic manner. They are exposed through the `_get_tags()` method and can be overridden in custom estimators to provide specific behavior.

BalzaniEdoardo commented 10 months ago

This is an example of what `_get_tags` returns on `PoissonRegressor` (here `lin` stands for `sklearn.linear_model`):

```python
from sklearn import linear_model as lin

model = lin.PoissonRegressor()
model._get_tags()
```

```
Out[6]:
{'array_api_support': False,
 'non_deterministic': False,
 'requires_positive_X': False,
 'requires_positive_y': True,
 'X_types': ['2darray'],
 'poor_score': False,
 'no_validation': False,
 'multioutput': False,
 'allow_nan': False,
 'stateless': False,
 'multilabel': False,
 '_skip_test': False,
 '_xfail_checks': False,
 'multioutput_only': False,
 'binary_only': False,
 'requires_fit': True,
 'preserves_dtype': [numpy.float64],
 'requires_y': True,
 'pairwise': False}
```