fabian-sp / GGLasso

A Python package for General Graphical Lasso computation
MIT License
30 stars 14 forks source link

Compatibility with sklearn API #28

Closed papachristoumarios closed 2 years ago

papachristoumarios commented 3 years ago

I have noticed that in the project's source code and documentation methods that are present in similar sklearn classes such as GraphicalLasso have methods (e.g. the fit method) which can be found in the current project but under different names and specifications.

For example, here the GGL class can be created separately and then the data can be provided via a fit method (like they do in sklearn) instead of defining a gglasso_problem with the data and then calling the solve method without any arguments.

Thus, I think that the software can be improved by having method naming and functionality that is similar to sklearn (since sklearn is widely used by the data science / ML community).

fabian-sp commented 3 years ago

Thank you for the input! Indeed we had discussed compatibility with sklearn when designing the package. One of the core ideas of GGLasso was to have one entry point (i.e. glasso_problem) for all problem formulations. We derive the problem formulation from the shape of the input data S (and additional arguments such as latent and reg) which is why we need the input already when creating the problem instance right now. Moreover, as regularization parameters are part of the problem and depend on the input (e.g. via scaling), we thought that it also makes sense to have the input data as part of the problem (even though this differs from sklearn). Regarding the second point, do in understand you correctly that you propose to rename the .solve method to .fit? This would be an easy change (even though I personally think that problems are solved rather than fitted).

jameschapman19 commented 2 years ago

I think I'm inclined to agree with @papachristoumarios that this would be a nice change and could even buy you some functionality for free from scikit-learn. Here I hope is a skeleton example!

class FabianGraphicalLasso(EmpiricalCovariance):
    def __init__(self, score_type='AIC'):
        super().__init__()
        self.score_type = score_type

    def fit(self, X: Union[np.ndarray, Iterable[np.ndarray]], y=None):
        self.N = X.shape[0]
        self.S = empirical_covariance(X, assume_centered=self.assume_centered)
        self._derive_problem_formulation()
        self._solve()  # sklearn has conventions for fitted parameters beginning ._fitted_parameter so this would set these
        # so that
        return self

    def score(self, X_test, y=None):
        S_test = empirical_covariance(X_test - self.location_, assume_centered=True)
        if self.score_type == 'AIC':
            res = self._aic(S_test)
        else:
            res = self._ebic(S_test)
        return res

    def _derive_problem_formulation(self):
        pass

    def _solve(self):
        pass

    def _aic(self, S):
        pass

    def _ebic(self, S):
        pass
jameschapman19 commented 2 years ago

Actually thinking about it X could be your data and y could be labels k for the multiple case!

jameschapman19 commented 2 years ago

So the really cool thing about this route is you could substitute most of your gridsearch functionality for the scikit-learn functionality (i.e. could use random search, gridsearch, various kinds of pipelines)