Consider making bhmm core compatible with sklearn

franknoe commented 9 years ago

Following #40:

"could optimally be made compatible with scikit-learn, which seems to be the modern way new machine learning codes are written."

It seems all we need to do to make the HMM estimators compatible with sklearn is to implement the BaseEstimator interface: http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html#sklearn.base.BaseEstimator Perhaps we could also implement the ClassifierMixin behavior (score) in order to do cross-validation.
Also I guess it would make sense to have the fit(X,..) function, but I'm not sure where in sklearn that function is actually defined, i.e. is there any base class defining it, or is it just a duck-typing convention?

This is very easy to do, but I don't see why we would need a dependency on sklearn for that. I would like to avoid dependencies on heavy packages unless we actually use some of their functionality (currently we do use sklearn's Gaussian Mixture Model, but I guess this is a temporary solution).

@kylebeauchamp is a dependency on sklearn necessary to have an effective implementation of the BaseEstimator class (i.e. is issubclass checked explicitly somewhere in the sklearn algorithms), or is duck-typing sufficient?
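
For illustration, here is a minimal sketch of what the duck-typed route could look like without importing sklearn at all. Everything in it is an assumption made for the example: `DuckEstimator` and `ToyHMM` are hypothetical names, `ToyHMM` is a placeholder rather than bhmm's MLHMM, and the `score` it defines is an arbitrary "higher is better" number, not a likelihood. It only mirrors the `get_params` / `set_params` / `fit` / `score` convention; whether every sklearn algorithm accepts such an object is not verified here.

```python
import inspect


class DuckEstimator(object):
    """Hypothetical stand-in for sklearn.base.BaseEstimator (no sklearn import)."""

    @classmethod
    def _param_names(cls):
        # Constructor arguments are treated as the estimator's parameters.
        sig = inspect.signature(cls.__init__)
        return sorted(name for name, p in sig.parameters.items()
                      if name != "self" and p.kind == p.POSITIONAL_OR_KEYWORD)

    def get_params(self, deep=True):
        # 'deep' is accepted for signature compatibility and ignored here.
        return {name: getattr(self, name) for name in self._param_names()}

    def set_params(self, **params):
        valid = self._param_names()
        for name, value in params.items():
            if name not in valid:
                raise ValueError("Invalid parameter %r" % name)
            setattr(self, name, value)
        return self


class ToyHMM(DuckEstimator):
    """Toy placeholder, not bhmm's MLHMM: it only stores the mean of the data."""

    def __init__(self, nstates=2, maxit=100):
        self.nstates = nstates
        self.maxit = maxit

    def fit(self, X, y=None):
        # X is a list of trajectories; flatten and store the overall mean.
        flat = [x for traj in X for x in traj]
        self.mean_ = sum(flat) / float(len(flat))
        return self

    def score(self, X, y=None):
        # Higher is better: negative mean squared deviation from the fitted mean.
        flat = [x for traj in X for x in traj]
        return -sum((x - self.mean_) ** 2 for x in flat) / float(len(flat))


hmm = ToyHMM(nstates=3).fit([[0.0, 1.0], [2.0, 3.0]])
copy = ToyHMM(**hmm.get_params())  # rebuild an unfitted copy from the parameters
```

The point of the sketch is that the parameters are read from the constructor signature, so an unfitted copy can be rebuilt from `get_params()` alone, which is roughly what a cross-validation loop needs.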

franknoe commented 9 years ago

Small addition: Even if explicit subclassing is necessary in order to use our estimators with sklearn algorithms, it would be simple to do that in another package that uses sklearn. Something like (just a sketch, I haven't tried that code):

from bhmm import MLHMM
from sklearn.base import BaseEstimator

class myMLHMM(MLHMM, BaseEstimator):
    pass

hmm = myMLHMM()
# do some fancy sklearn stuff with hmm object here

That way we could avoid explicit dependency on sklearn in bhmm and still provide all functionalities.
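
A variant of the same idea, again only a sketch: the wrapper package could treat sklearn as optional and fall back to a plain base class when it is not installed. `ToyMLHMM` and `SklearnMLHMM` are hypothetical names, and the `nstates` constructor argument is an assumption for the example; the real MLHMM constructor may look different.

```python
try:
    from sklearn.base import BaseEstimator  # optional dependency
except ImportError:
    BaseEstimator = object  # fall back to a plain base class


class ToyMLHMM(object):
    """Hypothetical stand-in for bhmm's MLHMM; the real constructor may differ."""

    def __init__(self, nstates=2):
        self.nstates = nstates


class SklearnMLHMM(ToyMLHMM, BaseEstimator):
    # Inherits the estimation logic from ToyMLHMM and, if sklearn is
    # installed, get_params/set_params from BaseEstimator.
    pass


hmm = SklearnMLHMM(nstates=3)
has_sklearn_api = hasattr(hmm, "get_params")  # True only when sklearn was importable
```

With this layout the core class never imports sklearn, and the sklearn-facing methods appear only in environments where sklearn is available.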

marscher commented 9 years ago

The current release (0.3.0) defines scikit-learn as a dependency (both in setup.py and in the conda recipe); however, it is never used in the code. Is this intended?

franknoe commented 9 years ago

I think we use sklearn's Gaussian Mixture Model estimator for the initialization of Gaussian HMMs. This was meant to be a temporary solution. I think we can in principle make bhmm completely sklearn-compatible without any dependency on sklearn, as I have started to do for pyEMMA.

Prof. Dr. Frank Noe Head of Computational Molecular Biology group Freie Universitaet Berlin

Phone: (+49) (0)30 838 75354 Web: research.franknoe.de

Mail: Arnimallee 6, 14195 Berlin, Germany

franknoe commented 9 years ago

Addition: sklearn is used in lines 37-39 of this file: https://github.com/bhmm/bhmm/blob/master/bhmm/init/gaussian.py I realize that pyemma would inherit the sklearn dependency if it depends on bhmm. Perhaps we could just make a copy of the sklearn GMM estimation for now, which should be easy to isolate? sklearn is BSD-licensed, so there should be no problem with that.

marscher commented 9 years ago

Sounds reasonable, if it is easy to extract.

franknoe commented 9 years ago

OK, please have a look

marscher commented 9 years ago

It also uses KMeans for initializing the means during EM, so we would need to extract that too. Another downside of extracting is that we would not easily receive updates/bugfixes for that code. For those two reasons I would advise against extracting.

franknoe commented 9 years ago

I'll have a look. We certainly don't need k-means initialization at the moment for our bhmm purposes. I'd be happy to depend on packages that we take significant advantage of, but this is not the case at the moment.

I definitely want to avoid unnecessary dependencies for pyemma. There's no headache for the anaconda install, but with pip there are tons of possible problems and every dependency adds to them.

franknoe commented 9 years ago

We can instead include Moritz' code. It's faster anyway. In principle I think the initialization step is sufficient.
