biocore / BIRDMAn

Bayesian Inferential Regression for Differential Microbiome Analysis
BSD 3-Clause "New" or "Revised" License

Consider sklearn way of fitting #10

Open gibsramen opened 3 years ago

gibsramen commented 3 years ago

Brought up by @gwarmstrong

In sklearn, the data is passed to the fit method rather than to the model constructor. This would be more intuitive for users familiar with sklearn, but could require some rethinking of how we keep track of feature names, sample names, etc.
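A rough sketch of the contrast, purely for illustration - the class and method names below are hypothetical stand-ins, not necessarily BIRDMAn's exact API:

```python
# Illustrative only -- class/method names here are hypothetical.

# Current BIRDMAn-style pattern: data is bound in the constructor,
# and a separately named method triggers the sampling.
model = NegativeBinomial(table, formula="diet", metadata=metadata)
model.compile_model()
results = model.fit_model()

# sklearn-style pattern: the constructor takes only hyperparameters,
# and the data enters through fit().
model = NegativeBinomial(formula="diet")
results = model.fit(table, metadata)
```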

mortonjt commented 3 years ago

I don't think there is any necessity to be married to the sklearn architecture; more recent ML frameworks are moving away from that paradigm (PyTorch in particular). The sklearn approach can also be memory hungry if we keep copying data structures around.

For all intents and purposes, if we have a means to compile models, fit them in parallel, perform cross-validation, and make predictions, that should be OK.

gwarmstrong commented 3 years ago

The primary concern was that fit is currently an attribute in BIRDMAn, whereas coming from sklearn I expect fit to be a method that fits the model; the method that actually does the fitting is named differently in BIRDMAn right now.

Also, I do not think PyTorch's interface, or even pytorch-lightning's, can be considered "moving away" from the sklearn interface; it is a different interface designed specifically around fitting deep learning models. And it is certainly possible to write adapters from PyTorch Modules to the sklearn Estimator/Predictor interface. See skorch.
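For instance, here is a rough sketch of that adapter pattern with skorch - the module below is a toy, not anything BIRDMAn-specific:

```python
import torch.nn as nn
import torch.nn.functional as F
from skorch import NeuralNetClassifier

class TwoLayer(nn.Module):
    """Toy classifier: 20 features in, 2 classes out."""
    def __init__(self, num_units=10):
        super().__init__()
        self.hidden = nn.Linear(20, num_units)
        self.out = nn.Linear(num_units, 2)

    def forward(self, X):
        X = F.relu(self.hidden(X))
        return F.softmax(self.out(X), dim=-1)

# The wrapper exposes sklearn's Estimator contract on top of the Module.
net = NeuralNetClassifier(TwoLayer, max_epochs=10, lr=0.1)
# net.fit(X, y), net.predict(X), and net.get_params() now behave like any
# sklearn estimator, so GridSearchCV, Pipeline, etc. work on a torch model.
```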

One of the huge benefits of sticking to the Estimator/Predictor interface, or at least having an adapter for it, is that you get a lot of scikit-learn utilities for free: grid searching hyperparameter spaces with GridSearchCV, clean integration of preprocessing steps with Pipelines, and an interface that is instantly familiar to the huge number of people who already use sklearn, as sketched below.
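A minimal sketch of why the contract pays off (the RidgeLike estimator is a toy, not BIRDMAn's API): any object exposing `__init__(hyperparams)`, `fit(X, y)`, and `predict(X)` plugs straight into sklearn's model-selection and pipeline machinery.

```python
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

class RidgeLike(BaseEstimator, RegressorMixin):
    def __init__(self, alpha=1.0):
        self.alpha = alpha  # hyperparameters live on the instance

    def fit(self, X, y):
        X = np.asarray(X)
        # closed-form ridge solution; data enters here, not in __init__
        self.coef_ = np.linalg.solve(
            X.T @ X + self.alpha * np.eye(X.shape[1]), X.T @ y
        )
        return self

    def predict(self, X):
        return np.asarray(X) @ self.coef_

pipe = Pipeline([("scale", StandardScaler()), ("model", RidgeLike())])
search = GridSearchCV(pipe, {"model__alpha": [0.1, 1.0, 10.0]}, cv=3)
# search.fit(X, y) now handles scaling, CV splits, and the grid for free.
```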

mortonjt commented 3 years ago

Got it. I'm pretty ambivalent about the design choice here; I think the current implementation is certainly off to a good start. But I do recommend testing this paradigm on a fairly large dataset (i.e. >10k samples) before finalizing these design decisions. I'm also not sure how much merit there is in compatibility with sklearn cross-validation, given how expensive these calculations are to run, particularly with HMC (an average run takes several hours on a small dataset with ~100 samples).

gibsramen commented 3 years ago

I agree that the wrinkle of the Bayesian implementation makes taking advantage of the traditional sklearn suite (potentially) somewhat of a pipe dream. That said, I think there's a good justification for switching insofar as it makes the API feel more familiar: matching the sklearn workflow - even if the underlying behavior doesn't change - could make it easier for people to get up and running, without being much of a development hassle.

mortonjt commented 3 years ago

One thing to note here is that having a single round of cross-validation is highly advantageous for posterior predictive checks. The arviz loo and posterior predictive check utilities appear to address this, but it is not obvious how to specify which samples are used to fit the model and which are used for validation.
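One possible way to make that split explicit - not an existing BIRDMAn feature, just a sketch assuming `table` is a biom.Table - is to partition sample IDs up front, fit on the training partition, and reserve the held-out samples for the posterior predictive check:

```python
from sklearn.model_selection import train_test_split

# Partition sample IDs into fit and validation sets.
train_ids, test_ids = train_test_split(
    table.ids(axis="sample"), test_size=0.2, random_state=42
)
train_table = table.filter(train_ids, axis="sample", inplace=False)
test_table = table.filter(test_ids, axis="sample", inplace=False)
# Fit the model on train_table only; compare posterior predictive draws
# against test_table to get an honest out-of-sample check.
```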