VowpalWabbit / vowpal_wabbit

Vowpal Wabbit is a machine learning system which pushes the frontier of machine learning with techniques such as online, hashing, allreduce, reductions, learning2search, active, and interactive learning.
https://vowpalwabbit.org

Problem in Implementing VWRegressor model and Other VWClassifer model #2300

Open RituRajSingh878 opened 4 years ago

RituRajSingh878 commented 4 years ago

I am trying to integrate different models of sklearn with VWRegressor and VWClassifier, but I am a little confused about VW and how it works.

Let's assume I am trying to implement/integrate a multiclass classifier in VWClassifier; then how should I proceed with VW? https://github.com/VowpalWabbit/vowpal_wabbit/blob/3c0f0ffb84c47701521abba6450a8cf7b39fba83/python/vowpalwabbit/sklearn_vw.py#L505

In sklearn, they have made different classes for different types of models, so should I proceed like them, or do I have to add every classifier model to VWClassifier and every regressor model to VWRegressor? (If yes, I don't have any idea how I should proceed.)

sklearn models- https://github.com/scikit-learn/scikit-learn/blob/b194674c42d54b26137a456c510c5fdba1ba23e0/sklearn/linear_model/_logistic.py#L1191

I have implemented a Regressor model (similar to the Classifier model) in VWRegressor:

index 6c7468b2..5d62b078 100644
--- a/python/vowpalwabbit/sklearn_vw.py
+++ b/python/vowpalwabbit/sklearn_vw.py
@@ -552,11 +552,44 @@ class VWClassifier(SparseCoefMixin, ThresholdingLinearClassifierMixin, VW):

         return VW.predict(self, X=X)

+class LinearRegressorMixin(LinearClassifierMixin):

-class VWRegressor(VW, RegressorMixin):
+    """Mixin for linear Regression.
+
+    Handles prediction for sparse and dense X.
+    """
+
+    def __init__(self, **params):
+
+        super(LinearRegressorMixin, self).__init__(**params)
+
+    def predict(self, X):
+
+        return self.decision_function(X)
+
+class VWRegressor(SparseCoefMixin, LinearRegressorMixin, VW):
     """Vowpal Wabbit Regressor model """

-    pass
+    def __init__(self, **params):
+        super(VWRegressor, self).__init__(**params)
+
+    def predict(self, X):
+        return LinearRegressorMixin.predict(self, X=X)
+
+    def decision_function(self, X):
+        """
+        Parameters
+        ----------
+        X : {array-like, sparse matrix}, shape = (n_samples, n_features)
+            Samples.
+
+        Returns
+        -------
+        array
+
+        """
+
+        return VW.predict(self, X=X)
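To make the delegation in the diff concrete, here is a pure-Python sketch of the pattern (predict() defers to decision_function(), which wraps the base class's prediction); the Toy* classes are hypothetical stand-ins, not the real sklearn_vw classes:

```python
# Minimal sketch of the mixin delegation above: predict() defers to
# decision_function(), which is where the actual model call happens.
# These classes are illustrative stand-ins, not the real sklearn_vw classes.

class ToyBase:
    """Stands in for the VW base class; returns raw model outputs."""
    def raw_predict(self, X):
        # pretend the fitted model is y = 2 * x for each sample
        return [2 * x for x in X]

class ToyRegressorMixin:
    """Stands in for LinearRegressorMixin: predict is just decision_function."""
    def predict(self, X):
        return self.decision_function(X)

class ToyRegressor(ToyRegressorMixin, ToyBase):
    """Stands in for VWRegressor: decision_function wraps the base prediction."""
    def decision_function(self, X):
        return self.raw_predict(X)

model = ToyRegressor()
print(model.predict([1, 2, 3]))  # [2, 4, 6]
```

The point of the indirection is that sklearn's own mixins expect decision_function to exist, so routing predict through it keeps the class compatible with the rest of the sklearn machinery.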

I will open a PR if it looks good to you.

Another question: I wish to implement the LogisticRegressor model; how should I proceed in VW (a general idea or the format that VW follows)?

I am trying to understand VW.

jackgerrits commented 4 years ago

@gramhagen are you able to help answer this? Does a logistic regressor make sense given VW is pretty much modeling with a linear function?

RituRajSingh878 commented 4 years ago

I think logistic will not make sense here, then. But let's assume I wish to add linear models like Lasso, ridge, or linear regression. I want to extend VWRegressor and VWClassifier with different sklearn models. Also, VWClassifier doesn't support multiclass classification, so if I want to extend it, how should I proceed?

gramhagen commented 4 years ago

@RituRajSingh878 you can change the loss function, regularization and a few other parameters used with the current implementation of VWRegressor. See options here: https://github.com/VowpalWabbit/vowpal_wabbit/blob/e3f4038dbfb890058ff128d69a44ffec7c251236/python/vowpalwabbit/sklearn_vw.py#L186

However, the sklearn implementation is just a wrapper around the pyvw binding, which in turn is just a wrapper around the core VW functionality. So we are limited to what the core library is capable of. To introduce new functionality would require changes similar to what was done in PR #1001

I'm not sure I follow the differences between what you implemented and what we have now. Do you have an example use for the VWRegressor? Either something you can do with the new code, or something you can't do with the old one (besides changing the underlying algorithm to Lasso, ridge, etc.), would help me understand.

Lastly, supporting multiclass in the VWClassifier would be great! The base pyvw class does support this, so it shouldn't be too bad to add sklearn api support as well. Take a look at this test case: https://github.com/VowpalWabbit/vowpal_wabbit/blob/e3f4038dbfb890058ff128d69a44ffec7c251236/python/tests/test_pyvw.py#L73 and the sklearn implementation for a multiclass estimator: https://github.com/scikit-learn/scikit-learn/blob/b194674c42d54b26137a456c510c5fdba1ba23e0/sklearn/multiclass.py#L132

I'm thinking it might be easier to add a new VWMultiClassifier class for this. Happy to help you through this if you want to take a stab at it.

RituRajSingh878 commented 4 years ago

> @RituRajSingh878 you can change the loss function, regularization and a few other parameters used with the current implementation of VWRegressor. See options here:
>
> https://github.com/VowpalWabbit/vowpal_wabbit/blob/e3f4038dbfb890058ff128d69a44ffec7c251236/python/vowpalwabbit/sklearn_vw.py#L186
>
> However, the sklearn implementation is just a wrapper around the pyvw binding, which in turn is just a wrapper around the core VW functionality. So we are limited to what the core library is capable of. To introduce new functionality would require changes similar to what was done in PR #1001

I will have to look into this more so that I can understand it clearly; after that I will try to work on it.

> I'm not sure I follow the differences between what you implemented and what we have now. Do you have an example use for the VWRegressor? Either something you can do with the new code, or something you can't do with the old one (besides changing the underlying algorithm to Lasso, ridge, etc.), would help me understand.

Initially I misunderstood how VWRegressor works, but I get it now.

> I'm thinking it might be easier to add a new VWMultiClassifier class for this. Happy to help you through this if you want to take a stab at it.

I will try to add VWMultiClassifier and will open a PR for the same.

RituRajSingh878 commented 4 years ago

I am trying to implement it like this:

+def _fit_binary(estimator, X, y, classes=None):
+    """Fit a single binary estimator."""
+    unique_y = np.unique(y)
+    if len(unique_y) == 1:
+        if classes is not None:
+            if y[0] == -1:
+                c = 0
+            else:
+                c = y[0]
+            warnings.warn("Label %s is present in all training examples." %
+                          str(classes[c]))
+        estimator = _ConstantPredictor().fit(X, unique_y)
+    else:
+        estimator = clone(estimator)
+        estimator.fit(X, y)
+    return estimator
+
+
+class VWMultiClassifier(MultiOutputMixin, ClassifierMixin, MetaEstimatorMixin, VW):
+    """Vowpal Wabbit MultiClassifier model """
+
+    def __init__(self, estimator, n_jobs=None):
+        self.estimator = estimator
+        self.n_jobs = n_jobs
+
+    def fit(self, X, y):
+        """Fit underlying estimators.
+        Parameters
+        ----------
+        X : (sparse) array-like of shape (n_samples, n_features)
+            Data.
+        y : (sparse) array-like of shape (n_samples,) or (n_samples, n_classes)
+            Multi-class targets. An indicator matrix turns on multilabel
+            classification.
+        Returns
+        -------
+        self
+        """
+        self.label_binarizer_ = LabelBinarizer(sparse_output=True)
+        Y = self.label_binarizer_.fit_transform(y)
+        Y = Y.tocsc()
+        self.classes_ = self.label_binarizer_.classes_
+        columns = (col.toarray().ravel() for col in Y.T)
+        # In cases where individual estimators are very fast to train,
+        # setting n_jobs > 1 can result in slower performance due to the
+        # overhead of spawning threads. See joblib issue #112.
+        self.estimators_ = Parallel(n_jobs=self.n_jobs)(delayed(_fit_binary)(
+            self.estimator, X, column, classes=[
+                "not %s" % self.label_binarizer_.classes_[i],
+                self.label_binarizer_.classes_[i]])
+            for i, column in enumerate(columns))
+
+        return self
+

But before opening a PR, I want to check with you that I am heading in the right direction. Thanks

RituRajSingh878 commented 4 years ago

@gramhagen any suggestions on the above-commented code?

Also, I am wondering whether we have score functions for reporting different kinds of scores after fitting the training data and predicting on the test data, like r2_score, f1_score, confusion matrix, and others.

gramhagen commented 4 years ago

@RituRajSingh878 thanks for starting this. I think we should leverage the internal multiclass support that vw offers using learners like oaa. In fact, the existing VW class does support multiclass output:

>>> from vowpalwabbit.sklearn_vw import VW
>>> vw = VW(oaa=3, probabilities=True, quiet=True)
>>> vw.fit(data)
>>> vw.predict(data[:1,:])
array([[0.28497157, 0.32211345, 0.39291507]])

So I think the key will just be to enforce the use of probabilities and one of the multiclass learners (oaa as the default). Then we need to match the multiclass sklearn API, so it would be helpful to find a test case on the sklearn side which we can use to ensure we have the same behavior.
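As a side note on the predict path: once probabilities=True yields one probability per class per row (as in the array above), a sklearn-style predict() reduces to an argmax over the classes. A minimal numpy sketch (the class labels here are assumed for illustration, not taken from VW):

```python
import numpy as np

# oaa with probabilities=True yields one probability per class per sample.
# A sklearn-style predict() would return the argmax class for each row.
proba = np.array([[0.28497157, 0.32211345, 0.39291507],
                  [0.70, 0.20, 0.10]])
classes = np.array([1, 2, 3])  # assuming 1-based multiclass labels

predictions = classes[np.argmax(proba, axis=1)]
print(predictions)  # [3 1]
```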

Does that help?

RituRajSingh878 commented 4 years ago

> Then we need to match the multiclass sklearn API, so it would be helpful to find a test case on the sklearn side which we can use to ensure we have the same behavior.

I don't understand this part, mainly "matching the multiclass sklearn API", so can you explain it in more detail? And what about score functions? Do we have any?

gramhagen commented 4 years ago

The goal of the sklearn_vw module is to provide an API that matches what sklearn has, so you can use the same syntax and integrate this with other tools in the sklearn ecosystem. The object inheritance you specified should provide the method support needed. I think we may just need to override the decision_function() method.

One way to test this out would be to use an sklearn test case and replace the estimator with the VWMultiClassifier, e.g.: https://github.com/scikit-learn/scikit-learn/blob/b194674c42d54b26137a456c510c5fdba1ba23e0/sklearn/tests/test_multiclass.py#L146

Although this example handles sparse data, so we will probably need additional work to support that.

Another way is to implement VWMultiClassifier, then instantiate the vw model as well as the sklearn OneVsRestClassifier, and then step through training side by side on the same data, looking at the results of each function to make sure they align (not necessarily getting the same numeric results, but checking that the output types match).

Methods are here: https://scikit-learn.org/stable/modules/generated/sklearn.multiclass.OneVsRestClassifier.html
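A minimal version of the sklearn half of that side-by-side check might look like this (toy data; the VW half is omitted here, since VWMultiClassifier is the part still being written):

```python
# Sketch of the side-by-side check: train sklearn's OneVsRestClassifier on
# toy data and inspect the output shapes/types a VWMultiClassifier would
# need to match.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X = np.array([[0.0], [0.1], [1.0], [1.1], [2.0], [2.1]])
y = np.array([1, 1, 2, 2, 3, 3])

ovr = OneVsRestClassifier(LogisticRegression()).fit(X, y)

pred = ovr.predict(X)          # shape (n_samples,), class labels
proba = ovr.predict_proba(X)   # shape (n_samples, n_classes)
print(pred.shape, proba.shape)
print(ovr.classes_)
```

The VW wrapper would be exercised with the same calls, checking that predict returns a label vector and predict_proba a (n_samples, n_classes) array with classes_ populated.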

I don't think we should have to implement our own score functions as those should get inherited correctly.

RituRajSingh878 commented 4 years ago

@gramhagen

> Another way is to implement VWMultiClassifier, then instantiate the vw model as well as the sklearn OneVsRestClassifier, and then step through training side by side on the same data, looking at the results of each function to make sure they align (not necessarily getting the same numeric results, but checking that the output types match).
>
> Methods are here: https://scikit-learn.org/stable/modules/generated/sklearn.multiclass.OneVsRestClassifier.html

Now, I understand things clearly. I will implement it according to this plan.

> I don't think we should have to implement our own score functions as those should get inherited correctly.

I was talking about support for scores in core VW, like PR #1001 for the score function. I was trying to add a loss function with PR #2313, but I am not able to understand a few things, so it would help if you could assist there.

Thanks

gramhagen commented 4 years ago

Sounds good. For scores like F-score, confusion matrix, etc., we can use the sklearn metrics library. That's the nice part about matching the sklearn API.
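For example, a hypothetical snippet scoring predictions with sklearn.metrics (the labels here are made up):

```python
# Because the estimator follows the sklearn API, sklearn.metrics can score
# its predictions directly; no VW-specific score functions are needed.
from sklearn.metrics import confusion_matrix, f1_score

y_true = [1, 2, 3, 1, 2, 3]
y_pred = [1, 2, 3, 1, 3, 3]  # pretend these came from model.predict(X_test)

print(f1_score(y_true, y_pred, average="macro"))
print(confusion_matrix(y_true, y_pred))
```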

As for adding the Huber loss function, I'm probably not the best one to help out there, sorry. I'll comment on that thread though.

RituRajSingh878 commented 4 years ago

@gramhagen thanks

> Sounds good. For scores like F-score, confusion matrix, etc., we can use the sklearn metrics library. That's the nice part about matching the sklearn API.

I will implement it.

gramhagen commented 4 years ago

fyi @RituRajSingh878 here are a couple of nice resources for developing sklearn estimators http://danielhnyk.cz/creating-your-own-estimator-scikit-learn/ https://ploomber.io/posts/sklearn-custom/

RituRajSingh878 commented 4 years ago

> fyi @RituRajSingh878 here are a couple of nice resources for developing sklearn estimators http://danielhnyk.cz/creating-your-own-estimator-scikit-learn/ https://ploomber.io/posts/sklearn-custom/

@gramhagen thanks

I have opened a draft PR to start this: #2332