Open RituRajSingh878 opened 4 years ago
@gramhagen are you able to help answer this? Does a logistic regressor make sense given VW is pretty much modeling with a linear function?
I think logistic will not make sense here, then. But let's assume I wish to add linear models like Lasso, ridge, and linear regression: what then? I want to extend the VWRegressor and VWClassifier with different sklearn models.
Also, VWClassifier doesn't support multiclass classification, so if I want to extend it, how should I proceed?
@RituRajSingh878 you can change the loss function, regularization and a few other parameters used with the current implementation of VWRegressor. See options here: https://github.com/VowpalWabbit/vowpal_wabbit/blob/e3f4038dbfb890058ff128d69a44ffec7c251236/python/vowpalwabbit/sklearn_vw.py#L186
However, the sklearn implementation is just a wrapper around the pyvw binding, which in turn is just a wrapper around the core VW functionality. So we are limited to what the core library is capable of. To introduce new functionality would require changes similar to what was done in PR #1001
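As a rough illustration of tuning the existing estimators rather than adding new ones, the sketch below maps a few sklearn linear models to keyword arguments for the VW wrapper. The `loss_function`, `l1`, and `l2` names mirror VW's command-line options; the mapping table, the helper name `vw_kwargs_for`, and the default strength are illustrative assumptions, and VW's online updates will not reproduce the sklearn solvers exactly.

```python
# Hypothetical mapping from sklearn model names to VW-style keyword arguments.
# loss_function, l1 and l2 mirror VW's --loss_function, --l1 and --l2 options.
# A value of None marks a regularization strength to be filled in later.
VW_ANALOGUES = {
    "LinearRegression": {"loss_function": "squared"},
    "Lasso": {"loss_function": "squared", "l1": None},
    "Ridge": {"loss_function": "squared", "l2": None},
    "LogisticRegression": {"loss_function": "logistic"},
}


def vw_kwargs_for(model_name, strength=1e-6):
    """Return keyword arguments roughly analogous to `model_name` (illustrative)."""
    template = VW_ANALOGUES[model_name]
    # Fill in the regularization strength wherever the template leaves it open.
    return {key: (strength if value is None else value)
            for key, value in template.items()}


print(vw_kwargs_for("Lasso"))
```

The result could then be passed along the lines of `VWRegressor(**vw_kwargs_for("Ridge"))`, or to `VWClassifier` for the logistic case, assuming the wrapper is installed.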
I'm not sure I follow the differences implemented vs what we have now. Do you have an example use for the VWRegressor? Either what you can do with the new code, or what you can't do with the old one (beside changing the underlying algorithm to Lasso, ridge, etc.) that would help me understand.
Lastly, supporting multiclass in the VWClassifier would be great! The base pyvw class does support this, so it shouldn't be too bad to add sklearn api support as well. Take a look at this test case: https://github.com/VowpalWabbit/vowpal_wabbit/blob/e3f4038dbfb890058ff128d69a44ffec7c251236/python/tests/test_pyvw.py#L73 and the sklearn implementation for a multiclass estimator: https://github.com/scikit-learn/scikit-learn/blob/b194674c42d54b26137a456c510c5fdba1ba23e0/sklearn/multiclass.py#L132
I'm thinking it might be easier to add a new VWMultiClassifier class for this? Happy to help you through this if you want to take a stab at something.
> @RituRajSingh878 you can change the loss function, regularization and a few other parameters used with the current implementation of VWRegressor. See options here:
> However, the sklearn implementation is just a wrapper around the pyvw binding, which in turn is just a wrapper around the core VW functionality. So we are limited to what the core library is capable of. To introduce new functionality would require changes similar to what was done in PR #1001
I will have to look into this further to understand it clearly; after that, I will try to work on it.
> I'm not sure I follow the differences implemented vs what we have now. Do you have an example use for the VWRegressor? Either what you can do with the new code, or what you can't do with the old one (beside changing the underlying algorithm to Lasso, ridge, etc.) that would help me understand.
Initially, I misunderstood the working of VWRegressor but I get it now.
> I'm thinking it might be easier to add a new VWMultiClassifier class for this? Happy to help you through this if you want to take a stab at something.
I will try to add VWMultiClassifier and will open a PR for it.
I am trying to implement it like this:
+# Imports below are assumed; the snippet as posted did not show them.
+import warnings
+
+import numpy as np
+from joblib import Parallel, delayed
+from sklearn.base import ClassifierMixin, MetaEstimatorMixin, MultiOutputMixin, clone
+from sklearn.multiclass import _ConstantPredictor
+from sklearn.preprocessing import LabelBinarizer
+from vowpalwabbit.sklearn_vw import VW
+
+
+def _fit_binary(estimator, X, y, classes=None):
+    """Fit a single binary estimator."""
+    unique_y = np.unique(y)
+    if len(unique_y) == 1:
+        # Only one label is present, so fall back to a constant predictor.
+        if classes is not None:
+            if y[0] == -1:
+                c = 0
+            else:
+                c = y[0]
+            warnings.warn("Label %s is present in all training examples." %
+                          str(classes[c]))
+        estimator = _ConstantPredictor().fit(X, unique_y)
+    else:
+        estimator = clone(estimator)
+        estimator.fit(X, y)
+    return estimator
+
+
+class VWMultiClassifier(MultiOutputMixin, ClassifierMixin, MetaEstimatorMixin, VW):
+    """Vowpal Wabbit MultiClassifier model"""
+
+    def __init__(self, estimator, n_jobs):
+        self.estimator = estimator
+        self.n_jobs = n_jobs
+
+    def fit(self, X, y):
+        """Fit underlying estimators.
+
+        Parameters
+        ----------
+        X : (sparse) array-like of shape (n_samples, n_features)
+            Data.
+        y : (sparse) array-like of shape (n_samples,) or (n_samples, n_classes)
+            Multi-class targets. An indicator matrix turns on multilabel
+            classification.
+
+        Returns
+        -------
+        self
+        """
+        # One binary column per class (one-vs-rest encoding).
+        self.label_binarizer_ = LabelBinarizer(sparse_output=True)
+        Y = self.label_binarizer_.fit_transform(y)
+        Y = Y.tocsc()
+        self.classes_ = self.label_binarizer_.classes_
+        columns = (col.toarray().ravel() for col in Y.T)
+        # In cases where individual estimators are very fast to train, setting
+        # n_jobs > 1 can result in slower performance due to the overhead
+        # of spawning threads. See joblib issue #112.
+        self.estimators_ = Parallel(n_jobs=self.n_jobs)(delayed(_fit_binary)(
+            self.estimator, X, column, classes=[
+                "not %s" % self.label_binarizer_.classes_[i],
+                self.label_binarizer_.classes_[i]])
+            for i, column in enumerate(columns))
+
+        return self
+
But before opening a PR, I want to check with you that I am heading in the right direction. Thanks
@gramhagen any suggestions on the above-commented code?
Also, I am wondering whether we have score functions for reporting different kinds of scores after fitting on the training data and predicting on the test data, like r2_score, f1_score, confusion matrix, and others.
@RituRajSingh878 thanks for starting this. I think we should leverage the internal multiclass support that vw offers using learners like oaa. In fact, the existing VW class already supports multiclass output:
>>> from vowpalwabbit.sklearn_vw import VW
>>> vw = VW(oaa=3, probabilities=True, quiet=True)
>>> vw.fit(data)
>>> vw.predict(data[:1,:])
array([[0.28497157, 0.32211345, 0.39291507]])
So I think the key will be just to enforce the use of probabilities and one of the multiclass learners (with oaa as the default). Then we need to match the multiclass sklearn api, so it would be helpful to find a test case on the sklearn side which we can use to ensure we have the same behavior.
Does that help?
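A minimal sketch of how "enforce probabilities and a multiclass learner" might look, written in plain Python so it does not depend on the wrapper. The helper names are hypothetical and not part of vowpalwabbit; the sketch assumes that oaa expects integer labels 1..K and that the listed option names are passed through as keyword arguments.

```python
# Illustrative subset of VW multiclass reductions that a user might request.
MULTICLASS_LEARNERS = ("oaa", "ect")


def encode_labels(y):
    """Map arbitrary class labels to the 1..K integers assumed for oaa."""
    classes = sorted(set(y))
    index = {label: i + 1 for i, label in enumerate(classes)}
    return classes, [index[label] for label in y]


def multiclass_params(n_classes, **user_params):
    """Force a multiclass learner and probability output (hypothetical helper)."""
    params = dict(user_params)
    # Default to one-against-all unless the caller already chose a learner.
    if not any(name in params for name in MULTICLASS_LEARNERS):
        params["oaa"] = n_classes
    params["probabilities"] = True  # always request per-class probabilities
    return params


def decode_predictions(probabilities, classes):
    """Pick the most probable original class label for each row."""
    return [classes[max(range(len(row)), key=row.__getitem__)]
            for row in probabilities]


classes, encoded = encode_labels(["cat", "dog", "bird", "dog"])
params = multiclass_params(len(classes), quiet=True)
# Probability row copied from the example output in the comment above.
labels = decode_predictions([[0.28497157, 0.32211345, 0.39291507]], classes)
print(encoded, params, labels)
```

A wrapper class could apply `multiclass_params` in its constructor and `decode_predictions` in `predict`, while `predict_proba` would return the probability rows unchanged.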
> Then we need to match the multiclass sklearn api, so it would be helpful to find a test case on the sklearn side which we can use to ensure we have the same behavior.
I don't understand this part, mainly "matching the multiclass sklearn api". Can you explain it in more detail? And what about score functions? Do we have any?
The goal of the sklearn_vw module is to provide an API that matches what sklearn has, so you can use the same syntax and integrate this with other tools in the sklearn ecosystem. The object inheritance you specified should provide the method support needed. I think we may just need to override the decision_function() method?
One way to test this out would be to take an sklearn test case and swap in the VWMultiClassifier, e.g.: https://github.com/scikit-learn/scikit-learn/blob/b194674c42d54b26137a456c510c5fdba1ba23e0/sklearn/tests/test_multiclass.py#L146
However, this example handles sparse data, so we will probably need additional work to support that.
Another way is to implement VWMultiClassifier, then instantiate the vw model as well as the sklearn OneVsRestClassifier, and step through them side by side, training on the same data and looking at the results of each function to make sure they align (not necessarily the same numeric results, but matching output types).
Methods are here: https://scikit-learn.org/stable/modules/generated/sklearn.multiclass.OneVsRestClassifier.html
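The side-by-side check could be sketched roughly as below. The helper names are hypothetical, and the `_Stub` class only stands in for the two real estimators (the sklearn OneVsRestClassifier and the proposed VWMultiClassifier) so that the snippet runs on its own; it compares output type and shape, not numeric values.

```python
def output_signature(value):
    """Summarize a result as (type name, shape) without comparing numbers."""
    shape = getattr(value, "shape", None)
    if shape is None and hasattr(value, "__len__"):
        shape = (len(value),)
    return type(value).__name__, shape


def compare_estimators(reference, candidate, X, y, methods=("predict",)):
    """Fit both estimators on the same data and compare output signatures."""
    reference.fit(X, y)
    candidate.fit(X, y)
    report = {}
    for name in methods:
        ref_sig = output_signature(getattr(reference, name)(X))
        cand_sig = output_signature(getattr(candidate, name)(X))
        report[name] = (ref_sig == cand_sig, ref_sig, cand_sig)
    return report


class _Stub:
    """Stand-in estimator used only to demonstrate the helper."""

    def __init__(self, label):
        self.label = label

    def fit(self, X, y):
        return self

    def predict(self, X):
        return [self.label for _ in X]


report = compare_estimators(_Stub(0), _Stub(1), [[0.0], [1.0]], [0, 1])
print(report)
```

With real estimators, `methods` could be extended to names such as `predict_proba` or `decision_function`, provided both objects implement them.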
I don't think we should have to implement our own score functions as those should get inherited correctly.
@gramhagen
> Another way is to implement VWMultiClassifier then instantiate the vw model as well as the sklearn OneVsRestClassifier and then step through side by side training on the same data and looking at the results of each function to make sure they align (not necessarily get the same numeric results, but the output type matches).
> Methods are here: https://scikit-learn.org/stable/modules/generated/sklearn.multiclass.OneVsRestClassifier.html
Now, I understand things clearly. I will implement it according to this plan.
> I don't think we should have to implement our own score functions as those should get inherited correctly.
I was talking about supporting scores in core VW, like PR #1001 did for the score function. I was trying to add a loss function in PR #2313, but there are a few things I don't understand, so I would appreciate your help there.
Thanks
Sounds good. For scores like f-score, confusion matrix, etc. we can use the sklearn metrics library for that. That's the nice part about matching the sklearn api.
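For instance, once an estimator returns class labels from `predict`, the standard sklearn metrics can be applied directly. The sketch below uses made-up label lists in place of real model output; `y_pred` stands in for whatever a fitted classifier would return.

```python
from sklearn.metrics import confusion_matrix, f1_score

# Made-up labels for illustration; y_pred stands in for estimator.predict(X_test).
y_true = [1, 2, 3, 3, 2, 1]
y_pred = [1, 2, 3, 2, 2, 1]

# Rows are true classes, columns are predicted classes, in the order of `labels`.
cm = confusion_matrix(y_true, y_pred, labels=[1, 2, 3])

# Macro averaging computes F1 per class and takes the unweighted mean.
macro_f1 = f1_score(y_true, y_pred, average="macro")

print(cm)
print(round(macro_f1, 4))
```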
As for adding the Huber loss function, I'm probably not the best one to help out there, sorry. I'll comment on that thread though.
@gramhagen thanks
> Sounds good. For scores like f-score, confusion matrix, etc. we can use the sklearn metrics library for that. That's the nice part about matching the sklearn api.
I will implement it.
fyi @RituRajSingh878 here are a couple of nice resources for developing sklearn estimators http://danielhnyk.cz/creating-your-own-estimator-scikit-learn/ https://ploomber.io/posts/sklearn-custom/
@gramhagen thanks
I have opened a draft PR to start this: #2332
I am trying to integrate different sklearn models with VWRegressor and VWClassifier, but I am a little confused about VW and how it works. Let's assume I am trying to implement/integrate a multiclass classifier in VWClassifier: how should I proceed with VW? https://github.com/VowpalWabbit/vowpal_wabbit/blob/3c0f0ffb84c47701521abba6450a8cf7b39fba83/python/vowpalwabbit/sklearn_vw.py#L505
In sklearn, they have made different classes for different types of models, so should I proceed like them, or do I have to add every classifier model to VWClassifier and every regressor model to VWRegressor? (If so, I don't have any idea how I should proceed.) sklearn models:
https://github.com/scikit-learn/scikit-learn/blob/b194674c42d54b26137a456c510c5fdba1ba23e0/sklearn/linear_model/_logistic.py#L1191
I have implemented a Regressor model (similar to the Classifier model) in VWRegressor-
I will open a PR if it looks good to you.
Another question: I wish to implement the LogisticRegressor model, so how should I proceed in VW (a general idea or format that VW follows)? I am trying to understand VW.