VowpalWabbit / vowpal_wabbit

Vowpal Wabbit is a machine learning system which pushes the frontier of machine learning with techniques such as online, hashing, allreduce, reductions, learning2search, active, and interactive learning.
https://vowpalwabbit.org

Sklearn adapter function `tovw` does not support unsigned integers in features #4609

Open jackgerrits opened 1 year ago

jackgerrits commented 1 year ago

Mitigation

As a workaround, pass signed integer types rather than unsigned integer types to the sklearn adapter functions.
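A minimal sketch of that workaround, assuming the features arrive as a pandas DataFrame. The helper name `to_signed` is hypothetical and not part of vowpalwabbit; it casts every unsigned integer column to `int64` before the data reaches the adapter. Note that `uint64` values above the `int64` maximum would not survive this cast.

```python
import numpy as np
import pandas as pd


def to_signed(X: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of X with unsigned integer columns cast to int64.

    Hypothetical helper for illustration; assumes unique column names.
    """
    unsigned_cols = [
        col for col in X.columns
        if pd.api.types.is_unsigned_integer_dtype(X[col])
    ]
    # astype with a per-column mapping leaves all other columns untouched
    return X.astype({col: np.int64 for col in unsigned_cols})


X = pd.DataFrame({'a': [1]}, dtype='uint32')
X_signed = to_signed(X)
```

The result can then be passed on in place of the original frame, for example `VWRegressor().fit(to_signed(X), y)`.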

Details

The `tovw` function uses scikit-learn's `dump_svmlight_file` to convert the input into a format from which VW text examples are easy to construct.

`dump_svmlight_file` does not accept unsigned integer input; it requires signed integers because of the pyx (Cython) code it uses internally in sklearn.

Fails:

from vowpalwabbit.sklearn import VWRegressor
import numpy as np
import pandas as pd

X = pd.DataFrame({'a': [1]}, dtype='uint32')
y = pd.Series(np.zeros(1))

VWRegressor().fit(X, y)

Succeeds:

from vowpalwabbit.sklearn import VWRegressor
import numpy as np
import pandas as pd

X = pd.DataFrame({'a': [1]}, dtype='int32')  # signed dtype: the only change from the failing example
y = pd.Series(np.zeros(1))

VWRegressor().fit(X, y)

The same unsigned input works when passed directly to scikit-learn:

from sklearn.linear_model import LinearRegression
import numpy as np
import pandas as pd

X = pd.DataFrame({'a': [1]}, dtype='uint32')
y = pd.Series(np.zeros(1))

LinearRegression().fit(X, y)

One way to fix this is to stop using `dump_svmlight_file`. It is currently used as a convenient way to convert the dataframe to VW text format.
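To illustrate what a conversion without `dump_svmlight_file` could look like, here is a rough sketch that builds VW text lines directly from a DataFrame. The function name `frame_to_vw_lines` is hypothetical, and this is not the adapter's actual implementation: it assumes dense numeric features, uses column names as feature names, skips zero values (as a sparse format would), and ignores sample weights and namespaces.

```python
import numpy as np
import pandas as pd


def frame_to_vw_lines(X: pd.DataFrame, y) -> list:
    """Build one VW text example ("label | name:value ...") per row.

    Hypothetical sketch, not the vowpalwabbit adapter's code.
    """
    names = [str(col) for col in X.columns]
    for name in names:
        # space, ':' and '|' are separators in the VW text format
        if any(ch in name for ch in " :|"):
            raise ValueError(f"invalid VW feature name: {name!r}")

    # tolist() yields plain Python scalars, so unsigned dtypes need no special care
    rows = X.to_numpy().tolist()
    lines = []
    for label, row in zip(y, rows):
        feats = " ".join(f"{n}:{v}" for n, v in zip(names, row) if v != 0)
        lines.append(f"{float(label)} | {feats}")
    return lines


X = pd.DataFrame({'a': [1]}, dtype='uint32')
y = pd.Series(np.zeros(1))
lines = frame_to_vw_lines(X, y)
```

Because the values are converted to plain Python numbers before formatting, the signedness of the input dtype no longer matters in this approach.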

mahimairaja commented 1 year ago

Is this issue still open?

jackgerrits commented 1 year ago

Yep! Feel free to tackle it if you'd like

manthanindane commented 1 year ago

Is this issue open? Can I work on it?