Extremely Poor Performance

beevabeeva commented 4 years ago

I am using the latest master branch of thundersvm. Compared to serial sklearn (LibSVM), thundersvm is orders of magnitude slower. I am probably doing something wrong though.

I have tested this on a GTX 750 Ti and 1060 Ti with the same results. I have to stop thundersvm because it just seems like it will never end, while the serial sklearn takes 0.5 seconds on 50 000 instances of data (5 features).

Here is my test code if you would like to try to replicate this (the dataset is BNG_COMET: https://www.openml.org/d/5648):

ThunderSVM test:

from thundersvm import SVC
import numpy as np
import time

data = np.loadtxt(open("atm/demos/BNG_COMET.csv", "rb"), delimiter=",", skiprows=1)
# data = np.loadtxt(open("atm/demos/pollution.csv", "rb"), delimiter=",", skiprows=1)

# print(data, data.shape)

# Train on the first 5,000 rows; the last column is the label.
X = data[:5000, :-1]
y = data[:5000, -1]

# Three held-out rows for a quick prediction check.
xp_lots_of_test_samples = data[5100:5103, :-1]

print("X", X, X.shape)
print(y)

start = time.time()

clf = SVC(C=176.6677880062673, cache_size=150, class_weight='balanced', coef0=0.0,
    decision_function_shape='ovo', degree=3, gamma=12005.61948153516, gpu_id=1,
    kernel='linear', max_iter=5, max_mem_size=-1, n_jobs=-1, probability=True,
    random_state=None, shrinking=True, tol=0.001, verbose=True)

clf.fit(X, y)

end_time = time.time()
totaltime = end_time - start
print('time: ', totaltime)

print("predictions:")
print(clf.predict(xp_lots_of_test_samples))
print("true labels:")
print(data[5100:5103, -1])

Sklearn test:

from sklearn import svm
import numpy as np
import time

# data = np.loadtxt(open("atm/demos/pollution.csv", "rb"), delimiter=",", skiprows=1)
data = np.loadtxt(open("atm/demos/BNG_COMET.csv", "rb"), delimiter=",", skiprows=1)

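# Train on the first 50,000 rows; the last column is the label.
X = data[:50000, :-1]
y = data[:50000, -1]

# Three held-out rows for a quick prediction check.
xp_lots_of_test_samples = data[50100:50103, :-1]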

# clf = svm.SVC(kernel='rbf',
#          verbose=True,
#          gamma=0.5, 
#          C=120.51564536384429, 
#          max_iter = 50000,
#          class_weight = 'balanced'
#                 )

start = time.time()
clf = svm.SVC(C=176.6677880062673, cache_size=150, class_weight='balanced', coef0=0.0,
    decision_function_shape='ovo', degree=3, gamma=12005.61948153516,
    kernel='linear', max_iter=5, probability=True,
    random_state=None, shrinking=True, tol=0.001, verbose=True)

clf.fit(X, y)

end_time = time.time()
totaltime = end_time - start
print('time: ', totaltime)

print("predictions:")
print(clf.predict(xp_lots_of_test_samples))
print("true labels:")
# print(data[137:150,-1])
print(data[50100:50103, -1])

beevabeeva commented 4 years ago

Update: This might be due to really bad hyperparameters being passed to ThunderSVM from the AutoML framework. A comment in the ATM source code suggests this:

Notes:

- Support vector machines (svm) can take a long time to train. It's not an error, it's just part of what happens when the method happens to explore a crappy set of parameters on a powerful algo like this.

Having said that, this might not be the only issue causing the slow computation.
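When comparing timings under settings like these, it can help to check whether a fit actually converged or simply hit its iteration cap. In scikit-learn, `max_iter` is a hard limit on solver iterations, and a fit that reaches it emits a `ConvergenceWarning`. The sketch below is only an illustration under stated assumptions: the data is synthetic (a hypothetical 5-feature, 2-class stand-in, not BNG_COMET), `fit_and_check` is a helper name made up for this example, and whether ThunderSVM treats `max_iter` the same way is not established in this thread.

```python
import warnings

from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.svm import SVC

# Synthetic stand-in for the real dataset (assumption: 5 features, 2 classes,
# with 10% label noise so the classes overlap).
X, y = make_classification(
    n_samples=2000, n_features=5, n_informative=3, n_redundant=0,
    flip_y=0.1, random_state=0,
)


def fit_and_check(max_iter):
    """Fit a linear SVC and report whether the solver hit its iteration cap."""
    clf = SVC(kernel="linear", C=176.6677880062673, max_iter=max_iter)
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        clf.fit(X, y)
    # scikit-learn signals an early stop with a ConvergenceWarning.
    return any(issubclass(w.category, ConvergenceWarning) for w in caught)


stopped_early_at_5 = fit_and_check(max_iter=5)
print("stopped early with max_iter=5:", stopped_early_at_5)
```

If the flag is set, the measured time describes a truncated run rather than a full training, so it may not be a fair baseline for another library.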

zeyiwen commented 4 years ago

Thanks. We will look into the issue.

Some quick hints: hyper-parameters can affect convergence; data normalization also affects convergence. You may try to help us find out.

emmenlau commented 4 years ago

Any update on this? I would also be curious to learn if there are performance bottlenecks...

zeyiwen commented 4 years ago

ThunderSVM almost always performs much better than existing SVM libraries. Its known performance problem is slow convergence in some extreme cases (e.g., when the values in each dimension range from 0 to 10,000). Extreme hyper-parameters can also hurt the efficiency of any SVM implementation, not only ThunderSVM.
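For the wide-range case mentioned above, a common remedy is to standardize the features before training. The following is a minimal sketch using scikit-learn only; the data is randomly generated for illustration (hypothetical features spanning roughly 0 to 10,000), and nothing here has been measured against ThunderSVM. Since the code earlier in this thread constructs thundersvm's `SVC` with the same arguments as sklearn's, the same pipeline may accept it as a drop-in replacement, but that is an assumption and untested here.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Hypothetical data: 5 features whose values span roughly 0 to 10,000.
X = rng.uniform(0.0, 10_000.0, size=(1000, 5))
y = (X[:, 0] + X[:, 1] > 10_000.0).astype(int)

# Per-feature ranges: a quick way to spot the extreme case.
feature_ranges = X.max(axis=0) - X.min(axis=0)
print("feature ranges before scaling:", feature_ranges)

# Standardize inside a pipeline so the same scaling is applied at predict time.
model = make_pipeline(StandardScaler(), SVC(kernel="linear", C=1.0))
model.fit(X, y)

# Inspect what the SVM actually sees after scaling.
X_scaled = model.named_steps["standardscaler"].transform(X)
print("feature mean after scaling:", X_scaled.mean(axis=0))
print("feature std after scaling:", X_scaled.std(axis=0))
print("training accuracy:", model.score(X, y))
```

Keeping the scaler inside the pipeline avoids fitting it on test data by accident and keeps training and prediction consistent.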

TZDZ commented 3 years ago

I also have the same problem, using the Kaggle Tabular Playground Series dataset from February 2021.

import pandas as pd
from sklearn import svm
from sklearn.compose import ColumnTransformer, TransformedTargetRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from thundersvm import SVR  # assumed source of SVR; the original post omits its imports

data_train = pd.read_csv('train.csv', sep=",").drop(columns=['id'])
data_test = pd.read_csv('test.csv', sep=",").drop(columns=['id'])
y = data_train['target']
X = data_train.drop(columns=['target'])

cats_name = [c for c in X.columns if 'cat' in c]
cont_name = [c for c in X.columns if 'cont' in c]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

The following:

column_trans = ColumnTransformer(
    [('cats', OneHotEncoder(), cats_name),
     ('conts', StandardScaler(), cont_name)],
    remainder='drop')

regr = TransformedTargetRegressor(
    regressor=svm.LinearSVR(epsilon=0.0, tol=0.0001, C=1.0,
                            loss='squared_epsilon_insensitive', fit_intercept=True,
                            intercept_scaling=1.0, dual=True, verbose=0,
                            random_state=None, max_iter=2000),
    transformer=StandardScaler())

model = make_pipeline(column_trans, regr)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(mean_squared_error(y_test, y_pred))

executed in 52.6s, while this:

column_trans = ColumnTransformer(
    [('cats', OneHotEncoder(), cats_name),
     ('conts', StandardScaler(), cont_name)],
    remainder='drop')

regr = TransformedTargetRegressor(
    regressor=SVR(kernel='linear', epsilon=0.0, tol=0.0001, C=1.0,
                  verbose=0, max_iter=2000),
    transformer=StandardScaler())

model = make_pipeline(column_trans, regr)
model.fit(X_train, y_train)

executed in 7m 4s.

Xtra-Computing / thundersvm

Extremely Poor Performance #172
