Xtra-Computing / thundersvm

ThunderSVM: A Fast SVM Library on GPUs and CPUs
Apache License 2.0

WIP Python performance #162

Closed VarIr closed 5 years ago

VarIr commented 5 years ago

Under certain circumstances ThunderSVM is surprisingly slow when invoked via the Python bindings. In my current use case I run repeated nested cross-validation on relatively small data sets (that is, many calls to fit and predict). There, a significant share of the time is currently spent in list comprehensions that create numpy arrays from ctypes arrays.

A quick fix was to replace the list comprehensions with calls to np.frombuffer, which improved speed considerably.
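To illustrate the difference, here is a minimal sketch. The ctypes array below is a hypothetical stand-in for the buffers that thundersvm.get_sv fills; the point is only the conversion pattern, not the library call itself:

```python
import numpy as np
from ctypes import c_float

# Hypothetical ctypes buffer, standing in for an array filled by get_sv
arr_length = 1000
csr_val = (c_float * arr_length)()
csr_val[:3] = [1.0, 2.0, 3.0]

# Old approach: the list comprehension goes through a Python float
# object for every element before numpy converts them back
vals_slow = np.asarray([v for v in csr_val], dtype=np.float32)

# New approach: np.frombuffer reinterprets the existing buffer
# directly, with no per-element Python work
vals_fast = np.frombuffer(csr_val, dtype=np.float32, count=arr_length)

assert np.array_equal(vals_slow, vals_fast)
```

Both produce the same values; the frombuffer version just skips the per-element boxing and type checking, which dominates for large arrays.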

I created this PR in case you're interested in including these changes upstream.

zeyiwen commented 5 years ago

Thanks. We'll run some tests, and merge the PR. Our team is currently working on paper deadlines, so it may take some time.

VarIr commented 5 years ago

I'm not very familiar with C/C++ memory management, so let's discuss this.

My current understanding is this: after fitting, the support vectors initially live in memory not managed by Python. On the Python side, ctypes arrays are allocated first, e.g. csr_row = (c_int * arr_length)(). Then thundersvm.get_sv(csr_row, ...) fills these arrays; I assume that in this step the data is copied from its original location to the new one (is this correct? I have not checked what the function actually does). np.frombuffer(csr_row, ...) then creates a numpy array referencing the new location. This should increment the ctypes array's refcount, so it will not be garbage collected. In that case we should be fine.

If get_sv does not make a copy, I would assume that self.row could get corrupted if the underlying data is modified. That could easily be avoided with np.frombuffer(...).copy(), however. There would then be no gain over the current implementation in terms of memory, but it would still be faster (avoiding all the type checks inside the list comprehension, etc.).
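The aliasing concern can be demonstrated directly. In this sketch the ctypes array is again a hypothetical stand-in for csr_row after get_sv has filled it; np.intc is used because it matches the width of C's int, whatever the platform:

```python
import numpy as np
from ctypes import c_int

# Hypothetical ctypes array standing in for csr_row after get_sv
csr_row = (c_int * 4)(0, 2, 4, 6)

view = np.frombuffer(csr_row, dtype=np.intc)         # shares memory with csr_row
safe = np.frombuffer(csr_row, dtype=np.intc).copy()  # owns its own memory

csr_row[0] = 99        # mutate the underlying ctypes buffer
print(view[0])         # 99 -- the view reflects the change
print(safe[0])         # 0  -- the copy is unaffected
```

So if the buffer can be modified after the fact, the extra .copy() buys safety at the cost of one memcpy, which is still far cheaper than the element-wise list comprehension.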

QinbinLi commented 5 years ago

Hi, @VarIr

You are right. There is a copy inside thundersvm.get_sv(csr_row, ...). As long as csr_row is not released by Python's garbage collector, it should be fine to use frombuffer. I'll merge the request. Thanks.

beevabeeva commented 4 years ago

EDIT: Please see issue #172

I am having a similar issue. I am using the latest master branch of thundersvm. Compared to serial sklearn (LibSVM), thundersvm is orders of magnitude slower, though I am probably doing something wrong. I have tested this on a GTX 750 Ti and a 1060 Ti with the same results. I have to stop thundersvm because it seems like it will never finish, while serial sklearn takes 0.5 seconds on 50 000 instances of data (5 features).

Here is my test code if you would like to try to replicate this (dataset is BNG_COMET: https://www.openml.org/d/5648):

ThunderSVM test:

from thundersvm import SVC
import numpy as np
import time

data = np.loadtxt(open("atm/demos/BNG_COMET.csv", "rb"), delimiter=",", skiprows=1)
# data = np.loadtxt(open("atm/demos/pollution.csv", "rb"), delimiter=",", skiprows=1)

X = data[:5000, :-1]
y = data[:5000, -1]
xp_lots_of_test_samples = data[5100:5103, :-1]

print("X", X, X.shape)
print(y)

start = time.time()
clf = SVC(C=176.6677880062673, cache_size=150, class_weight='balanced', coef0=0.0,
          decision_function_shape='ovo', degree=3, gamma=12005.61948153516, gpu_id=1,
          kernel='linear', max_iter=5, max_mem_size=-1, n_jobs=-1, probability=True,
          random_state=None, shrinking=True, tol=0.001, verbose=True)
clf.fit(X, y)
end_time = time.time()

print('time: ', end_time - start)
print("predictions:")
print(clf.predict(xp_lots_of_test_samples))
print("true labels:")
print(data[5100:5103, -1])

Sklearn test:

from sklearn import svm
import numpy as np
import time

data = np.loadtxt(open("atm/demos/BNG_COMET.csv", "rb"), delimiter=",", skiprows=1)
# data = np.loadtxt(open("atm/demos/pollution.csv", "rb"), delimiter=",", skiprows=1)

X = data[:50000, :-1]
y = data[:50000, -1]
xp_lots_of_test_samples = data[50100:50103, :-1]

start = time.time()
clf = svm.SVC(C=176.6677880062673, cache_size=150, class_weight='balanced', coef0=0.0,
              decision_function_shape='ovo', degree=3, gamma=12005.61948153516,
              kernel='linear', max_iter=5, probability=True,
              random_state=None, shrinking=True, tol=0.001, verbose=True)
clf.fit(X, y)
end_time = time.time()

print('time: ', end_time - start)
print("predictions:")
print(clf.predict(xp_lots_of_test_samples))
print("true labels:")
print(data[50100:50103, -1])

zeyiwen commented 4 years ago

Is the problem still there? If so, please open a new issue so that we can find time to fix it.

beevabeeva commented 4 years ago

@zeyiwen thanks for your help. I will open a new one now.