Xtra-Computing / thundersvm

ThunderSVM: A Fast SVM Library on GPUs and CPUs
Apache License 2.0

Predicted class only has one label #236

Open Salehoof opened 3 years ago

Salehoof commented 3 years ago

Hello there, I am trying to use thundersvm on a text classification problem. I can run the test and get 0.98 accuracy, so the library seems to work on the test data. The problem is that when I use it on a text classification problem (e.g. the 20 newsgroups dataset), I get very strange predictions and therefore low accuracy compared to the sklearn SVC class. (In fact, y_pred is all "0"!) To demonstrate the problem, I have written a simple function to test it:

import time
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, HashingVectorizer, TfidfTransformer
import numpy as np
from sklearn import svm
import thundersvm
from sklearn.datasets import fetch_20newsgroups

def compare_sklearn_thunder(clf):
    categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med']
    twenty_train = fetch_20newsgroups(data_home='.', subset='train', categories=categories, shuffle=True, random_state=42)
    twenty_test = fetch_20newsgroups(data_home='.', subset='test', categories=categories, shuffle=True, random_state=42)

    count_vect = CountVectorizer()
    tfidf_transformer = TfidfTransformer()

    X_counts = count_vect.fit_transform(twenty_train.data + twenty_test.data)
    X_tfidf = tfidf_transformer.fit(X_counts)
    X_train = X_tfidf.transform(X_counts[:len(twenty_train.data)])
    X_test = X_tfidf.transform(X_counts[len(twenty_train.data):])

    s_time = time.time()
    print('X_train.shape = {}'.format(X_train.shape))
    print('X_test.shape = {}'.format(X_test.shape))
    clf.fit(X_train, twenty_train.target)
    train_time = time.time() - s_time
    print('Training Time = {:.4} seconds'.format(train_time))
    y_pred = clf.predict(X_test).astype(int)
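    # Print the first ten predictions next to the true labels, as in the output below
    print('sample y_pred : {}'.format(y_pred[:10]))
    print('sample y_true : {}'.format(twenty_test.target[:10]))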
    print('Accuracy = {:.2}'.format(np.mean(y_pred == twenty_test.target)))

Now if I test thundersvm.SVC() with compare_sklearn_thunder(thundersvm.SVC()), I get:

X_train.shape = (2257, 47319)
X_test.shape = (1502, 47319)
Training Time = 1.954 seconds
sample y_pred : [0 0 0 0 0 0 0 0 0 0]
sample y_true : [2 2 2 0 3 0 1 3 2 2]
Accuracy = 0.21

But when I test sklearn's SVC with compare_sklearn_thunder(svm.SVC()), it works fine:

X_train.shape = (2257, 47319)
X_test.shape = (1502, 47319)
Training Time = 8.247 seconds
sample y_pred : [2 2 2 0 3 0 1 3 1 1]
sample y_true : [2 2 2 0 3 0 1 3 2 2]
Accuracy = 0.88

How can I solve this problem? Thanks in advance.

QinbinLi commented 3 years ago

Hi @Salehoof ,

You can tune the parameters of SVC to get a good model. For example, if you try svm.SVC(gamma=0.5, C=100), the accuracy of ThunderSVM is 0.9.
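
One possible reason the two libraries behave so differently with default arguments is the default RBF kernel width. Recent scikit-learn versions default to gamma='scale', which is 1 / (n_features * X.var()), while gamma='auto' is 1 / n_features. The sketch below assumes, without having verified it against this ThunderSVM version, that ThunderSVM's default behaves like the 'auto' formula. The helper names and the toy matrix are hypothetical and only illustrate how far apart the two values can be on sparse tf-idf-like data.

```python
from statistics import pvariance


def gamma_auto(n_features):
    """gamma = 1 / n_features (scikit-learn's gamma='auto' formula)."""
    return 1.0 / n_features


def gamma_scale(rows):
    """gamma = 1 / (n_features * X.var()) (scikit-learn's gamma='scale' formula).

    X.var() is the population variance over all entries of the matrix.
    """
    n_features = len(rows[0])
    flat = [value for row in rows for value in row]
    return 1.0 / (n_features * pvariance(flat))


# Hypothetical stand-in for a sparse, L2-normalised tf-idf matrix:
# 4 documents, 8 features, two non-zero terms per document.
X_toy = [
    [0.6, 0.8, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.6, 0.8, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.6, 0.8, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.6, 0.8],
]

auto_toy = gamma_auto(len(X_toy[0]))
scale_toy = gamma_scale(X_toy)
print('toy gamma (auto)  = {:.4f}'.format(auto_toy))
print('toy gamma (scale) = {:.4f}'.format(scale_toy))

# With the 47319 features reported in the issue, the 'auto' formula gives a
# very small gamma, so the RBF kernel is nearly constant across documents.
auto_issue = gamma_auto(47319)
print('gamma (auto) for 47319 features = {:.2e}'.format(auto_issue))
```

If that assumption holds, passing gamma and C explicitly, as suggested above, removes the dependence on either library's defaults.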