ageron / handson-ml

⛔️ DEPRECATED – See https://github.com/ageron/handson-ml3 instead.
Apache License 2.0
25.2k stars 12.91k forks

Chapter 5: centering the training data for LinearSVC in a sparse matrix #481

Open genemishchenko opened 5 years ago

genemishchenko commented 5 years ago

Hi Aurelien.

In Chapter 5 on SVMs you write:

The LinearSVC class regularizes the bias term, so you should center the training set first by subtracting its mean. This is automatic if you scale the data using the StandardScaler.

The StandardScaler, however, does NOT center sparse matrices: it raises an error on sparse input unless you pass with_mean=False, in which case it scales to unit variance but leaves the data uncentered. Sparse matrices are very common in NLP applications (CountVectorizer's and TfidfTransformer's output type is csr_matrix, for instance), so this may be worth noting in the book.
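To illustrate (a minimal sketch; the exact exception type raised on sparse input has varied across scikit-learn versions, so I catch both candidates):

```python
import numpy as np
import scipy.sparse as sp
from sklearn.preprocessing import StandardScaler

X = sp.csr_matrix(np.array([[1., 0.], [0., 2.], [3., 4.]]))

# Centering would turn the implicit zeroes into non-zero values and
# densify the matrix, so StandardScaler refuses sparse input by default:
raised = False
try:
    StandardScaler().fit_transform(X)
except (TypeError, ValueError):
    raised = True

# with_mean=False scales each column to unit variance but does NOT center,
# and the result stays sparse:
X_scaled = StandardScaler(with_mean=False).fit_transform(X)
```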

That was the comment part... I also have a question based on it:

I found this great sklearn documentation page describing in detail what all the scalers do: https://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html (it focuses on managing outliers, but it's a great general visual overview)

None of the other scalers appear to do what StandardScaler does for dense arrays while also supporting sparse matrices, so I have implemented a function that centers each column/feature:

import pandas as pd
import scipy.sparse as sp

def center_csr_matrix(X):
    if not sp.isspmatrix_csr(X):
        raise TypeError('The argument is not a sparse CSR matrix')
    # work on a LIL copy so the original matrix is untouched and item
    # assignment is efficient (SciPy warns about item assignment on CSR)
    X_copy = X.tolil()
    # get the row and column indices of all the stored non-zero values
    # as separate flat arrays with position-wise logical pairing,
    # then convert to a Pandas DF
    nn_row_idx, nn_col_idx = X.nonzero()
    nn_value_indices = pd.DataFrame({"row": nn_row_idx, "col": nn_col_idx})
    # get a unique set of columns with non-null values
    nn_cols_set = set(nn_col_idx)
    # iterate through each column (dimension) with non-null values
    for c in nn_cols_set:
        # mean over the stored values only (the mean() method divides by the
        # number of rows, i.e. it counts the implicit zeroes)
        c_mean = X[:, c].sum() / X[:, c].nnz
        # filter the index DF to get the indices of the values in the relevant
        # column only and iterate over all such index records
        for i, nn_val_idx in nn_value_indices[nn_value_indices['col'] == c].iterrows():
            # subtract the column mean from the value in place
            X_copy[nn_val_idx['row'], c] -= c_mean
    return X_copy.tocsr()

Can you recommend a better method? This takes a long time... and I suspect that the inner loop with value-by-value assignment takes the longest, but incrementing a slice of a sparse matrix by a scalar is not implemented (I tried). Any help will be appreciated.
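Update: here is a vectorized sketch of the same non-zero-only centering that avoids the Python-level inner loop entirely (center_csr_nonzero is my own name, nothing from scikit-learn). It uses the fact that in CSR format, X.indices holds the column index of every stored value in X.data, so np.bincount can compute per-column sums and counts in one pass:

```python
import numpy as np
import scipy.sparse as sp

def center_csr_nonzero(X):
    if not sp.isspmatrix_csr(X):
        raise TypeError('The argument is not a sparse CSR matrix')
    X = X.copy()
    # drop explicitly stored zeroes so X.data lines up with X.nonzero()
    X.eliminate_zeros()
    # per-column sum and count of the stored values
    col_sums = np.bincount(X.indices, weights=X.data, minlength=X.shape[1])
    col_counts = np.bincount(X.indices, minlength=X.shape[1])
    # mean over stored values only; columns with no stored values get 0
    col_means = np.divide(col_sums, col_counts,
                          out=np.zeros(X.shape[1]), where=col_counts > 0)
    # subtract each column's mean from all of its stored entries at once
    X.data = X.data - col_means[X.indices]
    return X
```

This only touches the stored values, same as the loop version, so it runs in time proportional to the number of non-zero entries rather than entries times columns.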

Thank you. Gene.

genemishchenko commented 5 years ago

Actually, the biggest issue I have is that centering the data absolutely kills the accuracy of LinearSVC. After applying my centering method above to a training sparse matrix of 130K features and 12K instances (1.4M stored values in total), LinearSVC gives me only 90% accuracy on the full training set and a measly 50% on the test set. Without any centering or scaling I get 99.5% and 85% respectively (I tried both SGDClassifier and LinearSVC, and both topped out at the same accuracy on the data at hand). Am I doing something wrong, or does centering the data actually hurt LinearSVC's performance?