ageron / handson-ml

⛔️ DEPRECATED – See https://github.com/ageron/handson-ml3 instead.
Apache License 2.0
25.2k stars 12.91k forks

Chapter 5: centering the training data for LinearSVC in a sparse matrix #481

Open genemishchenko opened 5 years ago

genemishchenko commented 5 years ago

Hi Aurelien.

In Chapter 5 on SVMs you write:

The LinearSVC class regularizes the bias term, so you should center the training set first by subtracting its mean. This is automatic if you scale the data using the StandardScaler.

The StandardScaler, however, does NOT center sparse matrices: it raises an error on sparse input unless you pass with_mean=False, in which case it scales to unit variance but leaves the data uncentered. Sparse matrices are very common in NLP applications (CountVectorizer's and TfidfTransformer's output type is csr_matrix, for instance), so this may be worth noting in the book.
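To illustrate (a minimal sketch; the exact exception type raised on sparse input has varied across scikit-learn versions, so I catch both candidates):

```python
import numpy as np
import scipy.sparse as sp
from sklearn.preprocessing import StandardScaler

X = sp.csr_matrix(np.array([[1., 0.], [0., 2.], [3., 4.]]))

# Centering would turn the implicit zeroes into non-zero values and
# densify the matrix, so StandardScaler refuses sparse input by default:
raised = False
try:
    StandardScaler().fit_transform(X)
except (TypeError, ValueError):
    raised = True

# with_mean=False scales each column to unit variance but does NOT center,
# and the result stays sparse:
X_scaled = StandardScaler(with_mean=False).fit_transform(X)
```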

That was the comment part... I also have a question based on it:

I found this great sklearn documentation page describing in detail what all the scalers do: https://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html (it focuses on managing outliers, but it's a great general visual overview)

None of the other scalers appear to do what StandardScaler does for dense arrays while also supporting sparse matrices, so I have implemented a function that centers each column/feature:

import pandas as pd
import scipy.sparse as sp

def center_csr_matrix(X):
    if not sp.isspmatrix_csr(X):
        raise TypeError('The argument is not a sparse CSR matrix')
    # work on a LIL copy so the original matrix is untouched and item
    # assignment is efficient (SciPy warns about item assignment on CSR)
    X_copy = X.tolil()
    # get the row and column indices of all the stored non-zero values
    # as separate flat arrays with position-wise logical pairing,
    # then convert to a Pandas DF
    nn_row_idx, nn_col_idx = X.nonzero()
    nn_value_indices = pd.DataFrame({"row": nn_row_idx, "col": nn_col_idx})
    # get a unique set of columns with non-null values
    nn_cols_set = set(nn_col_idx)
    # iterate through each column (dimension) with non-null values
    for c in nn_cols_set:
        # mean over the stored values only (the mean() method divides by the
        # number of rows, i.e. it counts the implicit zeroes)
        c_mean = X[:, c].sum() / X[:, c].nnz
        # filter the index DF to get the indices of the values in the relevant
        # column only and iterate over all such index records
        for i, nn_val_idx in nn_value_indices[nn_value_indices['col'] == c].iterrows():
            # subtract the column mean from the value in place
            X_copy[nn_val_idx['row'], c] -= c_mean
    return X_copy.tocsr()

Can you recommend a better method? This takes a long time... and I suspect that the inner loop with value-by-value assignment takes the longest, but incrementing a slice of a sparse matrix by a scalar is not implemented (I tried). Any help will be appreciated.
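Update: here is a vectorized sketch of the same non-zero-only centering that avoids the Python-level inner loop entirely (center_csr_nonzero is my own name, nothing from scikit-learn). It uses the fact that in CSR format, X.indices holds the column index of every stored value in X.data, so np.bincount can compute per-column sums and counts in one pass:

```python
import numpy as np
import scipy.sparse as sp

def center_csr_nonzero(X):
    if not sp.isspmatrix_csr(X):
        raise TypeError('The argument is not a sparse CSR matrix')
    X = X.copy()
    # drop explicitly stored zeroes so X.data lines up with X.nonzero()
    X.eliminate_zeros()
    # per-column sum and count of the stored values
    col_sums = np.bincount(X.indices, weights=X.data, minlength=X.shape[1])
    col_counts = np.bincount(X.indices, minlength=X.shape[1])
    # mean over stored values only; columns with no stored values get 0
    col_means = np.divide(col_sums, col_counts,
                          out=np.zeros(X.shape[1]), where=col_counts > 0)
    # subtract each column's mean from all of its stored entries at once
    X.data = X.data - col_means[X.indices]
    return X
```

This only touches the stored values, same as the loop version, so it runs in time proportional to the number of non-zero entries rather than entries times columns.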

Thank you. Gene.

genemishchenko commented 5 years ago

Actually, the biggest issue I have is that centering the data absolutely kills the accuracy of LinearSVC. After applying my centering method above to a training sparse matrix of 130K features and 12K instances (1.4M stored values in total), LinearSVC gives me only 90% accuracy on the full training set and a measly 50% on the test set. Without any centering or scaling I get 99.5% and 85% respectively (I tried both SGDClassifier and LinearSVC, and both topped out at the same accuracy on the data at hand). Am I doing something wrong, or does centering the data actually hurt LinearSVC's performance?