Open genemishchenko opened 5 years ago
Actually, the biggest issue I have is that centering the data absolutely kills LinearSVC's accuracy. After applying the self-made centering method above to a training sparse matrix of 130K features and 12K instances (about 1.4M stored values), LinearSVC gives me only 90% accuracy on the full training set and a measly 50% on the test set. Without centering or scaling of any kind I get 99.5% and 85% respectively (I tried both SGDClassifier and LinearSVC; both topped out at the same accuracy on this data). Am I doing something wrong, or does centering the data actually hurt LinearSVC?
Hi Aurelien.
In Chapter 5 on SVMs you write:
The StandardScaler, however, does NOT work on sparse matrices, which are very common in NLP applications (CountVectorizer and TfidfTransformer both output csr_matrix, for instance). This may be worth noting in the book.
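A quick repro of what I mean (my own sketch, not from the book): with its default settings StandardScaler refuses sparse input, because centering would turn all the stored zeros into nonzeros; scaling alone (with_mean=False) keeps sparsity and is accepted.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.preprocessing import StandardScaler

X = csr_matrix(np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 0.0]]))

# Default StandardScaler (with_mean=True) rejects sparse input, because
# subtracting the mean would densify the matrix.
raised = False
try:
    StandardScaler().fit(X)
except (TypeError, ValueError):
    raised = True
print("centering on sparse rejected:", raised)

# Scaling only (dividing by each column's std) preserves sparsity,
# so it is allowed on csr_matrix.
X_scaled = StandardScaler(with_mean=False).fit_transform(X)
print("result format:", X_scaled.format)
```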
That was the comment part... I also have a question based on it:
I found this great sklearn documentation page describing in detail what all the scalers do: https://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html (it focuses on managing outliers, but it's a great general visual overview)
It does not look like there is any other scaler that can do the same thing as StandardScaler does with dense arrays, but also works on sparse matrices, so I have implemented a function that centers each column/feature:
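For illustration, a loop-based column centering along those lines looks roughly like this (my reconstruction, not necessarily the exact code; the per-element assignment inside the inner loop is what makes it slow):

```python
import numpy as np
from scipy.sparse import csr_matrix

def center_sparse_columns(X):
    # Illustrative loop-based centering: subtract each column's mean
    # with per-element assignment. LIL format tolerates element-wise
    # writes better than CSR, but this is still O(rows * cols) writes.
    X = X.tolil()
    means = np.asarray(X.mean(axis=0)).ravel()
    n_rows, n_cols = X.shape
    for j in range(n_cols):
        for i in range(n_rows):
            X[i, j] = X[i, j] - means[j]
    return X.tocsr()

X = csr_matrix(np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 0.0]]))
Xc = center_sparse_columns(X)
print(np.asarray(Xc.mean(axis=0)).ravel())  # each column mean is now ~0
```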
Can you recommend a better method? This takes a long time... I suspect the inner loop with its value-by-value assignment is the bottleneck, but incrementing a slice of a sparse matrix by a scalar is not implemented (I tried). Any help would be appreciated.
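For comparison, the vectorized equivalent I'd measure against (a sketch of mine, using a random sparse matrix in place of the real data) is just a broadcast subtraction; the catch is that it inherently produces a dense array, since centering fills in the zeros. For 12K instances x 130K features in float64 that is on the order of 12 GB.

```python
import numpy as np
from scipy.sparse import random as sparse_random

# Random sparse matrix standing in for the CountVectorizer output.
X = sparse_random(1000, 500, density=0.01, format="csr", random_state=0)

# Column means are cheap on CSR: one pass over the stored values.
col_means = np.asarray(X.mean(axis=0)).ravel()

# Subtracting them densifies the result, so this only works if the
# dense matrix fits in memory.
X_centered = X.toarray() - col_means

print(np.abs(X_centered.mean(axis=0)).max())  # ~0 up to float error
```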
Thank you. Gene.