Text Categorizer (accuracy issues)

mattkallo commented 6 years ago

This is most probably an issue with my training process/data. However, I have not been able to sort it out after spending a few days on it, so I need your help/input.

I am trying to train a text categorizer to identify stock market related news titles, and I am facing some issues with predictions on unseen data. It's a binary classifier (2 classes: stock market related or not related). My training set is roughly 400 stock market news titles and 600+ non-stock-market titles.

Problems I have noticed:

It's picking up all news with any number/currency symbol in it as stock market related, even though the training set has many other words like investment, market etc. It's still picking up news/article titles about sales/deals (e.g. with text like "Walmart 16GB RAMM $20.00").
It's also picking up news with numbers but no currency symbol, e.g. "Changes in year 2018."

Questions:

Is this because of the short length of the "title"? Most of the positive training data has words very specific to the stock market, but it's still picking up totally unrelated news titles.
Should I include more negative data? Will that make it better? (With 400 positive cases and 2000+ negative cases, will this create any bias/imbalance?)
Will removing numbers and currency symbols from the training data help?
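
One way to test the last question empirically would be to strip numbers and currency amounts from every title, retrain, and compare results on the same held-out titles. Below is a minimal sketch of such a preprocessing step. The helper name `mask_numbers` and the regular expressions are assumptions made for illustration, not part of spaCy, and whether this actually improves accuracy is something only the comparison can show.

```python
import re

# Hypothetical patterns (assumptions for illustration only):
# a currency symbol followed by an amount, e.g. "$20.00" or "£1,500"
_CURRENCY_AMOUNT = re.compile(r"[$€£¥]\s?\d[\d,]*(?:\.\d+)?")
# a standalone number, e.g. "2018" or "3.5" (digits glued to letters, as in "16GB", are left alone)
_NUMBER = re.compile(r"\b\d[\d,]*(?:\.\d+)?\b")


def mask_numbers(title):
    """Remove currency amounts and standalone numbers from a title."""
    title = _CURRENCY_AMOUNT.sub("", title)
    title = _NUMBER.sub("", title)
    # Collapse the whitespace left behind by the removals.
    return " ".join(title.split())


if __name__ == "__main__":
    for example in ["Walmart 16GB RAMM $20.00", "Changes in year 2018."]:
        print(repr(example), "->", repr(mask_numbers(example)))
```

The same function would need to be applied to the titles at prediction time, otherwise the model sees different input than it was trained on.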

Thanks for any feedback/input on this.

Your Environment

Info about spaCy

spaCy version: 2.0.11
Platform: Darwin-17.7.0-x86_64-i386-64bit
Python version: 3.6.4
Models: en_core_web_lg, en

honnibal commented 6 years ago

Debugging specific datasets is a bit outside of the scope of what we can help with here. I'd recommend comparing against a normal bag-of-words classifier, e.g. from scikit-learn, to check how spaCy's classifier is comparing to others. More training data might be helpful too.
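
As a rough illustration of the bag-of-words comparison suggested above, here is a minimal scikit-learn sketch. The choice of TF-IDF features with logistic regression, the bigram range, `class_weight="balanced"`, and the function names `build_baseline` and `evaluate_baseline` are all assumptions made for this example, not a recommended configuration. Labels are assumed to be 1 for stock market related titles and 0 otherwise.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline


def build_baseline():
    """Bag-of-words baseline: TF-IDF over unigrams and bigrams + logistic regression."""
    return make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2), lowercase=True),
        # "balanced" reweights classes, which may matter for a 400 vs 600+ split.
        LogisticRegression(max_iter=1000, class_weight="balanced"),
    )


def evaluate_baseline(titles, labels, folds=5):
    """Mean cross-validated F1 for the positive class (label 1)."""
    scores = cross_val_score(build_baseline(), titles, labels, cv=folds, scoring="f1")
    return scores.mean()
```

If the baseline makes the same mistakes on titles containing numbers or currency symbols, that would point at the training data rather than at spaCy's model.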

If you haven't tried Prodigy yet, it has a utility called textcat.train-curve which checks the accuracy on 80%, 50%, 25% etc of the training data. This helps you project how your accuracy might look at 120%, 150% etc of your current dataset, so you can guess how much data to collect.
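
The idea behind such a training curve can be sketched without Prodigy: hold out a fixed evaluation set, train on growing fractions of the rest, and record the accuracy each time. The code below is a generic illustration, not Prodigy's implementation. The function `train_curve` and its parameters `train_fn` and `accuracy_fn` are hypothetical callables you would supply for your own model.

```python
import random


def train_curve(examples, train_fn, accuracy_fn,
                fractions=(0.25, 0.5, 0.8, 1.0), eval_split=0.2, seed=0):
    """Return a list of (fraction, accuracy) pairs.

    train_fn(train_examples) -> model          (hypothetical, user-supplied)
    accuracy_fn(model, eval_examples) -> float (hypothetical, user-supplied)
    """
    examples = list(examples)
    random.Random(seed).shuffle(examples)

    # Keep the evaluation set fixed so the scores are comparable.
    n_eval = max(1, int(len(examples) * eval_split))
    eval_set, train_set = examples[:n_eval], examples[n_eval:]

    curve = []
    for fraction in fractions:
        n_train = max(1, int(len(train_set) * fraction))
        model = train_fn(train_set[:n_train])
        curve.append((fraction, accuracy_fn(model, eval_set)))
    return curve
```

If accuracy is still climbing between the last two fractions, collecting more data is likely to help. If it has flattened out, the problem is more likely in the labels or the features.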

lock[bot] commented 6 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

explosion / spaCy

Text Categorizer (accuracy issues) #2746
