explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License

Text Categorizer (accuracy issues) #2746

Closed mattkallo closed 6 years ago

mattkallo commented 6 years ago

This is most probably an issue with my training process or data. However, I have not been able to sort it out after spending a few days on it, and I need your help and input.

I am trying to train a text categorizer to identify stock-market-related news titles and am having trouble with predictions on unseen data. It's a binary classifier (2 classes: stock market related or not related). My training set is roughly 400 stock market news titles and 600+ non-stock-market titles.

Problems I have noticed -

Questions -

Thanks for any feedback/input on this.

Your Environment

Info about spaCy

honnibal commented 6 years ago

Debugging specific datasets is a bit outside the scope of what we can help with here. I'd recommend comparing against a normal bag-of-words classifier, e.g. from scikit-learn, to check how spaCy's classifier compares to others. More training data might be helpful too.
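For anyone landing here later, a bag-of-words baseline along those lines might look roughly like the sketch below. It is only an illustration: the titles and labels are invented placeholders (not the dataset from this issue), and the choice of `CountVectorizer` plus `LogisticRegression` with 3-fold cross-validation is an assumption, not something recommended in this thread.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Invented placeholder titles; substitute the real training data here.
titles = [
    "Stocks rally as tech shares surge",
    "Dow Jones closes at record high",
    "Markets slide on interest rate fears",
    "Investors dump shares after earnings miss",
    "Nasdaq futures climb before the opening bell",
    "Shares of retailer tumble on weak guidance",
    "Local team wins championship game",
    "New recipe ideas for summer barbecues",
    "City council approves new park design",
    "Scientists discover a new species of frog",
    "Film festival announces this year's lineup",
    "Heavy rain expected across the region",
]
# 1 = stock market related, 0 = not related
labels = [1] * 6 + [0] * 6

# Word and bigram counts fed into a linear classifier.
# class_weight="balanced" compensates for unequal class sizes
# (the issue describes roughly 400 vs. 600+ examples).
baseline = make_pipeline(
    CountVectorizer(ngram_range=(1, 2), lowercase=True),
    LogisticRegression(max_iter=1000, class_weight="balanced"),
)

# Stratified 3-fold cross-validation gives an accuracy estimate
# on titles the model did not see during fitting.
scores = cross_val_score(baseline, titles, labels, cv=3, scoring="accuracy")
print("cross-validated accuracy: %.2f +/- %.2f" % (scores.mean(), scores.std()))

# Fit on everything and classify a couple of unseen titles.
baseline.fit(titles, labels)
predictions = baseline.predict(
    ["Shares rally on Wall Street", "Rain delays the championship game"]
)
print(predictions)
```

If a baseline like this scores about the same as the spaCy text categorizer on held-out titles, the limiting factor is more likely the data than the model.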

If you haven't tried Prodigy yet, it has a utility called textcat.train-curve, which checks the accuracy on 80%, 50%, 25%, etc. of the training data. This helps you project how your accuracy might look at 120%, 150%, etc. of your current dataset, so you can guess how much data to collect.
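The idea behind a train curve can be sketched without Prodigy. The code below is not Prodigy's implementation: the word-count model is a deliberately tiny stand-in for a real classifier, the helper names (`train_keyword_model`, `train_curve`) are hypothetical, and the example titles are invented. It only shows the mechanics of training on growing fractions of the data and scoring each model on the same held-out set.

```python
import random
from collections import Counter


def train_keyword_model(examples):
    """Count word occurrences per label (a toy stand-in for a real classifier)."""
    counts = {0: Counter(), 1: Counter()}
    for text, label in examples:
        counts[label].update(text.lower().split())
    return counts


def predict(model, text):
    """Pick the label whose training titles shared more words with the text."""
    words = text.lower().split()
    score_pos = sum(model[1][w] for w in words)
    score_neg = sum(model[0][w] for w in words)
    return 1 if score_pos > score_neg else 0


def accuracy(model, examples):
    correct = sum(predict(model, text) == label for text, label in examples)
    return correct / len(examples)


def train_curve(train, dev, fractions=(0.25, 0.5, 0.8, 1.0), seed=0):
    """Train on growing fractions of `train`; evaluate each model on the same `dev` set."""
    shuffled = list(train)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the subsets reproducible
    results = []
    for fraction in fractions:
        n = max(1, int(len(shuffled) * fraction))
        model = train_keyword_model(shuffled[:n])
        results.append((fraction, n, accuracy(model, dev)))
    return results


# Invented placeholder data: (title, label) with 1 = stock market related.
train = [
    ("Stocks rally as tech shares surge", 1),
    ("Markets slide on interest rate fears", 1),
    ("Investors dump shares after earnings miss", 1),
    ("Dow Jones closes at record high", 1),
    ("Local team wins championship game", 0),
    ("New recipe ideas for summer barbecues", 0),
    ("City council approves new park design", 0),
    ("Heavy rain expected across the region", 0),
]
dev = [
    ("Shares surge after record earnings", 1),
    ("Markets rally on rate cut hopes", 1),
    ("Team wins the game in heavy rain", 0),
    ("Council approves summer festival", 0),
]

curve = train_curve(train, dev)
for fraction, n, acc in curve:
    print("%3d%% of training data (%d examples): accuracy %.2f" % (fraction * 100, n, acc))
```

If accuracy is still climbing between the 80% and 100% rows, collecting more examples is likely to help; if it has flattened out, more data of the same kind probably will not.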

lock[bot] commented 6 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.