explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License

Training a spacy custom model for text classification is very slow and taking long time to train data set #5934

Closed ashoksinghnmg closed 4 years ago

ashoksinghnmg commented 4 years ago

Hello Team, I am facing issues while training a spaCy custom model for text classification. It is very slow: 1000 records take 10 minutes to train. I am using 37 labels.

Please help me make training faster.

svlandeg commented 4 years ago

Hi, sorry to hear you're running into speed issues with your use-case.

There's a bit of a trade-off between the different architectures. In general, "bow" is the fastest but least accurate, "ensemble" is the slowest but often most accurate, and "simple_cnn" is somewhere in between. I don't know which one you're using now - but perhaps you could try another?
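As a rough illustration of switching architectures, the sketch below builds the config dictionary that a spaCy v2.x `textcat` pipe accepts. The helper name `textcat_config` is hypothetical, and the `exclusive_classes=True` default is an assumption (it presumes each text has exactly one of the labels); only the three architecture names come from the comment above.

```python
# Architecture names mentioned above, ordered roughly from fastest to slowest.
ARCHITECTURES = ("bow", "simple_cnn", "ensemble")


def textcat_config(architecture="bow", exclusive_classes=True):
    """Build a textcat config dict (hypothetical helper, not part of spaCy)."""
    if architecture not in ARCHITECTURES:
        raise ValueError(f"unknown architecture: {architecture!r}")
    # Assumption: exclusive_classes=True means one label per text.
    return {"architecture": architecture, "exclusive_classes": exclusive_classes}


config = textcat_config("bow")

# With spaCy v2.x installed, the dict would be used roughly like this:
#   textcat = nlp.create_pipe("textcat", config=config)
#   nlp.add_pipe(textcat, last=True)
```

Trying "bow" first and moving to "simple_cnn" or "ensemble" only if accuracy is too low keeps the experiment cheap.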

Unfortunately, we're aware that the current textcat default models are not always great on larger, or more sparse, datasets.

One thing you could try is to run your text classification problem in 2 (or more) stages: e.g. predict a super category first, then for each such category, run a new model on its subcategories. This may positively affect both the accuracy and speed of your solution, though it'll require a bit more complexity in terms of the code/pipeline. Of course, this is assuming that you can make some sort of hierarchical tree from your 37 labels.
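A minimal sketch of the two-stage idea at prediction time, assuming each classifier is a callable that maps a text to a `{label: score}` dict (with spaCy this could be something like `lambda t: nlp(t).cats`). The function names, the stub classifiers, and the labels are all made up for illustration.

```python
def best_label(scores):
    """Return the highest-scoring label from a {label: score} dict."""
    return max(scores, key=scores.get)


def predict_two_stage(text, coarse_model, fine_models):
    """Predict a super category first, then a subcategory within it."""
    coarse = best_label(coarse_model(text))
    # Only the model belonging to the predicted super category is run.
    fine = best_label(fine_models[coarse](text))
    return coarse, fine


# Stand-in classifiers with hypothetical labels, only to make the sketch runnable.
def coarse_stub(text):
    if "invoice" in text:
        return {"billing": 0.9, "technical": 0.1}
    return {"billing": 0.2, "technical": 0.8}


fine_stubs = {
    "billing": lambda text: {"refund": 0.3, "invoice_error": 0.7},
    "technical": lambda text: {"login": 0.6, "crash": 0.4},
}

result = predict_two_stage("My invoice is wrong", coarse_stub, fine_stubs)
```

Each model in this setup only has to separate a handful of labels rather than all 37, which is where any gain in speed or accuracy would have to come from.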

github-actions[bot] commented 4 years ago

This issue has been automatically closed because it was answered and there was no follow-up discussion.

tanmayag78 commented 3 years ago

@svlandeg Is there any example of converting a multilabel classification into binary classification in spaCy? I also think that dividing it into multiple stages will have a positive effect on accuracy.

svlandeg commented 3 years ago

I don't think there is a generic solution possible for that. It depends on the business case and the labels that are in your dataset. If you review them, perhaps you can think of some hierarchical scheme that would make sense and would allow your text classification to run in multiple "steps" - from a coarser-grained label to multiple classifiers that work within each coarse label to determine the finer-grained labels.
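Once such a scheme exists, preparing the training data is mostly bookkeeping. The sketch below assumes the data is a list of `(text, fine_label)` pairs and that you have written a fine-to-coarse mapping by hand; the function name, labels, and texts are hypothetical.

```python
from collections import defaultdict


def split_by_hierarchy(examples, fine_to_coarse):
    """Split (text, fine_label) pairs into one coarse set and per-coarse fine sets."""
    coarse_data = []
    fine_data = defaultdict(list)
    for text, fine_label in examples:
        coarse_label = fine_to_coarse[fine_label]
        coarse_data.append((text, coarse_label))
        fine_data[coarse_label].append((text, fine_label))
    return coarse_data, dict(fine_data)


# Made-up mapping; a real one has to come from reviewing your own labels.
fine_to_coarse = {
    "refund": "billing",
    "invoice_error": "billing",
    "login": "technical",
    "crash": "technical",
}
examples = [
    ("I want my money back", "refund"),
    ("The app closes on startup", "crash"),
    ("I cannot sign in", "login"),
]

coarse_data, fine_data = split_by_hierarchy(examples, fine_to_coarse)
```

`coarse_data` would train the first-step classifier, and each entry of `fine_data` would train one classifier restricted to the labels under that coarse category.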

github-actions[bot] commented 3 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.