explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License

Adding many special cases to Tokenizer greatly degrades startup performance #12534

Closed (adrianeboyd closed this issue 1 year ago)

adrianeboyd commented 1 year ago

Discussed in https://github.com/explosion/spaCy/discussions/12523

Originally posted by **Nickersoft** April 12, 2023

Hey folks, I wasn't sure whether to flag this as a bug or a question, so to play it safe I opened it as a discussion. Recently, after running some benchmarks, I noticed that adding a large number of special cases to the spaCy tokenizer (in this case over 200k) severely impacts the time it takes to load the pipeline. For context, I'm adding compound English phrases to the tokenizer (like "lay out" or "garbage man") so they are preserved as single tokens when processing text.

- Without any special cases added, loading my pipeline takes about **3s** on average.
- Adding my special cases at runtime raises that to about **20s**.
- If I add the special cases beforehand, serialize the pipeline to a directory, and then load it from that path, it takes upwards of **40s-130s**.

I would have expected the last case to be the _most_ performant, seeing as the tokenizer is written to disk with all of the special cases already contained in it, so I was surprised to see it perform so poorly. The latency in the second case makes sense to me, since it has to iterate over 200k words and add each one to the tokenizer via `add_special_case`.

The reason I'm filing this as a discussion and not a bug is that I'm not sure whether this is the best way to achieve what I'm after, or whether there is something I can do on my end to improve performance. I can provide code snippets as needed, though right now it's all pretty straightforward: loading a pipeline via `load()`, looping through my words and adding each via `add_special_case`, then writing it to disk via `nlp.to_disk()`.
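A minimal sketch of that workflow, assuming the `en_core_web_sm` base pipeline and a tiny hypothetical phrase list in place of the full 200k entries:

```python
import spacy
from spacy.attrs import ORTH

# Hypothetical phrase list; the real one reportedly has over 200k entries.
phrases = ["lay out", "garbage man"]

nlp = spacy.load("en_core_web_sm")  # assumed base pipeline

# Register each phrase as a special case so it is kept as a single token.
for phrase in phrases:
    nlp.tokenizer.add_special_case(phrase, [{ORTH: phrase}])

# Serialize the customized pipeline; reloading this directory is the
# 40s-130s path reported above.
nlp.to_disk("custom_pipeline")

doc = nlp("The garbage man will lay out the bins.")
print([t.text for t in doc])  # check whether the phrases survived tokenization
```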
adrianeboyd commented 1 year ago

Opened as a separate issue to track the speed regression for `Tokenizer.from_disk`.
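For reference, a rough way to time the reload path in question (the directory name is just the hypothetical output of the sketch above, not something taken from this issue):

```python
import time

import spacy

# Time a cold load of the serialized pipeline; this goes through
# Tokenizer.from_disk, the path whose regression is tracked here.
start = time.perf_counter()
nlp = spacy.load("custom_pipeline")
print(f"Pipeline loaded in {time.perf_counter() - start:.1f}s")
```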

github-actions[bot] commented 1 year ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.