barrust / pyspellchecker

Pure Python Spell Checking http://pyspellchecker.readthedocs.io/en/latest/
MIT License
696 stars 101 forks

Extreme memory usage for words near longest word length in the dictionary #61

Closed blayzen-w closed 4 years ago

blayzen-w commented 4 years ago

It looks like using a string that will never match any word, but still fits under the maximum word length, can cause extreme memory usage.

Take the example "57ef934a-dbb0-4978-8626d41c819274". The first set of candidates, at distance 1, is an array of 3,214 words. Since none of these will match, it then generates the distance 2 candidates from each distance 1 candidate, which expands the array to 10,380,316 words and eats up around 30%+ of the memory on our 4 GB server. After around 4 or 5 requests the server maxes out its memory and freezes. If this package is used to spell-correct a user input field, a bot could easily overload the server.
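To see where the blow-up comes from, here is a minimal sketch of the classic Norvig-style candidate generation that spell checkers of this kind typically use. It is an illustration, not the library's actual code: I assume a plain 26-letter alphabet, whereas pyspellchecker derives its alphabet from the loaded dictionary, which is presumably why it reports 3,214 distance-1 candidates for this string rather than the ~1,800 this sketch produces.

```python
import string

def edit_distance_1(word):
    """All strings one edit away: deletes, transposes, replaces, inserts.
    Sketch of the classic Norvig scheme; assumes a 26-letter alphabet,
    unlike pyspellchecker's dictionary-derived alphabet."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

word = "57ef934a-dbb0-4978-8626d41c819274"
d1 = edit_distance_1(word)
print(len(d1))       # thousands of distance-1 candidates for a 33-char string
print(len(d1) ** 2)  # rough size of the naive distance-2 expansion: millions
```

The distance-1 set grows linearly with word length, but naively expanding every distance-1 candidate again makes the distance-2 set grow with the square of that, which is where the millions of strings come from.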

It should be possible to limit the words generated while iterating over the distance 1 candidates and building the distance 2 sets: only keep the distance 2 words that match the known set and discard the ones that don't. This would cap the number of words held at once to roughly the total distance * the number of distance 1 words, so in this case around 6,428 instead of 10,380,316 words in memory at the same time. It should be doable by modifying the functionality of __edit_distance_alt and possibly edit_distance_2.

Example: `return [e2 for e1 in tmp for e2 in self.known(self.edit_distance_1(e1))]`
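A self-contained sketch of that idea, using a Norvig-style `edit_distance_1` and a plain set standing in for the library's `known()` lookup (all names here are illustrative, not pyspellchecker's actual API):

```python
import string

def edit_distance_1(word):
    """All strings one edit away (simplified 26-letter alphabet)."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def edit_distance_2_known(word, known_words):
    """Distance-2 candidates filtered against the dictionary as they are
    generated, so unknown intermediates are never accumulated in memory."""
    return {e2
            for e1 in edit_distance_1(word)
            for e2 in edit_distance_1(e1)
            if e2 in known_words}

known_words = {"cat", "hat", "that"}  # toy dictionary
print(edit_distance_2_known("czz", known_words))  # {'cat'}
```

With the filter inside the comprehension, peak memory is one inner `edit_distance_1(e1)` set at a time plus the (small) set of known matches, instead of the full cross-product of all distance-2 candidates.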

I'll make a PR for this shortly.

barrust commented 4 years ago

Sounds great! Look forward to seeing the PR.

barrust commented 4 years ago

Thank you for the PR!