cfinke / Typo.js

A client-side JavaScript spellchecker that uses Hunspell-style dictionaries.
Other
501 stars 110 forks source link

50% speed increase for suggest call #41

Closed Thomas101 closed 7 years ago

Thomas101 commented 8 years ago

I've done some performance tweaking for the suggest call. No changes to api, no changes to results. Algorithm remains the same. This removes some redundancy in processing.

Here's some stats from some test words...

takr es6: 52ms polyfil: 54ms original: 115ms 55% faster

hoouse es6: 134ms polyfil: 133ms original: 200ms 33% faster

suggestionsr es6: 590ms polyfil: 598ms original: 1201ms 51% faster

dsdsfdsfdsfdfsdfsdfsdf es6: 3105ms polyfil: 3116ms original: 5676ms 46% faster

kofifus commented 8 years ago

good work :)

one fix you can do is that in https://github.com/Thomas101/Typo.js/blob/bb9862bd6d0c81fa0ebdc848bdd0ebb7b222f644/typo/typo.js#L932

it's enought to do index[v]=true; which will save some memory.

But in general the problem with your solution is memory usage. doing ed1.forEach(function(word) { ed2.add(word) }); creates a huge array (easily over 100,000 words), your WordList then adds another array (index) of the same size and another hash (words) of the same size, and Set probably does the same internally. All this on top of the dictionaries that typo loads to memory.

In https://github.com/cfinke/Typo.js/pull/45 I optimized the memory use of suggest to ~ 2* edit1 that is ~1000 words. This can be further optimized to 1 * edit1 - when you think about it we can run 'known' inside the iteration that builds ed2 without storing the ed2 values as we never need them again. I found I got huge speed gains as well - I think not holding these huge structures results in less paging and everything going much faster.

But that kind of solution and async implementation still leaves the issue of duplicate entries. I wonder if a more clever edit1 algorithm can not produce duplicate in the first place. Or something like https://github.com/itslenny/SymSpell.js/blob/master/dist/SymSpell.js

Cheers

Thomas101 commented 8 years ago

I agree on the fix, I don't know what came over me there for adding the value in twice! This patch was the lowest hanging fruit I could find with a pretty big bang for buck in terms of speed improvements. There's definately further optimisation that can be done on the suggestion algorithm as a whole. Both in terms of speed and memory.

cfinke commented 7 years ago

Thanks for the contribution @Thomas101 and @kofifus, and sorry for the delay in merging.

cfinke commented 7 years ago

I had to revert this -- the quality of the suggest() results dropped noticeably. @Thomas101 Was there an intended change to functionality, or was it supposed to be performance changes only?

Thomas101 commented 7 years ago

The results should have been the same. Do you have any sample suggestion sets? Were you using node or the browser?

cfinke commented 7 years ago

I was running the tests in the browser via the Chrome extension.

An example: suggest( 'speling' ) returned ["seeking", "spieling", "pelting"], which does not seem as helpful as the old [ "spelling", "spieling", "spewing" ]. All of the old suggestions were one modification away from the argument to suggest(), but two of the new suggestions are two modifications away.

cfinke commented 7 years ago

I realized today why the suggestions changed -- Typo was using the fact that a word added to the edits1() return value multiple times in order to weight that word as a more useful spelling suggestion. The WordList class can't take advantage of this because it uses a hash to store those values.