compenguy / ngrammatic

A rust crate providing fuzzy search/string matching using N-grams
MIT License
25 stars 7 forks source link

Improve performance by limiting comparison to only relevant ngrams #6

Closed Hugo-C closed 2 years ago

Hugo-C commented 2 years ago

Hey, I have been trying to improve the search performance by not testing the whole corpus, but only testing the similarity against texts in the corpus which have at least a common ngram with the input text. A downside is that it requires to compute and store an extra hashmap.

I have put a benchmark file in a separated branch so as to not "pollute" this PR: https://github.com/Hugo-C/ngrammatic/tree/performance_improvement_benchmark For a single search in a corpus of 7 words, the improvement is not outstanding at ~18%: image

For 20 searches in a corpus of 100 randoms words, the improvement is more significant at ~94%: image The full benchmark report: criterion.zip

Please let me know your opinion

compenguy commented 2 years ago

The performance improvement looks great, but I'm a bit concerned about memory usage, since each full ngram is now being duplicated probably on average ~8 times (assuming an average word size of 6 that would make 7 copies of the ngram in the gram_to_ngram lookup table, plus the canonical entry in the ngram lookup table).

Since there's already a lookup table for word->ngram, could your added table just map gram->word instead of gram->ngram? admittedly it means you have to do two lookups, gram->word and word->ngram but I'd be interested in seeing the benchmark difference there, because it might be negligible, for a probably-meaningful reduction in memory usage.

Hugo-C commented 2 years ago

The memory usage concerns is understandable. Using the gram->word + word->ngram method is indeed negligible on performance: image criterion.zip For the memory I tried Valgrind's Massif. I reused the code in the rng_search benchmark, so a corpus of 100 words of 10 letters. The files produced if you can make sense of them :smile: : massif_print_new.txt massif_print_old.txt

Based only on the graphs:

    MB
1.672^                                                    :    ::             
     |                                               @@#:::::@::::::          
     |                                             @@@@#:::::@::::::@         
     |                                          @:@@@@@#:::::@::::::@:        
     |                                        ::@:@@@@@#:::::@::::::@:        
     |                                      @@: @:@@@@@#:::::@::::::@:        
     |                                    @:@@: @:@@@@@#:::::@::::::@::       
     |                                  @@@:@@: @:@@@@@#:::::@::::::@:::      
     |                               @@@@@@:@@: @:@@@@@#:::::@::::::@:::      
     |                              @@@@@@@:@@: @:@@@@@#:::::@::::::@::::     
     |                           @::@@@@@@@:@@: @:@@@@@#:::::@::::::@::::     
     |                         :@@: @@@@@@@:@@: @:@@@@@#:::::@::::::@:::::    
     |                       @::@@: @@@@@@@:@@: @:@@@@@#:::::@::::::@::::::   
     |                    ::@@::@@: @@@@@@@:@@: @:@@@@@#:::::@::::::@::::::   
     |                  @@: @@::@@: @@@@@@@:@@: @:@@@@@#:::::@::::::@:::::@   
     |                @@@@: @@::@@: @@@@@@@:@@: @:@@@@@#:::::@::::::@:::::@:  
     |              @:@@@@: @@::@@: @@@@@@@:@@: @:@@@@@#:::::@::::::@:::::@:: 
     |            @@@:@@@@: @@::@@: @@@@@@@:@@: @:@@@@@#:::::@::::::@:::::@:: 
     |          :@@@@:@@@@: @@::@@: @@@@@@@:@@: @:@@@@@#:::::@::::::@:::::@:::
     |       ::::@@@@:@@@@: @@::@@: @@@@@@@:@@: @:@@@@@#:::::@::::::@:::::@:::
   0 +----------------------------------------------------------------------->Mi
     0                                                                   6.828

vs

    KB
371.4^                                       :::#                             
     |                                       :  #    :::@::@:::@:@@::::::::   
     |                                       :  #  :::: @: @:: @:@ :: ::::    
     |                                       :  #:::::: @: @:: @:@ :: ::::    
     |                                       :  #: :::: @: @:: @:@ :: ::::    
     |                                       :  #: :::: @: @:: @:@ :: :::: :  
     |                                     :::  #: :::: @: @:: @:@ :: :::: :  
     |                                   @@: :  #: :::: @: @:: @:@ :: :::: :  
     |                                 @:@ : :  #: :::: @: @:: @:@ :: :::: :  
     |                                :@:@ : :  #: :::: @: @:: @:@ :: :::: :  
     |                             ::::@:@ : :  #: :::: @: @:: @:@ :: :::: :: 
     |                       :@@::::: :@:@ : :  #: :::: @: @:: @:@ :: :::: :: 
     |                       :@ : ::: :@:@ : :  #: :::: @: @:: @:@ :: :::: :: 
     |                       :@ : ::: :@:@ : :  #: :::: @: @:: @:@ :: :::: :: 
     |                   @@:::@ : ::: :@:@ : :  #: :::: @: @:: @:@ :: :::: :: 
     |                @  @ : :@ : ::: :@:@ : :  #: :::: @: @:: @:@ :: :::: :: 
     |                @::@ : :@ : ::: :@:@ : :  #: :::: @: @:: @:@ :: :::: :: 
     |              ::@: @ : :@ : ::: :@:@ : :  #: :::: @: @:: @:@ :: :::: :: 
     |            ::::@: @ : :@ : ::: :@:@ : :  #: :::: @: @:: @:@ :: :::: :: 
     |          ::: ::@: @ : :@ : ::: :@:@ : :  #: :::: @: @:: @:@ :: :::: :: 
   0 +----------------------------------------------------------------------->Mi
     0                                                                   4.150

It seems indeed we cut the RAM usage by 80% :)

Hugo-C commented 2 years ago

Compared to before the PR:

    KB
108.5^                                                  :                     
     |                                          @@##:::::::::::               
     |                                        @@@ # :: ::: : :                
     |                                      @@@@@ # :: ::: : : :              
     |                                    @@@@@@@ # :: ::: : : :              
     |                                  @@@ @@@@@ # :: ::: : : :::            
     |                           :     @@@@ @@@@@ # :: ::: : : ::             
     |                           : ::::@@@@ @@@@@ # :: ::: : : :: ::          
     |                           ::: : @@@@ @@@@@ # :: ::: : : :: :           
     |                        @@@::: : @@@@ @@@@@ # :: ::: : : :: : ::        
     |                       @@  ::: : @@@@ @@@@@ # :: ::: : : :: : :         
     |                    @@@@@  ::: : @@@@ @@@@@ # :: ::: : : :: : : :       
     |                  @@@@ @@  ::: : @@@@ @@@@@ # :: ::: : : :: : : :       
     |              :  :@@@@ @@  ::: : @@@@ @@@@@ # :: ::: : : :: : : :::     
     |              ::::@@@@ @@  ::: : @@@@ @@@@@ # :: ::: : : :: : : ::      
     |            @@:: :@@@@ @@  ::: : @@@@ @@@@@ # :: ::: : : :: : : :: :    
     |          ::@ :: :@@@@ @@  ::: : @@@@ @@@@@ # :: ::: : : :: : : :: ::   
     |        ::::@ :: :@@@@ @@  ::: : @@@@ @@@@@ # :: ::: : : :: : : :: : :: 
     |     :::: ::@ :: :@@@@ @@  ::: : @@@@ @@@@@ # :: ::: : : :: : : :: : :  
     |   :::: : ::@ :: :@@@@ @@  ::: : @@@@ @@@@@ # :: ::: : : :: : : :: : :  
   0 +----------------------------------------------------------------------->KB
     0                                                                   448.1

So the PR (for this specific case of 100 randoms words) improve the speed by a factor 10 for a memory cost of 3/4 times

compenguy commented 2 years ago

Published to crates.io as v0.3.5