NIHOPA / NLPre

Python library for Natural Language Preprocessing (NLPre)

Look into FlashText #100

Closed. thoppe closed this issue 6 years ago.

thoppe commented 7 years ago

A large part of the text-processing time is still spent replacing keywords; examine the use of FlashText.

In this paper we introduce the FlashText algorithm for replacing keywords or finding keywords in a given text. FlashText can search or replace keywords in one pass over a document. The time complexity of this algorithm is not dependent on the number of terms being searched or replaced. For a document of size N (characters) and a dictionary of M keywords, the time complexity will be O(N). This algorithm is much faster than regex, because regex time complexity is O(M x N). It is also different from the Aho-Corasick algorithm, as it doesn't match substrings. FlashText is designed to only match complete words (words with boundary characters on both sides). For an input dictionary of {Apple}, this algorithm won't match it to 'I like Pineapple'. This algorithm is also designed to go for the longest match first. For an input dictionary {Machine, Learning, Machine learning} on a string 'I like Machine learning', it will only consider the longest match, which is 'Machine learning'. We have made a Python implementation of this algorithm available as open-source on GitHub, released under the permissive MIT License.

https://github.com/vi3k6i5/flashtext

https://arxiv.org/abs/1711.00046
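For reference, a minimal sketch of the two behaviors the abstract calls out (whole-word matching and longest-match-first), using flashtext's public KeywordProcessor API; the replacement strings here are made up for illustration:

    from flashtext import KeywordProcessor

    kp = KeywordProcessor()
    kp.add_keyword("Apple", "FRUIT")
    kp.add_keyword("Machine", "M")
    kp.add_keyword("Learning", "L")
    kp.add_keyword("Machine learning", "ML")

    # Whole-word matching: 'Apple' is not matched inside 'Pineapple'
    print(kp.replace_keywords("I like Pineapple"))        # -> I like Pineapple

    # Longest match wins: 'Machine learning' beats 'Machine' + 'Learning'
    print(kp.replace_keywords("I like Machine learning")) # -> I like ML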

thoppe commented 6 years ago

Switching to flashtext for replace_from_dictionary makes this portion of the code 60 times faster.
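A rough sketch of what such a switch could look like. The class shape and dictionary format below are illustrative assumptions, not NLPre's actual replace_from_dictionary code:

    from flashtext import KeywordProcessor

    class replace_from_dictionary:
        """Hypothetical sketch: replace phrases with a flashtext
        KeywordProcessor that is built once and reused per call."""

        def __init__(self, replacements):
            # replacements: dict mapping source phrase -> replacement,
            # e.g. {"heart attack": "myocardial_infarction"}  (made-up example)
            self.kp = KeywordProcessor()
            for phrase, target in replacements.items():
                self.kp.add_keyword(phrase, target)

        def __call__(self, text):
            # One pass over the text, independent of dictionary size: O(N)
            return self.kp.replace_keywords(text)

Building the KeywordProcessor once at construction and reusing it on every call is where a large speedup over a per-call regex scan would come from.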

thoppe commented 6 years ago
function                        time (s)  frac of total
unidecoder                      0.000008  0.000122
token_replacement               0.000008  0.000125
dedash                          0.000369  0.005811
replace_from_dictionary         0.000442  0.006967
titlecaps                       0.001944  0.030632
decaps_text                     0.002502  0.039425
identify_parenthetical_phrases  0.005824  0.091770
replace_acronyms                0.006972  0.109849
separated_parenthesis           0.007339  0.115635
pos_tokenizer                   0.038059  0.599663

Previously replace_from_dictionary was at 0.025758 s.
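A sketch of how a per-call timing like the table above could be reproduced with timeit; the sample text and replacement dictionary are placeholders, not NLPre's benchmark harness:

    import timeit
    from flashtext import KeywordProcessor

    replacements = {"machine learning": "ML", "natural language": "NL"}
    text = "I like machine learning and natural language processing. " * 50

    kp = KeywordProcessor()
    for phrase, target in replacements.items():
        kp.add_keyword(phrase, target)

    # Average seconds per call, comparable to the 'time (s)' column above
    n = 1000
    t = timeit.timeit(lambda: kp.replace_keywords(text), number=n) / n
    print(f"replace (flashtext): {t:.6f} s per call")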

thoppe commented 6 years ago

Implemented and pushing out a new version.