thoppe closed this 6 years ago
Switching to flashtext for replace_from_dictionary
makes this portion of the code about 60 times faster.
function                          time      frac
unidecoder                        0.000008  0.000122
token_replacement                 0.000008  0.000125
dedash                            0.000369  0.005811
replace_from_dictionary           0.000442  0.006967
titlecaps                         0.001944  0.030632
decaps_text                       0.002502  0.039425
identify_parenthetical_phrases    0.005824  0.091770
replace_acronyms                  0.006972  0.109849
separated_parenthesis             0.007339  0.115635
pos_tokenizer                     0.038059  0.599663
Previously replace_from_dictionary was at 0.025758.
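As a quick check of the "60 times" figure, the two replace_from_dictionary timings quoted in this thread can be divided directly. The variable names are only illustrative.

```python
# Timings for replace_from_dictionary quoted in this thread.
time_before = 0.025758  # before switching to flashtext
time_after = 0.000442   # after switching (from the profile table)

speedup = time_before / time_after
print(f"speedup: {speedup:.1f}x")  # roughly 58x, i.e. about 60 times faster
```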
Implemented; pushing out a new version.
A large part of the text processing time is still spent replacing keywords. Examine the use of FlashText:
https://github.com/vi3k6i5/flashtext
https://arxiv.org/abs/1711.00046
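A minimal sketch of what dictionary-based replacement with FlashText could look like. It is not the code that was pushed for this issue. The replacement dictionary and the build_replacer helper are hypothetical. The only FlashText calls used are KeywordProcessor, add_keyword, and replace_keywords, and a plain regex fallback is included so the sketch still runs if the flashtext package is not installed.

```python
import re

try:
    # Third-party package: pip install flashtext
    from flashtext import KeywordProcessor
except ImportError:
    # Fall back to a plain regex so the sketch still runs without flashtext.
    KeywordProcessor = None

# Hypothetical replacement dictionary: phrase to find -> replacement term.
REPLACEMENTS = {
    "heart attack": "myocardial_infarction",
    "high blood pressure": "hypertension",
}


def build_replacer(replacements):
    """Return a function that applies every replacement in a single pass."""
    if KeywordProcessor is not None:
        processor = KeywordProcessor(case_sensitive=False)
        for phrase, replacement in replacements.items():
            # add_keyword(keyword, clean_name): matches of `keyword` are
            # rewritten to `clean_name` by replace_keywords().
            processor.add_keyword(phrase, replacement)
        return processor.replace_keywords

    # Fallback: one case-insensitive alternation, longest phrases first,
    # matched only when not surrounded by word characters.
    lookup = {phrase.lower(): repl for phrase, repl in replacements.items()}
    phrases = sorted(lookup, key=len, reverse=True)
    pattern = re.compile(
        r"(?<!\w)(?:" + "|".join(map(re.escape, phrases)) + r")(?!\w)",
        re.IGNORECASE,
    )
    return lambda text: pattern.sub(lambda m: lookup[m.group(0).lower()], text)


replace_keywords = build_replacer(REPLACEMENTS)
result = replace_keywords("Patients with high blood pressure risk a heart attack.")
print(result)
```

The processor is built once and reused for every document, so the cost of loading the dictionary is not paid per call.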