Profiling suggests this block takes up a significant fraction of the total runtime and could probably be refactored:
```python
# Identify which phrases were used and possible replacements
R = collections.defaultdict(list)
for key, val in self.rdict.iteritems():
    if key in ldoc:
        R[val].append(key)
```
By first matching against the set of unique words in the document (with proper breaks for punctuation), we can cut the time needed for this module by about 50%!
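A minimal sketch of that idea (assuming `rdict` maps phrase keys to replacement strings and `ldoc` is the raw document text; the function name and tokenizing regex are illustrative, not the library's API):

```python
import collections
import re

def replace_candidates(rdict, ldoc):
    # Unique words in the document, split on punctuation boundaries.
    doc_words = set(re.findall(r"[a-z0-9]+", ldoc.lower()))

    R = collections.defaultdict(list)
    for key, val in rdict.items():
        # Cheap set-membership filter: a multi-word key can only match
        # if every one of its words appears somewhere in the document.
        key_words = re.findall(r"[a-z0-9]+", key.lower())
        if all(w in doc_words for w in key_words):
            # Only the surviving keys pay for the full substring test.
            if key in ldoc:
                R[val].append(key)
    return R
```

Note the filter assumes phrases match at word boundaries (the "proper breaks for punctuation" above); a key occurring only inside a longer word would be filtered out, where the plain substring test would have (arguably wrongly) accepted it.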
| function | time | frac |
| --- | --- | --- |
| token_replacement | 0.000007 | 0.000035 |
| unidecoder | 0.000009 | 0.000043 |
| dedash | 0.000346 | 0.001635 |
| titlecaps | 0.001804 | 0.008525 |
| decaps_text | 0.002472 | 0.011680 |
| identify_parenthetical_phrases | 0.005658 | 0.026733 |
| replace_acronyms | 0.006414 | 0.030304 |
| separated_parenthesis | 0.006859 | 0.032405 |
| pos_tokenizer | 0.060114 | 0.284009 |
| replace_from_dictionary | 0.127977 | 0.604630 |
Even though speed isn't our top concern, `replace_from_dictionary` is orders of magnitude slower than most functions.