> If `_process_text` is worth investigating a bit more, I would make sure the multiple `replace` are taking most of the time (might depend on string length)?
What is taking the most time within this function is `camel_case_split`, which splits strings that are written in camel case (`hautsDeSeine` => `hauts de seine`). This is twice as long as the replacing part, but I'm not sure if/how we can optimize it (it seems quite concise to me as it is).
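For context, the split is something along these lines (a minimal regex-based sketch, not necessarily our exact implementation of `camel_case_split`):

```python
import re

def camel_case_split(text: str) -> str:
    # Insert a space before each uppercase letter that follows a lowercase
    # letter, then lowercase everything: "hautsDeSeine" -> "hauts de seine".
    # Illustrative sketch only.
    return re.sub(r"(?<=[a-z])(?=[A-Z])", " ", text).lower()

print(camel_case_split("hautsDeSeine"))  # hauts de seine
```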
I'll push the hybrid version (using `translate` for single-character replacements and `replace` for longer ones), which seems to be the best trade-off.
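Roughly, the hybrid looks like this (a sketch; the replacement maps below are placeholders, not the actual ones used in `_process_text`):

```python
# Hybrid cleaning: translate handles single-character substitutions in one
# pass, replace handles patterns longer than one character.
SINGLE_CHAR_MAP = str.maketrans({"à": "a", "é": "e", "è": "e", "'": " "})
MULTI_CHAR_REPLACEMENTS = [("  ", " ")]  # e.g. double blanks

def clean(text: str) -> str:
    text = text.translate(SINGLE_CHAR_MAP)      # all single-char swaps at once
    for old, new in MULTI_CHAR_REPLACEMENTS:    # longer patterns need .replace
        text = text.replace(old, new)
    return text

print(clean("hauts  de  seine à l'été"))  # "hauts de seine a l ete"
```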
Update: after additional testing on strings of various lengths, it appears that `translate` does not perform better (it's even on short strings, but really worse as string length increases), so I'd say we stick to successive `replace` calls.
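For reference, this is the kind of comparison I mean (a hypothetical `timeit` sketch with placeholder replacement pairs, not the exact benchmark that was run):

```python
import timeit

# Placeholder pairs, not the real ones from the cleaning code.
REPLACEMENTS = [("à", "a"), ("é", "e"), ("è", "e"), ("ç", "c"), ("'", " ")]
TABLE = str.maketrans(dict(REPLACEMENTS))

def with_replace(text: str) -> str:
    for old, new in REPLACEMENTS:
        text = text.replace(old, new)
    return text

def with_translate(text: str) -> str:
    return text.translate(TABLE)

# Compare both approaches on strings of increasing length.
for repeat in (1, 10, 100, 1000):
    sample = "hauts-de-seine à côté de l'été " * repeat
    t_rep = timeit.timeit(lambda: with_replace(sample), number=10_000)
    t_tra = timeit.timeit(lambda: with_translate(sample), number=10_000)
    print(f"len={len(sample):6d}  replace={t_rep:.3f}s  translate={t_tra:.3f}s")
```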
Also tested alternatives to clean strings:

- `unidecode.unidecode` is more universal but really worse (2.5x longer).
- `translate` is more efficient but doesn't work if we want to replace strings that are longer than one character, which we do (for double blanks for instance). So we have to use `.replace` afterwards anyway (fewer times though), and that brings the timing up to the same standard as just using `.replace` all the way.

@maudetes thoughts?