fix: add more characters to clean

Pierlou commented 10 months ago

Also tested alternatives to clean strings:

using unidecode.unidecode is more universal but much slower (2.5x longer)
.translate is more efficient but only matches single characters, and we also need to replace strings that are longer than 1 character (double blanks, for instance). So we still have to call .replace afterwards (fewer times, though), and that brings the timing back up to the same level as using .replace all the way
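
For reference, a minimal sketch of the two variants compared above. The replacement tables here are hypothetical and much shorter than anything the project would use; the function names are made up for illustration and are not the ones in the codebase.

```python
# Hypothetical replacement tables, for illustration only.
SINGLE_CHARS = {"-": " ", "_": " ", "é": "e", "à": "a"}
MULTI_CHARS = {"  ": " "}  # patterns longer than one character

# str.maketrans only accepts single-character keys,
# which is why MULTI_CHARS cannot go through .translate.
TRANSLATION_TABLE = str.maketrans(SINGLE_CHARS)


def clean_with_replace(text: str) -> str:
    """Successive .replace calls for every pattern."""
    for old, new in {**SINGLE_CHARS, **MULTI_CHARS}.items():
        text = text.replace(old, new)
    return text


def clean_hybrid(text: str) -> str:
    """One .translate pass, then .replace for the longer patterns."""
    text = text.translate(TRANSLATION_TABLE)
    for old, new in MULTI_CHARS.items():
        text = text.replace(old, new)
    return text
```

Both functions should return the same result for the same input; only the timing differs.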

@maudetes thoughts?

maudetes commented 10 months ago

If _process_text is worth investigating a bit more, I would first make sure the multiple replace calls are what takes most of the time (it might depend on string length)?
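
One possible way to check this, sketched with timeit from the standard library. The replacement table, the sample string and the helper names are assumptions for illustration, not what _process_text actually contains.

```python
import timeit

# Hypothetical replacement table, for illustration only.
REPLACEMENTS = {"-": " ", "_": " ", ",": " ", "  ": " "}


def clean(text: str) -> str:
    """Successive .replace calls, standing in for the replacing step."""
    for old, new in REPLACEMENTS.items():
        text = text.replace(old, new)
    return text


def time_clean(length: int, number: int = 1000) -> float:
    """Total time in seconds for `number` calls on a string of `length` characters."""
    unit = "hauts-de_seine, "  # 16 characters
    sample = (unit * (length // len(unit) + 1))[:length]
    return timeit.timeit(lambda: clean(sample), number=number)


if __name__ == "__main__":
    for length in (10, 100, 1000, 10000):
        print(length, time_clean(length))
```

Running the same measurement on each step of the function separately would show which one dominates at each string length.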

Pierlou commented 10 months ago

What takes the most time within this function is camel_case_split, which splits strings that are written in camel case (hautsDeSeine => hauts de seine). It takes twice as long as the replacing part, but I'm not sure if/how we can optimize it (it seems quite concise to me as it is). I'll push the hybrid version (using translate for single-character replacements and replace for longer ones), as this seems to be the best trade-off.

Update: after additional testing on strings of various lengths, it appears that translate does not perform better (it is on par for short strings, but much slower as string length increases), so I'd say we stick to successive replace calls.
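
For the camel_case_split step mentioned above, here is a sketch of one alternative that could be benchmarked. It is not the current implementation: it assumes that splitting at every lowercase-to-uppercase boundary is enough, and it only recognizes unaccented ASCII letters. Whether it is actually faster has not been tested.

```python
import re

# Zero-width boundary between a lowercase and an uppercase ASCII letter.
# Compiled once at import time so it is not rebuilt on every call.
_CAMEL_BOUNDARY = re.compile(r"(?<=[a-z])(?=[A-Z])")


def camel_case_split(identifier: str) -> str:
    """Insert a space at each lowercase-to-uppercase boundary."""
    return _CAMEL_BOUNDARY.sub(" ", identifier)
```

Lowercasing the result of camel_case_split("hautsDeSeine") gives "hauts de seine", which matches the example above.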

datagouv / csv-detective

fix: add more characters to clean #71