datagouv / csv-detective

CSV inspection

fix: add more characters to clean #71

Closed Pierlou closed 10 months ago

Pierlou commented 10 months ago

Also tested alternatives to clean strings:

@maudetes thoughts?

maudetes commented 10 months ago

If _process_text is worth investigating a bit more, I would first make sure that the multiple replace calls are what takes most of the time (this might depend on string length).

Pierlou commented 10 months ago

Most of the time in this function is spent in camel_case_split, which splits strings written in camel case (hautsDeSeine => hauts de seine). It takes about twice as long as the replacement step, but I'm not sure if or how we can optimize it (it seems quite concise as it is). I'll push the hybrid version (translate for single-character replacements, replace for longer ones), which seems to be the best trade-off.

Update: after additional testing on strings of various lengths, translate does not perform better: the two are on par for short strings, but translate gets much worse as string length increases. So I'd say we stick with successive replace calls.
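For anyone who wants to reproduce this kind of comparison, here is a minimal sketch of how the three pieces could be timed separately. It is not the code from csv-detective: the replacement tables, the function names clean_with_replace and clean_hybrid, and this camel_case_split regex are all assumptions made for illustration, and the real _process_text may differ.

```python
import re
import timeit

# Hypothetical replacement tables (NOT the ones used in csv-detective).
SINGLE_CHAR = {"-": " ", "_": " ", "'": " ", ",": " ", "é": "e", "è": "e", "à": "a"}
MULTI_CHAR = {"  ": " "}
TRANSLATION_TABLE = str.maketrans(SINGLE_CHAR)


def clean_with_replace(text: str) -> str:
    """Successive str.replace calls, one per entry."""
    for old, new in SINGLE_CHAR.items():
        text = text.replace(old, new)
    for old, new in MULTI_CHAR.items():
        text = text.replace(old, new)
    return text


def clean_hybrid(text: str) -> str:
    """str.translate for single characters, str.replace for longer patterns."""
    text = text.translate(TRANSLATION_TABLE)
    for old, new in MULTI_CHAR.items():
        text = text.replace(old, new)
    return text


def camel_case_split(text: str) -> str:
    """Assumed implementation: insert a space at each lowercase-to-uppercase boundary."""
    return re.sub(r"(?<=[a-z])(?=[A-Z])", " ", text)


def benchmark(lengths=(10, 1_000, 100_000), repeats=200):
    """Print the total time of each function on sample strings of several lengths."""
    for n in lengths:
        # Repeat a 21-character pattern and cut it to exactly n characters.
        sample = ("hautsDe-Seine_d'été, " * (n // 21 + 1))[:n]
        for func in (clean_with_replace, clean_hybrid, camel_case_split):
            seconds = timeit.timeit(lambda: func(sample), number=repeats)
            print(f"len={n:>7} {func.__name__:<20} {seconds:.4f}s")
```

Calling benchmark() prints one timing per function and string length, which makes it easy to check both whether the replacements or the camel-case split dominate and how translate compares to successive replace as strings grow. Timings will vary with the Python version and with the actual replacement tables, so the numbers are only indicative.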