OpenRefine / OpenRefine

OpenRefine is a free, open source power tool for working with messy data and improving it
https://openrefine.org/
BSD 3-Clause "New" or "Revised" License

Menu and function for removing diacritics #2295

Open msaby opened 4 years ago

msaby commented 4 years ago

Is your feature request related to a problem or area of OpenRefine? Please describe.

It would be useful to have a menu item and a GREL function to remove diacritics from strings.

Example:

"école" -> "ecole"

wetneb commented 4 years ago

Isn't that already available as a fingerprint function? If not it could potentially be added as such since it is possible to call clustering functions from GREL.
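For context, a fingerprint-style key does much more than drop accents. The following is only a rough approximation in Python, not OpenRefine's actual implementation; the function name `fingerprint_sketch` and the exact order of steps are assumptions made for illustration.

```python
import re
import unicodedata


def fingerprint_sketch(value: str) -> str:
    """Rough approximation of a fingerprint-style key (not OpenRefine's code)."""
    text = value.strip().lower()
    # Decompose accented letters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    text = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Remove punctuation: anything that is neither a word character nor whitespace.
    text = re.sub(r"[^\w\s]", "", text)
    # Split into tokens, de-duplicate, sort, and rejoin with single spaces.
    return " ".join(sorted(set(text.split())))


print(fingerprint_sketch("L'école et les ecoles"))  # ecoles et lecole les
```

Case, punctuation, and word order are all lost along with the accents, which is fine for clustering keys but not for cleaning cell values in place.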

msaby commented 4 years ago

I was thinking of something less aggressive than fingerprint: "L'école et les ecoles" -> "L'ecole et les ecoles"
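A minimal sketch of that gentler behaviour, written in Python with only the standard library. The name `strip_diacritics` is hypothetical, and this is one possible approach (decompose, drop nonspacing marks, recompose), not a description of any existing OpenRefine function.

```python
import unicodedata


def strip_diacritics(value: str) -> str:
    """Remove combining accents while keeping case, punctuation and word order."""
    # NFD splits "é" into "e" plus a combining acute accent (U+0301).
    decomposed = unicodedata.normalize("NFD", value)
    # Drop nonspacing marks (Unicode category "Mn"), keep everything else.
    stripped = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    # Recompose whatever is left so the result is in a consistent form.
    return unicodedata.normalize("NFC", stripped)


print(strip_diacritics("L'école et les ecoles"))  # L'ecole et les ecoles
```

Letters that have no canonical decomposition, such as "ø" or "ł", pass through unchanged with this approach, so a real implementation might need an extra mapping table for them.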

thadguidry commented 4 years ago

@msaby see if the following helps you out:

  1. https://github.com/OpenRefine/OpenRefine/wiki/Recipes#9-encoding-issues
  2. https://github.com/OpenRefine/OpenRefine/wiki/Extending-Jython-with-pypi-modules#how-to-replace-diacritic-characters

thadguidry commented 4 years ago

This seems fairly easy to do now if we simply use stripAccents from Apache Commons Lang's StringUtils.

For labeling simplicity (translations), I suggest giving the new GREL function the same name, stripAccents().

tfmorris commented 4 years ago

I'd like to see a more general approach to text normalization than just removing diacritics. We also need to deal with normalizing the various composed vs decomposed forms. Other related issues include #409 and #650.

I'm removing the "good second issue" label until we have the design nailed down. One possible approach would be to create a normalize function with different "strengths" of normalization to apply (decomposition, diacritic removal, case folding, etc).
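To make the idea concrete, here is a hedged sketch of a normalize function with cumulative strengths. The `Strength` names, their ordering, and the `normalize` signature are all hypothetical; nothing here reflects a settled design. It also illustrates the composed vs decomposed problem: the two spellings of "école" compare unequal until they are normalized.

```python
import unicodedata
from enum import IntEnum


class Strength(IntEnum):
    """Hypothetical levels; each level includes the ones below it."""
    COMPOSE = 1           # canonical composition only (NFC)
    STRIP_DIACRITICS = 2  # also remove combining marks
    FOLD_CASE = 3         # also apply Unicode case folding


def normalize(value: str, strength: Strength = Strength.COMPOSE) -> str:
    text = value
    if strength >= Strength.STRIP_DIACRITICS:
        # Decompose, then drop every character with a combining class.
        decomposed = unicodedata.normalize("NFD", text)
        text = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    if strength >= Strength.FOLD_CASE:
        text = text.casefold()
    # Always finish in composed form so equal strings compare equal.
    return unicodedata.normalize("NFC", text)


composed = "\u00e9cole"     # "é" as a single code point
decomposed = "e\u0301cole"  # "e" followed by a combining acute accent
print(composed == decomposed)                         # False
print(normalize(composed) == normalize(decomposed))   # True
print(normalize("École", Strength.STRIP_DIACRITICS))  # Ecole
print(normalize("École", Strength.FOLD_CASE))         # ecole
```

Whether the levels should be cumulative like this or independent flags is exactly the kind of design question that would need to be settled first.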

thadguidry commented 4 years ago

@tfmorris Sounds good, Tom. I would always trust your expertise in localization and international support anyway :-)