szha opened this issue 4 years ago
Usually this is done as part of normalization, typically implemented via a strip_accents flag, which is supported by common tokenizers such as WordPiece.
A very common example is converting `beyoncé` to `beyonce`. We can do this with Python's built-in `unicodedata` module:
```python
import unicodedata

token = 'beyoncé'
# Decompose each character into its base character plus combining marks (NFKD),
# then drop the combining marks.
token = unicodedata.normalize('NFKD', token)
token = ''.join([ele for ele in token if not unicodedata.combining(ele)])
print(token)
```
Output:

```
beyonce
```
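For comparison, tokenizer libraries commonly surface this behavior through the strip_accents flag mentioned above. A minimal sketch, assuming the Hugging Face `tokenizers` package (which may not be the library this issue targets):

```python
# Sketch only: assumes the Hugging Face `tokenizers` package is installed.
from tokenizers.normalizers import BertNormalizer

# BERT-style normalization with diacritics stripped.
print(BertNormalizer(lowercase=False, strip_accents=True).normalize_str('beyoncé'))   # beyonce

# Same normalization, but keeping diacritics intact.
print(BertNormalizer(lowercase=False, strip_accents=False).normalize_str('beyoncé'))  # beyoncé
```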
Description
For subword tokenizers, it's desirable to have explicit control over whether to normalize diacritics (see https://nlp.stanford.edu/IR-book/html/htmledition/accents-and-diacritics-1.html).
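A minimal sketch of what such explicit control could look like, as a hypothetical strip_accents option on the normalization step (the function name and signature here are illustrative, not an existing API):

```python
import unicodedata

def normalize_token(token, strip_accents=True):
    # Hypothetical helper: strip_accents=False leaves the token untouched,
    # strip_accents=True drops combining marks after NFKD decomposition.
    if not strip_accents:
        return token
    decomposed = unicodedata.normalize('NFKD', token)
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))

print(normalize_token('beyoncé'))                       # beyonce
print(normalize_token('beyoncé', strip_accents=False))  # beyoncé
```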