dmlc / gluon-nlp

NLP made easy
https://nlp.gluon.ai/
Apache License 2.0

tokenizer should support options for diacritics normalization #1231

Open szha opened 4 years ago

szha commented 4 years ago

Description

For subword tokenizers, it's desirable to have explicit control over whether diacritics are normalized (see https://nlp.stanford.edu/IR-book/html/htmledition/accents-and-diacritics-1.html).
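For instance, along the lines of the linked chapter, accent folding can conflate words that differ only in a diacritic, which is why the behavior should be opt-in rather than hard-coded. A small illustration (the fold helper below is just for this example, not gluon-nlp API):

import unicodedata

def fold(text):
    # NFKD splits an accented character into its base character plus a
    # combining mark; the combining marks are then dropped.
    decomposed = unicodedata.normalize('NFKD', text)
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))

# Spanish 'peña' (cliff) and 'pena' (sorrow) collapse to the same string.
print(fold('peña'), fold('pena'))  # pena pena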

sxjscience commented 4 years ago

Usually, this is done as part of normalization. It is typically implemented via a strip_accents flag, which is supported by common tokenizers such as WordPiece.

A very common example is converting beyoncé to beyonce. We can do that with Python's standard unicodedata module:

import unicodedata

token = 'beyoncé'
# NFKD decomposes 'é' into 'e' followed by a combining acute accent.
token = unicodedata.normalize('NFKD', token)
# Drop the combining marks, keeping only the base characters.
token = ''.join([ele for ele in token if not unicodedata.combining(ele)])
print(token)

Output:

beyonce
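To connect this back to the feature request, here is a minimal sketch of how the snippet could be exposed as an explicit tokenizer option. The class name, constructor flag, and whitespace-split segmentation below are hypothetical placeholders, not an existing gluon-nlp API:

import unicodedata


def strip_diacritics(text):
    # Same idea as the snippet above: decompose with NFKD, then drop combining marks.
    decomposed = unicodedata.normalize('NFKD', text)
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))


class AccentAwareTokenizer:
    """Hypothetical tokenizer exposing explicit control over diacritics normalization."""

    def __init__(self, strip_accents=False):
        self._strip_accents = strip_accents

    def __call__(self, text):
        if self._strip_accents:
            text = strip_diacritics(text)
        # Real subword segmentation (BPE / WordPiece / SentencePiece) would run here;
        # whitespace splitting stands in for it in this sketch.
        return text.split()


print(AccentAwareTokenizer(strip_accents=True)('beyoncé'))   # ['beyonce']
print(AccentAwareTokenizer(strip_accents=False)('beyoncé'))  # ['beyoncé']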