Kyubyong / g2pC

g2pC: A Context-aware Grapheme-to-Phoneme Conversion module for Chinese
Apache License 2.0

g2pC: A Context-aware Grapheme-to-Phoneme Conversion Module for Chinese

Several open source libraries for Chinese grapheme-to-phoneme conversion exist, such as python-pinyin and xpinyin. However, none of them seems to disambiguate Chinese polyphonic words such as "行" ("xíng" (go, walk) vs. "háng" (line)) or "了" ("le" (completed action marker) vs. "liǎo" (finish, achieve)). Instead, they simply pick the most frequent pronunciation. Although that is a simple and economical strategy, machine learning techniques can help. We use a CRF (conditional random field) to determine the pronunciation of polyphonic words. The features cover the target word itself and its part-of-speech, both obtained from pkuseg, as well as its neighboring words.
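As a rough illustration of the featurization described above, the sketch below builds a feature dictionary for one token from its own word and part-of-speech plus those of its neighbors. The function name `token_features`, the feature keys, the `<PAD>` marker, and the window size are all hypothetical, and the example sentence is tagged by hand rather than by pkuseg. The actual feature template used by g2pC may differ.

```python
from typing import Dict, List, Tuple

Token = Tuple[str, str]  # (word, part-of-speech tag)


def token_features(sent: List[Token], i: int, window: int = 1) -> Dict[str, str]:
    """Build features for the token at position i from itself and its neighbors.

    Hypothetical helper for illustration; not taken from the g2pC source.
    """
    word, pos = sent[i]
    feats = {"word": word, "pos": pos}
    for offset in range(-window, window + 1):
        if offset == 0:
            continue
        j = i + offset
        if 0 <= j < len(sent):
            n_word, n_pos = sent[j]
            feats[f"word[{offset:+d}]"] = n_word
            feats[f"pos[{offset:+d}]"] = n_pos
        else:
            # Mark positions outside the sentence so boundaries are visible.
            feats[f"word[{offset:+d}]"] = "<PAD>"
            feats[f"pos[{offset:+d}]"] = "<PAD>"
    return feats


# Hand-tagged example (illustrative tags); in g2pC the words and tags
# come from pkuseg.
sentence = [("这", "r"), ("行", "v"), ("吗", "y")]
features = token_features(sentence, 1)
print(features)
```

A sequence of such dictionaries, one per token, is the kind of input a CRF tagger can be trained on, with the pronunciation of each polyphonic word as the label to predict.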

Requirements

Accuracy

| Model             | # Correct | # Incorrect | Acc. (%) |
|-------------------|-----------|-------------|----------|
| g2pC (0.9.9.3)    | 13,033    | 158         | 98.80    |
| pypinyin (0.35.3) | 12,975    | 216         | 98.36    |
| xpinyin (0.5.6)   | 12,838    | 353         | 97.32    |
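The accuracy column in the table follows directly from the two count columns as correct / (correct + incorrect). The short check below recomputes it from the counts copied out of the table; the variable and function names are only illustrative. It also shows that every row sums to the same number of test items, 13,191.

```python
# (correct, incorrect) counts copied from the accuracy table
results = {
    "g2pC (0.9.9.3)": (13033, 158),
    "pypinyin (0.35.3)": (12975, 216),
    "xpinyin (0.5.6)": (12838, 353),
}


def accuracy(correct: int, incorrect: int) -> float:
    """Return accuracy as a percentage of all evaluated items."""
    return 100.0 * correct / (correct + incorrect)


accuracies = {name: round(accuracy(c, i), 2) for name, (c, i) in results.items()}
totals = {c + i for c, i in results.values()}

for name, acc in accuracies.items():
    print(f"{name}: {acc:.2f}%")
print("items per model:", totals)
```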

Changelog

0.9.9.3 July 10, 2019

0.9.9.2 July 10, 2019

0.9.9.1 July 9, 2019

0.9.6 July 7, 2019