GitYCC / g2pW

Chinese Mandarin Grapheme-to-Phoneme Converter. 中文轉注音或拼音 (convert Chinese to zhuyin or pinyin) (INTERSPEECH 2022)
https://arxiv.org/abs/2203.10430
Apache License 2.0

Include g2pW into PaddleSpeech TTS #5

Closed yt605155624 closed 2 years ago

yt605155624 commented 2 years ago

Thanks for the excellent work! @BarryKCL has integrated g2pW into PaddleSpeech, which greatly improves the accuracy of polyphone disambiguation.

PaddleSpeech is an open-source toolkit on the PaddlePaddle platform for a variety of critical tasks in speech and audio, with state-of-the-art and influential models.

@BarryKCL converted the torch model to ONNX and replaced the transformers dependency with paddlenlp.
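As context for what the ONNX port involves, here is a minimal sketch of running an exported g2pW model with onnxruntime. The file name `g2pw.onnx`, the input names, and the dtypes are assumptions for illustration (they mirror the `onnx_input` dict in the debug logs later in this thread), not @BarryKCL's actual code:

```python
import numpy as np
import onnxruntime as ort

# Load the exported graph (file name is hypothetical).
session = ort.InferenceSession("g2pw.onnx", providers=["CPUExecutionProvider"])

# One batched query; input names mirror the onnx_input dict shown in the
# debug logs later in this thread, shapes/dtypes are illustrative.
onnx_input = {
    "input_ids": np.array([[101, 872, 1962, 102]], dtype=np.int64),
    "token_type_ids": np.zeros((1, 4), dtype=np.int64),
    "attention_masks": np.ones((1, 4), dtype=np.int64),
    "phoneme_masks": np.zeros((1, 500), dtype=np.float32),  # phoneme vocab size is illustrative
    "char_ids": np.array([580], dtype=np.int64),
    "position_ids": np.array([2], dtype=np.int64),
}

# The model outputs per-phoneme scores; argmax picks the predicted pinyin id.
scores = session.run(None, onnx_input)[0]
predicted_ids = scores.argmax(axis=-1)
```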

Thank you again from the bottom of my heart! 😘


GitYCC commented 2 years ago

@yt605155624 @BarryKCL Thanks for your effort. Great job! Do you need any help?

@BarryKCL Great work converting the torch model to ONNX. Could I invite you to open a PR that replaces the models with ONNX versions to speed things up?

yt605155624 commented 2 years ago

@GitYCC Not yet, but after it is merged and used by community users of PaddleSpeech, we may find some problems that need your help. There is already one small bug in g2pW: some words are missing from your 简体 -> 繁体 (Simplified -> Traditional) dict (sorry, I don't really understand Taiwanese or Minnan; maybe a 简体 dict would be more convenient for our mainland users). You can reproduce it by inputting the word "概念". In @BarryKCL's script, he added a try/except to work around this bug and falls back to g2pM; please check https://github.com/PaddlePaddle/PaddleSpeech/blob/aecf8fd3844371abcce5d337fab83aae6807285b/paddlespeech/t2s/frontend/zh_frontend.py#L186
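A minimal sketch of that fallback pattern, assuming a callable g2pW converter and using g2pM's public API as the backup (the actual PaddleSpeech code is at the link above):

```python
from g2pM import G2pM  # lightweight dictionary-based backup converter

g2pm = G2pM()

def get_pinyin(sentence, g2pw):
    """Try g2pW first; fall back to g2pM when g2pW fails on the input,
    e.g. a character missing from its 简体 -> 繁体 dict."""
    try:
        return g2pw(sentence)[0]  # g2pW returns one pinyin list per sentence
    except Exception:
        print(f"[{sentence}] not in g2pW dict, use g2pM")
        return g2pm(sentence, tone=True)
```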

GitYCC commented 2 years ago

@yt605155624 The root cause of this problem is that our model is trained on a Traditional Chinese (繁体) dataset. So if we want to apply it to 简体 input, I need to use the OpenCC package to convert it first. But OpenCC still has some cases it cannot convert very well. Maybe we could find a better method for this conversion.
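For reference, a minimal sketch of that OpenCC conversion step, using the pure-Python `opencc-python-reimplemented` package (the official `opencc` bindings take a config name like `'s2t.json'` instead):

```python
from opencc import OpenCC

# 's2t' converts Simplified (简体) to Traditional (繁体); 't2s' is the reverse.
cc = OpenCC('s2t')

print(cc.convert('学习'))  # -> '學習'
print(cc.convert('概念'))  # -> '概念' (these two characters are identical in both scripts)
```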

yt605155624 commented 2 years ago

We also tried OpenCC for 繁体 -> 简体 -> 繁体 in PaddleSpeech TTS, but because OpenCC had some bugs when installed on Windows (not sure whether that is still the case), we removed OpenCC and now use a look-up table (the table may have been copied from somewhere on GitHub, I don't remember). You can check https://github.com/PaddlePaddle/PaddleSpeech/blob/0eb598b876f99bd26fa735577da92d46c45dc3fd/paddlespeech/t2s/frontend/zh_normalization/text_normlization.py#L81 , but I'm not sure whether it works well for Taiwanese/Minnan 繁体.

But let me put it another way: would it be possible for you to:

  1. convert your Traditional Chinese (繁体) dataset to a 简体 dataset
  2. train a 简体 g2pW model 😍

I don't know the complexity of this task, because I don't understand the Minnan language at all. 🥺

GitYCC commented 2 years ago

The problem still exists. In order to convert the dataset into 简体, we need a good 繁体 -> 简体 converter. If we had such a converter, our first problem would already be solved. XD

Maybe we can just use the look-up table (https://github.com/PaddlePaddle/PaddleSpeech/blob/0eb598b876f99bd26fa735577da92d46c45dc3fd/paddlespeech/t2s/frontend/zh_normalization/text_normlization.py#L81) to solve this problem temporarily.
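A minimal sketch of the look-up-table approach, with a hypothetical two-entry table standing in for the real one at the link above:

```python
# Hypothetical two-entry table; the real table maps thousands of characters.
TRAD_TO_SIMP = {'學': '学', '習': '习'}

# str.translate works per code point, so build the translation map once.
_TABLE = str.maketrans(TRAD_TO_SIMP)

def trad_to_simp(text: str) -> str:
    """Convert Traditional characters to Simplified one character at a time.
    Characters missing from the table pass through unchanged, which is
    exactly the failure mode discussed in this thread."""
    return text.translate(_TABLE)

print(trad_to_simp('學習'))  # -> '学习'
```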

BarryKCL commented 2 years ago

Of course, I will submit the onnxruntime code as soon as possible.

GitYCC commented 2 years ago

> Of course, I will submit the onnxruntime code as soon as possible.

Thank you!

yt605155624 commented 2 years ago

@GitYCC But even with a converter that is not very good, you can still get a dataset (though the amount of usable data may be reduced). I don't know whether that reduction would hurt the performance of a 简体 g2pW model.

GitYCC commented 2 years ago

Actually, in my opinion, no matter whether the 繁体 -> 简体 converter is good or not, if we just use such a converter, the effect of applying it to the dataset before training ("pre-use") is the same as applying it to the input at inference time ("post-use"): the error cases filtered out by pre-use would never be seen during training, so the model still could not deal with characters the conversion misses.

yt605155624 commented 2 years ago

I'm not very familiar with NLP; I naively thought:

  1. The G2P BERT only has to be trained on polyphonic characters, not on words such as "概念". Even if "概念" does not appear in your dataset after the 繁体 -> 简体 conversion, the pretrained BERT must have seen "概念" before.

  2. Because people in mainland China use Simplified Chinese, I naively thought there might not be as many "missing chars" in Simplified Chinese as in Traditional Chinese for the pretrained BERT vocab. For example, simplified "概念" may be in BERT's vocab, but if OpenCC cannot convert simplified "概念" to traditional "概念", then even if traditional "概念" is in BERT's vocab, the traditional g2pW still cannot deal with simplified "概念" as input.

g2pW is an excellent piece of work; I think it will have a great influence on the Chinese community (most of whom use Simplified Chinese). If it is blocked by a bad converter, I will be very sad.

GitYCC commented 2 years ago

I have an idea for getting a good look-up table: we can use Google Translate to help us.

Like this: (screenshot of a Google Translate example)

I will change the converter this way in the future.

yt605155624 commented 2 years ago

good idea

yt605155624 commented 2 years ago

Oh, I just found that when the input is "概念", the error is not caused by the 简体 -> 繁体 converter, but by the fact that "概念" contains no polyphone: the `texts` output of `prepare_data` is [], so the input to BERT is []. Maybe an empty-input check would fix this (sketched after the logs below):

sent before convert: 概念,
sent after convert: 概念,
sentences: ['概念,']
[] [] [] [['gai4', 'nian4', None]]
texts: []
onnx_input: {'input_ids': array([], dtype=float64), 'token_type_ids': array([], dtype=float64), 'attention_masks': array([], dtype=float64), 'phoneme_masks': array([], dtype=float32), 'char_ids': array([], dtype=float64), 'position_ids': array([], dtype=float64)}
[概念,] not in g2pW dict,use g2pM
sent before convert: 你我,
sent after convert: 你我,
sentences: ['你我,']
[] [] [] [['ni3', 'wo3', None]]
texts: []
onnx_input: {'input_ids': array([], dtype=float64), 'token_type_ids': array([], dtype=float64), 'attention_masks': array([], dtype=float64), 'phoneme_masks': array([], dtype=float32), 'char_ids': array([], dtype=float64), 'position_ids': array([], dtype=float64)}
[你我,] not in g2pW dict,use g2pM
sent before convert: 你好
sent after convert: 你好
sentences: ['你好']
char in polyphonic_chars: 好
['你好'] [1] [0] [['ni3', None]]
texts: ['你好']
onnx_input: {'input_ids': array([[ 101,  872, 1962,  102]]), 'token_type_ids': array([[0, 0, 0, 0]]), 'attention_masks': array([[1, 1, 1, 1]]), 'phoneme_masks': array([[0., 0., 0., ..., 0., 0., 0.]], dtype=float32), 'char_ids': array([580]), 'position_ids': array([2])}

Maybe you should also check this: https://github.com/GitYCC/g2pW/blob/ece11b8dfad0c3ecf25a4e1cfa3274485b447ecf/scripts/predict_g2p_bert.py#L30
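A minimal sketch of the empty-input guard suggested above (the function signatures are illustrative, not the actual fix that later landed in #10):

```python
def predict_pinyin(sentence, prepare_data, run_bert, fallback):
    """Skip the BERT model entirely when a sentence contains no polyphonic
    characters: prepare_data returns empty texts for such input, and feeding
    [] to the ONNX session is what triggers the error above."""
    texts, query_ids, sent_ids, partial_results = prepare_data(sentence)
    if not texts:  # nothing to disambiguate, e.g. "概念" or "你我"
        return fallback(sentence, partial_results)
    return run_bert(texts, query_ids, sent_ids, partial_results)
```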

GitYCC commented 2 years ago

Thanks for catching bugs. #10

GitYCC commented 2 years ago

#11

Add g2pW to Chinese frontend PaddlePaddle/PaddleSpeech#2230

beyondguo commented 2 years ago

Hi, I'm not familiar with PaddleSpeech; I only want to use the g2p in PaddleSpeech to get the pinyin of sentences (in order to speed up pinyin generation). Could you give a tiny code example? Thanks a lot!

yt605155624 commented 2 years ago

please check https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/g2p @beyondguo
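For calling g2pW directly (outside PaddleSpeech), here is a minimal sketch following the usage pattern in this repo's README; the example sentence and the exact output shape are illustrative:

```python
from g2pw import G2PWConverter

# style='pinyin' outputs pinyin; style='bopomofo' outputs zhuyin.
# enable_non_tradional_chinese=True lets the converter accept 简体 input.
conv = G2PWConverter(style='pinyin', enable_non_tradional_chinese=True)

print(conv('上校请技术人员校正仪器'))
# e.g. [['shang4', 'xiao4', 'qing3', ..., 'jiao4', 'zheng4', ...]]
```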