Closed yt605155624 closed 2 years ago
@yt605155624 @BarryKCL Thanks for your effort. Great job! Do you need any help?
@BarryKCL Great work converting the torch model to ONNX. Could I invite you to submit a PR that replaces the models with ONNX versions to speed things up?
@GitYCC not yet, but after it is merged and used by community users of PaddleSpeech, we may find some problems that need your help. There is a small bug in g2pw: some words are missing from your 简体->繁体 dict (sorry, I don't really understand Taiwanese or Minnan; maybe using a 简体 dict would be more convenient for our mainland users). You can try inputting the word "概念". In @BarryKCL's script, he added a try/except to work around this bug and falls back to g2pM; please check https://github.com/PaddlePaddle/PaddleSpeech/blob/aecf8fd3844371abcce5d337fab83aae6807285b/paddlespeech/t2s/frontend/zh_frontend.py#L186
@yt605155624
The root cause of this problem is that our model is trained on a Traditional Chinese (繁体) dataset. So, if we want to handle 简体 input, I need to use the OpenCC package to convert it. But OpenCC still has some cases that it cannot convert well. Maybe we could look for a better method to do this conversion.
We also tried opencc for 繁体 -> 简体 -> 繁体 in PaddleSpeech TTS, but because opencc had a bug when installing on Windows (not sure whether this is still a bug now), we removed opencc and now use a look-up table (maybe this table was copied from somewhere on GitHub, I don't remember). You can check https://github.com/PaddlePaddle/PaddleSpeech/blob/0eb598b876f99bd26fa735577da92d46c45dc3fd/paddlespeech/t2s/frontend/zh_normalization/text_normlization.py#L81 , but I'm not sure whether it works well for Taiwan/Minnan 繁体.
But let me put it another way, if it's possible for you:
I don't know the complexity of this task, because I don't understand the Minnan language at all.. 🥺
The problem still exists. In order to convert the dataset into 简体, we need a good 繁体 -> 简体 converter. If we had such a converter, our first problem would already be solved. XD
Maybe we can just use the look-up table (https://github.com/PaddlePaddle/PaddleSpeech/blob/0eb598b876f99bd26fa735577da92d46c45dc3fd/paddlespeech/t2s/frontend/zh_normalization/text_normlization.py#L81 ) to solve this problem temporarily.
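For illustration, the look-up-table idea can be sketched as a character-level replacement in plain Python. The tiny mapping below is a made-up sample for demonstration, not the actual table used in PaddleSpeech:

```python
# Sketch of a character-level Traditional -> Simplified (繁体 -> 简体) converter
# driven by a look-up table. The mapping here is a tiny illustrative sample,
# NOT the real table from PaddleSpeech's text_normlization.py.
T2S_TABLE = {
    "們": "们",
    "語": "语",
    "學": "学",
}

def traditional_to_simplified(text: str) -> str:
    # Characters missing from the table pass through unchanged --
    # which is exactly why the completeness of the table matters.
    return "".join(T2S_TABLE.get(ch, ch) for ch in text)
```

For example, `traditional_to_simplified("我們學語言")` returns `"我们学语言"`, while any character absent from the table is left untouched.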
Of course, I will submit the onnxruntime code as soon as possible.
Thank you!
@GitYCC but even with a not-so-good 繁体 -> 简体 converter, you can still get a dataset (though the amount of available data may be reduced). I don't know whether the reduced dataset will hurt the model's performance on 简体 g2pW.
Actually, in my opinion, no matter whether the 繁体 -> 简体 converter is good or not, if we just use such a converter, the effect of applying it to the dataset beforehand is the same as applying it to the input at inference time: the error cases filtered out of the dataset would never be trained on, so such models still cannot deal with the characters that the conversion misses.
I'm not very familiar with NLP, so I naively thought:
The G2P BERT only has to be trained on polyphonic characters, not on words such as "概念". Even if "概念" is not in your dataset after the 繁体 -> 简体 conversion, the pretrained BERT must have seen "概念" before.
Because people in mainland China use Simplified Chinese, I naively thought there might not be as many "missing chars" in Simplified Chinese as in Traditional Chinese for the pretrained BERT vocab. For example, simplified "概念" may be in BERT's vocab, but opencc cannot convert simplified "概念" to traditional "概念"; and even if traditional "概念" is in BERT's vocab, a traditional g2pW still cannot deal with simplified "概念" input.
g2pW is an excellent piece of work, and I think it will have a great influence in the Chinese community (most of whom use Simplified Chinese). If it is blocked by a bad converter, I will be very sad.
I have an idea for building a good look-up table: we can use Google Translate to help us.
Like this:
I will change the converter this way in the future.
good idea
oh, I just found that when inputting "概念", the error is not caused by the 简体 -> 繁体 converter, but by the fact that there are no polyphones in "概念". So the `texts` output of `prepare_data` is `[]`, and the input of BERT is `[]`.. maybe an empty-input check will fix this:
sent before convert: 概念,
sent after convert: 概念,
sentences: ['概念,']
[] [] [] [['gai4', 'nian4', None]]
texts: []
onnx_input: {'input_ids': array([], dtype=float64), 'token_type_ids': array([], dtype=float64), 'attention_masks': array([], dtype=float64), 'phoneme_masks': array([], dtype=float32), 'char_ids': array([], dtype=float64), 'position_ids': array([], dtype=float64)}
[概念,] not in g2pW dict,use g2pM
sent before convert: 你我,
sent after convert: 你我,
sentences: ['你我,']
[] [] [] [['ni3', 'wo3', None]]
texts: []
onnx_input: {'input_ids': array([], dtype=float64), 'token_type_ids': array([], dtype=float64), 'attention_masks': array([], dtype=float64), 'phoneme_masks': array([], dtype=float32), 'char_ids': array([], dtype=float64), 'position_ids': array([], dtype=float64)}
[你我,] not in g2pW dict,use g2pM
sent before convert: 你好
sent after convert: 你好
sentences: ['你好']
char in polyphonic_chars: 好
['你好'] [1] [0] [['ni3', None]]
texts: ['你好']
onnx_input: {'input_ids': array([[ 101, 872, 1962, 102]]), 'token_type_ids': array([[0, 0, 0, 0]]), 'attention_masks': array([[1, 1, 1, 1]]), 'phoneme_masks': array([[0., 0., 0., ..., 0., 0., 0.]], dtype=float32), 'char_ids': array([580]), 'position_ids': array([2])}
maybe you should also check this: https://github.com/GitYCC/g2pW/blob/ece11b8dfad0c3ecf25a4e1cfa3274485b447ecf/scripts/predict_g2p_bert.py#L30
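The empty-input guard suggested above might look something like the following sketch. The function name `predict_polyphones` and its signature are hypothetical, with `session` assumed to be an `onnxruntime.InferenceSession`:

```python
import numpy as np

def predict_polyphones(session, onnx_input: dict):
    """Run the g2pW ONNX model, skipping inference when the sentence
    contains no polyphonic characters (i.e. texts == [] in the logs
    above, so every array in onnx_input is empty)."""
    if onnx_input["input_ids"].size == 0:
        # Nothing for BERT to disambiguate: return no predictions
        # instead of feeding empty arrays into the model.
        return []
    return session.run(None, onnx_input)
```

The caller can then fall back to the monophone pinyin already produced by the dictionary lookup for the rest of the sentence.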
Thanks for catching bugs. #10
- Add g2pW to Chinese frontend PaddlePaddle/PaddleSpeech#2230
Hi, I'm not familiar with PaddleSpeech. I only want to use the g2p in PaddleSpeech to get the pinyin of sentences (in order to speed up pinyin generation). Could you give a tiny code example? Thanks a lot!
please check https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/g2p @beyondguo
Thanks for the excellent work. @BarryKCL has integrated g2pW into PaddleSpeech, which greatly improves the accuracy of polyphone disambiguation.
PaddleSpeech is an open-source toolkit on the PaddlePaddle platform for a variety of critical speech and audio tasks, with state-of-the-art and influential models.
@BarryKCL converted the torch model to ONNX and replaced the dependency `transformers` with `paddlenlp`. Thank you again from the bottom of my heart! 😘
check: