TensorSpeech / TensorFlowTTS

:stuck_out_tongue_closed_eyes: TensorFlowTTS: Real-Time State-of-the-art Speech Synthesis for TensorFlow 2 (supports English, French, Korean, Chinese, and German, and is easy to adapt to other languages)
https://tensorspeech.github.io/TensorFlowTTS/
Apache License 2.0

About process baker dataset #704

Closed TheHonestBob closed 2 years ago

TheHonestBob commented 2 years ago

I have two questions:

1. When processing the baker dataset, the get_phoneme_from_char_and_pinyin function adds 'sil' to the result, but in preprocess.py the gen_audio_features function has trim_silence set to True. Is there a conflict?
2. When the training dataset is defined, '#0,#1,#2,#3' is added to the returned sequence, but it is not added at inference time, so what is the effect of '#0,#1,#2,#3'?

Thanks a lot!
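For context, here is a rough illustration of the kind of sequence the question refers to: a phoneme string wrapped in 'sil' markers with prosody break marks interleaved. The function name, phoneme tokens, and ordering below are assumptions for illustration only, not the library's actual get_phoneme_from_char_and_pinyin implementation.

```python
# Illustrative only: roughly how a Baker-style training sequence might look once
# 'sil' padding and prosody break marks are attached. Not the library's real code.

def build_training_sequence(word_phonemes, breaks):
    """word_phonemes: phoneme tokens per word; breaks: prosody mark after each word."""
    seq = ["sil"]                        # leading silence token
    for phonemes, brk in zip(word_phonemes, breaks):
        seq.extend(phonemes)
        seq.append(brk)                  # e.g. "#1", "#2", "#3"
    seq.append("sil")                    # trailing silence token
    return seq

# 我们#1在#2这里#3 ("we are here", with annotated prosody breaks)
print(build_training_sequence(
    [["wo3", "men2"], ["zai4"], ["zhe4", "li3"]],
    ["#1", "#2", "#3"],
))
```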

dathudeptrai commented 2 years ago

@azraelkuan can you help?

azraelkuan commented 2 years ago

@TheHonestBob For 1: as I understand it, trim_silence = True does not cause a conflict. If you use MFA for alignment, you will get a sil label at the beginning and the end, so we can cut the silence from the beginning and end of the audio according to the MFA result, which is much more accurate.
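A minimal sketch of that idea: cut the waveform using the start/end times of the MFA sil segments instead of an energy threshold. The segment tuple format and the helper name here are assumptions for illustration, not the repo's preprocessing code.

```python
import numpy as np

def trim_with_mfa(audio, sample_rate, segments):
    """Cut leading/trailing silence using an MFA-style alignment.

    audio: 1-D numpy array of samples.
    segments: list of (label, start_sec, end_sec) tuples from the alignment
              (format assumed here for illustration).
    """
    start_sec = 0.0
    end_sec = len(audio) / sample_rate
    if segments and segments[0][0] == "sil":
        start_sec = segments[0][2]   # end of the leading silence segment
    if segments and segments[-1][0] == "sil":
        end_sec = segments[-1][1]    # start of the trailing silence segment
    start = int(start_sec * sample_rate)
    end = int(end_sec * sample_rate)
    return audio[start:end]
```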

For 2: we have prosody labels in the training dataset, so we can use #1-#4 during training. At inference we usually only have text, like "我们在这里" ("we are here"), which we convert to phonemes. If you also want to use #1-#4 at inference, you need to build a simple model to predict the prosody labels, for example with BERT or something similar.
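In that spirit, a tiny sketch of what inference could look like once a prosody predictor supplies the break labels. predict_breaks is a hypothetical stub standing in for such a model; the segmentation and grapheme-to-phoneme results are hard-coded so the example runs.

```python
def predict_breaks(words):
    """Placeholder for a learned prosody predictor (e.g. BERT-based).
    It simply emits '#1' after every word so this example runs."""
    return ["#1"] * len(words)

# Hypothetical inference-time flow; in practice words and phonemes come from
# a word segmenter and a pypinyin-style grapheme-to-phoneme step.
words = ["我们", "在", "这里"]
word_phonemes = [["wo3", "men2"], ["zai4"], ["zhe4", "li3"]]
breaks = predict_breaks(words)

sequence = ["sil"]
for phonemes, brk in zip(word_phonemes, breaks):
    sequence.extend(phonemes)
    sequence.append(brk)
sequence.append("sil")
print(sequence)
```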

TheHonestBob commented 2 years ago

Thanks for your reply. For 1, I extracted the duration that way before; now I want to use Tacotron 2 to extract the duration, with the following steps (see the sketch after this list):

  1. Run preprocess.py; if I don't use MFA, preprocess.py will use librosa.effects.trim.
  2. Create a tf.data dataset from the output of step 1.
  3. Run extract_duration.py on the output of step 1. Since sil is trimmed, Tacotron 2's prediction for 'sil' has no effect. Should I set trim_silence = False when training Tacotron 2 or when extracting the duration with Tacotron 2?
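For reference, a minimal sketch of the librosa-based trimming mentioned in step 1. The filename and the top_db threshold are example values, not values prescribed by the repo.

```python
import librosa

# Energy-threshold trimming: the fallback when no MFA alignment is available.
audio, sample_rate = librosa.load("000001.wav", sr=None)
trimmed, (start, end) = librosa.effects.trim(audio, top_db=30)
print(f"kept samples {start}:{end} of {len(audio)}")
```
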
azraelkuan commented 2 years ago

We compute the mel after trimming the silence, so we just need to extract the duration; the mel and the duration are then aligned by Tacotron 2. We do not need to run preprocess again.
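A minimal sketch of the usual way durations are read off a Tacotron 2 attention alignment: take the argmax per decoder frame, then count frames per input token. The array shapes and toy values below are assumptions for illustration, not the exact extract_duration.py code.

```python
import numpy as np

def alignment_to_durations(alignment, num_phonemes):
    """alignment: [decoder_frames, encoder_steps] attention weights from Tacotron 2.

    For each mel frame, pick the most-attended phoneme, then count how many frames
    each phoneme received. The durations sum to the number of mel frames, so they
    stay consistent with the trimmed mel computed in preprocessing.
    """
    most_attended = np.argmax(alignment, axis=1)                 # phoneme index per frame
    return np.bincount(most_attended, minlength=num_phonemes)

# Toy example: 5 mel frames attending over 3 phonemes.
toy_alignment = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.0],
    [0.1, 0.8, 0.1],
    [0.0, 0.2, 0.8],
    [0.0, 0.1, 0.9],
])
print(alignment_to_durations(toy_alignment, 3))  # -> [2 1 2]
```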

TheHonestBob commented 2 years ago

> We compute the mel after trimming the silence, so we just need to extract the duration; the mel and the duration are then aligned by Tacotron 2. We do not need to run preprocess again.

Do you mean that when I use Tacotron 2 to extract the duration, the silence is trimmed by librosa.effects.trim, rather than by using the 'sil' label the way MFA does?

azraelkuan commented 2 years ago

I think you do not need to trim silence after extracting the duration.

TheHonestBob commented 2 years ago

> I think you do not need to trim silence after extracting the duration.

Thank you very much.