Closed TheHonestBob closed 2 years ago
@azraelkuan can you help?
@TheHonestBob For 1, to my understanding there is no conflict with `trim_silence = True`. If you use MFA for alignment, you get `sil` labels at the beginning and end, so we can cut the leading and trailing silence from the audio according to the MFA result, which is much more accurate.
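The idea above (cutting silence from the alignment rather than from an energy threshold) can be sketched as follows. The `(phone, start_s, end_s)` interval format and the helper name are assumptions for illustration, based on typical MFA TextGrid output, not this repo's exact parsing code:

```python
# Trim leading/trailing silence using MFA-style alignment intervals.
# Intervals are assumed to be (phone, start_seconds, end_seconds) tuples,
# with "sil" marking silence, as commonly produced from MFA TextGrids.

def trim_by_alignment(audio, intervals, sample_rate):
    """Cut samples before the first and after the last non-"sil" phone."""
    speech = [iv for iv in intervals if iv[0] != "sil"]
    if not speech:
        return audio  # nothing but silence; leave untouched
    start = int(speech[0][1] * sample_rate)
    end = int(speech[-1][2] * sample_rate)
    return audio[start:end]

# Example: 1 s of "audio" at 100 Hz with 0.2 s silence on each side.
audio = list(range(100))
intervals = [("sil", 0.0, 0.2), ("w", 0.2, 0.5),
             ("o3", 0.5, 0.8), ("sil", 0.8, 1.0)]
trimmed = trim_by_alignment(audio, intervals, 100)  # keeps samples 20..79
```

Because the cut points come from the aligner, the trimmed audio stays consistent with the phoneme durations, which is why this is more accurate than a pure energy-based trim.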
For 2, the training dataset contains prosody labels, so we can train with #1-#4. But at inference we usually only have plain text, like "我们在这里" ("we are here"), which we convert to phonemes. If you want to use #1-#4 at inference too, you need to build a small model to predict the prosody labels, e.g. BERT or similar.
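For reference, splitting a prosody-annotated Baker-style sentence into characters plus break tags can be done with a small tokenizer like the one below. The tag set #0-#4 follows the Baker convention; the function name and the output format are illustrative, not this repo's API:

```python
import re

# Split prosody-annotated Chinese text into characters and break tags.
# A tag "#n" is attached to the character it follows, so downstream code
# can insert the corresponding prosody token into the phoneme sequence.

def split_prosody(text):
    """Return (chars, breaks) where breaks maps the index of the
    character after which a prosodic break occurs to its tag."""
    chars, breaks = [], {}
    for token in re.findall(r"#[0-4]|[^#]", text):
        if token.startswith("#"):
            breaks[len(chars) - 1] = token
        else:
            chars.append(token)
    return chars, breaks

chars, breaks = split_prosody("我们#1在这里#4")
# chars  -> ['我', '们', '在', '这', '里']
# breaks -> {1: '#1', 4: '#4'}
```

A predictor (e.g. a BERT tagger) trained to emit these tags from raw text would let you reuse the same #1-#4 inputs at inference time.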
Thanks for your reply. For 1, I used to get durations that way; now I want to use tacotron2 to extract durations, with the following steps:
We process the mel after trimming silence, so we only need to extract durations; the mel and the durations are then aligned by tacotron2. We do not need to run preprocessing again.
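One common way to turn a teacher tacotron2 alignment into per-phoneme durations is to count, for each encoder (phoneme) position, how many decoder (mel) frames put their attention peak there. This NumPy sketch shows that recipe; it is not necessarily this repo's exact extraction script:

```python
import numpy as np

# Convert a tacotron2 attention matrix into per-phoneme durations by
# counting, per encoder position, the decoder frames whose attention
# peak lands on it. The durations always sum to the number of mel frames.

def attention_to_durations(attn):
    """attn: [n_mel_frames, n_phonemes] attention weights."""
    peaks = attn.argmax(axis=1)  # winning phoneme index per mel frame
    return np.bincount(peaks, minlength=attn.shape[1])

# Toy alignment: 6 mel frames attending over 3 phonemes.
attn = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.0],
    [0.2, 0.7, 0.1],
    [0.1, 0.8, 0.1],
    [0.0, 0.3, 0.7],
    [0.0, 0.1, 0.9],
])
durations = attention_to_durations(attn)  # -> array([2, 2, 2])
```

Since the mel used for training tacotron2 is the already-trimmed one, the extracted durations line up with that same mel, which is why no second preprocessing pass is needed.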
I mean that when I use tacotron2 to extract durations, silence is trimmed by `librosa.effects.trim`, unlike MFA, which uses the 'sil' labels to trim silence.
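For comparison with the alignment-based approach, here is a rough NumPy sketch of what `librosa.effects.trim` does: drop leading and trailing frames whose RMS level falls below a dB threshold relative to the signal peak. The `top_db=60`, frame, and hop defaults mirror librosa's documented defaults, but this is an approximation, not librosa's code:

```python
import numpy as np

# Energy-threshold silence trimming, approximating librosa.effects.trim:
# frame the signal, compute per-frame RMS, and keep everything between
# the first and last frame louder than (peak - top_db) dB.

def trim_silence(y, top_db=60.0, frame_length=2048, hop_length=512):
    n_frames = max(1, 1 + (len(y) - frame_length) // hop_length)
    rms = np.array([
        np.sqrt(np.mean(y[i * hop_length:i * hop_length + frame_length] ** 2))
        for i in range(n_frames)
    ])
    threshold = rms.max() * 10.0 ** (-top_db / 20.0)
    loud = np.flatnonzero(rms > threshold)
    if loud.size == 0:
        return y[:0]  # entirely silent
    start = loud[0] * hop_length
    end = min(len(y), (loud[-1] + 1) * hop_length + frame_length)
    return y[start:end]

# Example: 0.5 s of silence, 1 s of a 440 Hz tone, 0.5 s of silence.
sr = 16000
t = np.arange(sr) / sr
y = np.concatenate([np.zeros(8000), 0.5 * np.sin(2 * np.pi * 440 * t),
                    np.zeros(8000)])
trimmed = trim_silence(y)  # most of the leading/trailing silence removed
```

The key difference from the MFA path is that the cut points here come from energy alone, so they are only accurate to within a frame/hop and can disagree with the aligner's 'sil' boundaries.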
I think you do not need to trim silence after extracting durations.
Thank you very much.
I have two questions:
1. When processing the Baker dataset, in the `get_phoneme_from_char_and_pinyin` function, 'sil' is added to the result, but in `preprocess.py` the `gen_audio_features` function has the `trim_silence` parameter set to `True`. Is there a conflict?
2. When defining the training dataset, '#0, #1, #2, #3' are added to the return value, but at inference they are not. So what is the effect of '#0, #1, #2, #3'?

Thanks a lot!