`feat_gen` is the normalized one, and you performed normalization again (`normalize_before=True`).
You should use `feat_gen_denorm` -> `normalize_before=True`, or `feat_gen` -> `normalize_before=False`.
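In other words, the two consistent pairings look like this (a minimal sketch, reusing the `text2speech`, `vocoder`, and `text` names from the code later in this thread):

```python
# Option 1: denormalized features; the vocoder normalizes them itself
mel = text2speech(text)["feat_gen_denorm"]
wav = vocoder.inference(mel, normalize_before=True)

# Option 2: already-normalized features; skip the vocoder-side normalization
mel = text2speech(text)["feat_gen"]
wav = vocoder.inference(mel, normalize_before=False)
```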
I tried what you suggested, but the difference still exists. The synthesized quality is the same in both cases, but the duration and the style of speech are different for the same text input. I think the mel spectrograms produced are different in the two cases.
It depends on the model. Some text2mel models always use dropout, and some vocoders use noise as input. If you want to fix it, please try the `always_fix_seed` option:
https://github.com/espnet/espnet/blob/986dadbf85cc0af1e5c9ca8c71754724ef8e1c6f/espnet2/bin/tts_inference.py#L85
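For reference, a minimal sketch of enabling that option (parameter names taken from the linked `tts_inference.py`; `model_path` is assumed defined as elsewhere in this thread):

```python
from espnet2.bin.tts_inference import Text2Speech

# always_fix_seed=True resets the RNG to `seed` before every call,
# making repeated synthesis of the same text deterministic
text2speech = Text2Speech.from_pretrained(
    model_file=model_path,
    seed=777,
    always_fix_seed=True,
)
```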
Yes, I had fixed the seed for both Taco2 and MB-MelGAN, yet I still see the difference.
Please share the reproducible code.
..
Your code did not do the same thing. `wav = text2speech(text)["wav"]` performs both Taco2 inference and vocoder inference, so `torch.manual_seed(3)` has no effect.
Upper case: …
Below case: …
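As an illustration of the point above, a minimal sketch of where the seed has to go so that it actually affects the integrated call (reusing `text2speech` and `text` from this thread):

```python
import torch

# the integrated call runs Taco2 *and* the vocoder, so the seed must be
# set immediately before it, not once earlier in the script
torch.manual_seed(3)
wav = text2speech(text)["wav"]
```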
How can I make it exactly the same then?
Why do you want to separate?
For espnet_onnx, I want to use the ONNX Tacotron2 model, as the PWG vocoders such as PWG and MB-MelGAN are currently not supported. Hence I want to separate the two stages so I can use the non-ONNX vocoder.
See https://github.com/Masao-Someki/espnet_onnx/issues/29#issuecomment-1155224738
Please read this part: https://github.com/espnet/espnet/blob/986dadbf85cc0af1e5c9ca8c71754724ef8e1c6f/espnet2/bin/tts_inference.py#L193-L213
The second one should be:
```python
from espnet2.bin.tts_inference import Text2Speech
from parallel_wavegan.utils import load_model
from IPython.display import display, Audio
import torch
import time

# vocoder_path, model_path, train_config, device_type, and text are assumed to be defined
vocoder = load_model(vocoder_path).to(device_type).eval()
vocoder.remove_weight_norm()

torch.manual_seed(3)
text2speech = Text2Speech.from_pretrained(
    model_file=model_path,
    train_config=train_config,
    device=device_type,
    # Only for Tacotron 2
    threshold=0.5,
    minlenratio=0.0,
    maxlenratio=20.0,
    use_att_constraint=True,
    backward_window=3,
    forward_window=7,
    # Only for FastSpeech & FastSpeech2
    speed_control_alpha=1.0,
)
# remove vocoder
text2speech.vocoder = None

start = time.time()
with torch.no_grad():
    torch.manual_seed(343)
    mel = text2speech(text)["feat_gen_denorm"]
    wav = vocoder.inference(torch.tensor(mel), normalize_before=True)
wav = wav.view(-1).detach().numpy()
elapsed_time = time.time() - start
print(elapsed_time)
display(Audio(wav, rate=22050))
```
> For espnet_onnx, I want to use the ONNX Tacotron2 model, as the PWG vocoders such as PWG and MB-MelGAN are currently not supported. Hence I want to separate the two stages so I can use the non-ONNX vocoder.
I still could not understand your motivation for checking output consistency in the separated case.
As the model is in production, I only want to reduce the inference time for my next model, but I don't want to change the inference output style.
> As the model is in production, I only want to reduce the inference time for my next model, but I don't want to change the inference output style.
So why do you need to compare the outputs between the integrated case and the separated case? I think what you need to check is the consistency of the separated case.
E.g.,

```python
with torch.no_grad():
    torch.manual_seed(343)
    mel1 = text2speech(text)["feat_gen_denorm"]
    wav1 = vocoder.inference(mel1, normalize_before=True)

with torch.no_grad():
    torch.manual_seed(343)
    mel2 = text2speech(text)["feat_gen_denorm"]
    wav2 = vocoder.inference(mel2, normalize_before=True)

# check that mel1 and mel2 are the same / wav1 and wav2 are the same
```
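A minimal sketch of that final check, assuming `mel1`/`mel2` and `wav1`/`wav2` are the tensors produced by the snippet above:

```python
import torch

# exact equality is expected here, since the seed is reset before each run
assert torch.allclose(mel1, mel2), "mel spectrograms differ"
assert torch.allclose(wav1, wav2), "waveforms differ"
```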
Yes, you are right. Actually, I have been using espnet 0.10.0 in production, where there was always the separated case. In the recent espnet version, which is supported by espnet_onnx, the wav file is generated directly.
After doing this comparison, I think espnet 0.10.4 with seed 343 does not produce the same output as the latest espnet version with the same seed. I will run some more tests to confirm whether some other dependency or package changed.
> After doing this comparison, I think espnet 0.10.4 with seed 343 does not produce the same output as the latest espnet version with the same seed.
I'm not sure about the consistency among different versions (it may also be related to the PyTorch version). And if you want to compare with ONNX outputs, they are not guaranteed to be exactly the same as the PyTorch outputs.
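If you do compare across versions or runtimes, a tolerance-based check is more realistic than bitwise equality. A sketch, where `mel_torch` and `mel_onnx` are hypothetical names for the two runtimes' outputs:

```python
import numpy as np

# small numeric drift between runtimes/versions is normal, so compare
# with a tolerance rather than requiring exact equality
np.testing.assert_allclose(mel_torch, mel_onnx, rtol=1e-3, atol=1e-4)
```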
Yes, that could be the reason as well. Thanks for your kind inputs.
How can I make Text2Speech not run the Griffin-Lim vocoder (to save compute time), since I apply the vocoder directly to the mel spectrogram features?
See this part:

```python
# remove vocoder
text2speech.vocoder = None
```
I investigated this further; it appears that the audio quality is very different between case 1 and case 2 and cannot be explained by the difference i…
I think this is caused by the different seed on Tacotron2. You can confirm it by generating two audios with different seeds.
Hi, I tried synthesizing the wav file in two ways.
Decoding using text2speech -> feat_gen -> vocoder produces different results than text2speech -> wav.