espnet / espnet

End-to-End Speech Processing Toolkit
https://espnet.github.io/espnet/
Apache License 2.0

Text2Speech produces different outputs. #4448

Closed · sciai-ai closed this 2 years ago

sciai-ai commented 2 years ago

Hi, I tried synthesizing the wav file in two ways.

Decoding via text2speech -> feat_gen -> vocoder produces different results than decoding via text2speech -> wav directly:

from espnet2.bin.tts_inference import Text2Speech
from parallel_wavegan.utils import load_model
import torch

# Case 1: run text2mel only, then apply the vocoder separately
text2speech = Text2Speech('your model')
mel = text2speech(text)['feat_gen']
vocoder = load_model(vocoder_path).to(device_type).eval()
vocoder.remove_weight_norm()
wav = vocoder.inference(torch.tensor(mel), normalize_before=True)

# Case 2: run the integrated pipeline and take the waveform directly
text2speech = Text2Speech('your model')
wav = text2speech(text)['wav']
kan-bayashi commented 2 years ago

'feat_gen' is the normalized one, and you performed normalization again (normalize_before=True). You should use feat_gen_denorm with normalize_before=True, or feat_gen with normalize_before=False.
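In code, the two consistent combinations look like this (a sketch reusing text2speech and vocoder from the snippet above):

# Option A: denormalized features, let the vocoder normalize them itself
mel = text2speech(text)["feat_gen_denorm"]
wav = vocoder.inference(torch.tensor(mel), normalize_before=True)

# Option B: already-normalized features, skip the vocoder-side normalization
mel = text2speech(text)["feat_gen"]
wav = vocoder.inference(torch.tensor(mel), normalize_before=False)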

sciai-ai commented 2 years ago

I tried what you suggested, but the difference still exists. The synthesized quality is the same in both cases, but the duration and the style of speech differ for the same text input. I think the mel spectrograms produced are different in the two cases.

kan-bayashi commented 2 years ago

It depends on the model. Some text2mel models always use dropout, and some vocoders use noise as input. If you want to fix it, please try the always_fix_seed option. https://github.com/espnet/espnet/blob/986dadbf85cc0af1e5c9ca8c71754724ef8e1c6f/espnet2/bin/tts_inference.py#L85
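For reference, a minimal sketch of enabling that option (model_path is a placeholder; from_pretrained forwards keyword arguments to the Text2Speech constructor):

from espnet2.bin.tts_inference import Text2Speech

text2speech = Text2Speech.from_pretrained(
    model_file=model_path,
    always_fix_seed=True,  # re-fix the random seed at every inference call
)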

sciai-ai commented 2 years ago

Yes, I had fixed the seed for both Tacotron 2 and MB-MelGAN, yet I still see the difference.

kan-bayashi commented 2 years ago

Please share reproducible code.

sciai-ai commented 2 years ago

..

kan-bayashi commented 2 years ago

Your code did not do the same thing: wav = text2speech(text)["wav"] performs both the Tacotron 2 inference and the vocoder inference inside a single call, so your torch.manual_seed(3) has no effect.

(The contrasted "upper case" and "below case" code snippets were not preserved.)
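A sketch of the seed placement that makes the two paths comparable (using the names from the earlier snippets; this is an illustration, not code from the thread):

# Integrated case: one call runs text2mel and the vocoder together,
# so the seed must be set immediately before that single call
torch.manual_seed(3)
wav = text2speech(text)["wav"]

# Separated case: the same seed set before the text2mel call
torch.manual_seed(3)
mel = text2speech(text)["feat_gen_denorm"]
wav = vocoder.inference(torch.tensor(mel), normalize_before=True)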

sciai-ai commented 2 years ago

How can I make it exactly the same, then?

kan-bayashi commented 2 years ago

Why do you want to separate the two stages?

sciai-ai commented 2 years ago

For espnet_onnx, I want to use the ONNX Tacotron 2 model, as the PWG vocoders (such as PWG and MB-MelGAN) are currently not supported. Hence I want to separate the stages so I can use the non-ONNX vocoder.

See https://github.com/Masao-Someki/espnet_onnx/issues/29#issuecomment-1155224738

kan-bayashi commented 2 years ago

Please read this part: https://github.com/espnet/espnet/blob/986dadbf85cc0af1e5c9ca8c71754724ef8e1c6f/espnet2/bin/tts_inference.py#L193-L213

The second one should be:

from espnet2.bin.tts_inference import Text2Speech
from parallel_wavegan.utils import load_model
from IPython.display import display, Audio
import torch 
import time
vocoder = load_model(vocoder_path).to(device_type).eval()
vocoder.remove_weight_norm()

torch.manual_seed(3)
text2speech = Text2Speech.from_pretrained(
    model_file=model_path,
    train_config=train_config,
    device=device_type,
    # Only for Tacotron 2
    threshold=0.5,
    minlenratio=0.0,
    maxlenratio=20.0,
    use_att_constraint=True,
    backward_window=3,
    forward_window=7,
    # Only for FastSpeech & FastSpeech2
    speed_control_alpha=1.0,
)

# remove vocoder
text2speech.vocoder = None

start = time.time()
with torch.no_grad():
    torch.manual_seed(343)
    mel = text2speech(text)["feat_gen_denorm"]
    wav = vocoder.inference(torch.tensor(mel), normalize_before=True)

wav = wav.view(-1).detach().numpy()

elapsed_time = time.time() - start
print(elapsed_time)
display(Audio(wav, rate=22050))
kan-bayashi commented 2 years ago

For espnet_onnx, I want to use the ONNX Tacotron 2 model, as the PWG vocoders (such as PWG and MB-MelGAN) are currently not supported. Hence I want to separate the stages so I can use the non-ONNX vocoder.

I still cannot understand your motivation for wanting to check output consistency between the integrated and separated cases.

sciai-ai commented 2 years ago

As the model is in production, I only want to reduce the inference time for my next model, but I don't want to change the style of the inference output.

kan-bayashi commented 2 years ago

As the model is in production, I only want to reduce the inference time for my next model, but I don't want to change the style of the inference output.

So why do you need to compare the outputs between the integrated case and the separated case? I think what you need to check is the consistency of the separated case.

E.g.,

with torch.no_grad():
    torch.manual_seed(343)
    mel1 = text2speech(text)["feat_gen_denorm"]
    wav1 = vocoder.inference(torch.tensor(mel1), normalize_before=True)

with torch.no_grad():
    torch.manual_seed(343)
    mel2 = text2speech(text)["feat_gen_denorm"]
    wav2 = vocoder.inference(torch.tensor(mel2), normalize_before=True)

# check that mel1 and mel2 are the same / wav1 and wav2 are the same
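A minimal sketch of that check, assuming both runs return torch tensors:

# identical seeds should yield identical features and waveforms
assert torch.allclose(mel1, mel2), "text2mel output differs between runs"
assert torch.allclose(wav1, wav2), "vocoder output differs between runs"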
sciai-ai commented 2 years ago

Yes, you are right. Actually, I have been using espnet 0.10.0 in production, where inference was always the separated case. In the recent espnet version supported by espnet_onnx, the wav file is generated directly.

After doing this comparison, I think espnet 0.10.4 with seed 343 does not produce the same output as the latest espnet version with the same seed. I will run some more tests to confirm whether some other dependency or package changed.

kan-bayashi commented 2 years ago

After doing this comparison, I think espnet 0.10.4 with seed 343 does not produce the same output as the latest espnet version with the same seed.

I'm not sure about consistency across different versions (it may also be related to the PyTorch version). And if you want to compare with ONNX outputs, they are not guaranteed to be exactly the same as the PyTorch outputs.

sciai-ai commented 2 years ago

Yes, that could be the reason as well. Thanks for your kind input.

sciai-ai commented 2 years ago

.

sciai-ai commented 2 years ago

.

kan-bayashi commented 2 years ago

How can I make Text2Speech not use the Griffin-Lim vocoder (to save compute time), as I apply the vocoder directly to the mel-spectrogram features?

See this part

# remove vocoder
text2speech.vocoder = None
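Putting it together, a sketch of the intended flow after removing the internal vocoder (with the vocoder attribute set to None, no "wav" entry should be produced, so only the features are used; vocoder here is the external parallel_wavegan model loaded earlier):

text2speech.vocoder = None
output = text2speech(text)        # no internal vocoder is applied
mel = output["feat_gen_denorm"]   # take the denormalized features
wav = vocoder.inference(torch.tensor(mel), normalize_before=True)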
kan-bayashi commented 2 years ago

I investigated this further; it appears that the audio quality is very different between case 1 and case 2 and cannot be explained by the difference i[...]

I think this is caused by the different seed used for Tacotron 2. You can confirm it by generating two audio samples with different seeds.
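A sketch of that confirmation (the seed values are arbitrary): if Tacotron 2 keeps dropout enabled at inference time, the two waveforms will differ audibly.

with torch.no_grad():
    torch.manual_seed(1)
    wav_a = text2speech(text)["wav"]
    torch.manual_seed(2)
    wav_b = text2speech(text)["wav"]
# audibly different wav_a and wav_b confirm seed-dependent inference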