espnet / espnet_onnx

Onnx wrapper for espnet inference model
MIT License

The inference results of espnet_onnx are inconsistent with espnet. #111

Closed: hexisyztem closed this issue 4 weeks ago

hexisyztem commented 6 months ago

After converting the FastSpeech2 model with espnet_onnx, the audio generated by the model is distorted.

Using the model: kan-bayashi/jsut_fastspeech2

Download method:


from espnet_model_zoo.downloader import ModelDownloader
d = ModelDownloader("~/.cache/espnet")
d.download_and_unpack("kan-bayashi/jsut_fastspeech2")

Python inference method

import torch
from espnet2.bin.tts_inference import Text2Speech

# Set the paths for the model file and configuration file
model_path = "~/.cache/espnet/6bcf613d7d73d2ba1ec6508e6b9f1177/exp/tts_train_fastspeech2_raw_phn_jaconv_pyopenjtalk/train.loss.ave_5best.pth"  # Model weights file
config_path = "~/.cache/espnet/6bcf613d7d73d2ba1ec6508e6b9f1177/exp/tts_train_fastspeech2_raw_phn_jaconv_pyopenjtalk/config.yaml"  # Configuration file

# Load the model
model = Text2Speech.from_pretrained(model_tag=None, train_config=config_path, model_file=model_path, device='cuda:0')

# Text to be synthesized
text = "私はあなたに好意を持っていますが、あなたが少し控えめなのが感じられます。それが社交的な不安からなのか、それとも私に対する興味がないからなのか、わかりません。もし可能であれば、もう少し積極的になってくれませんか?"

# Perform speech synthesis
with torch.no_grad():
    wav = model(text)["wav"]

Convert to ONNX format

import torch
from espnet2.bin.tts_inference import Text2Speech
from espnet_onnx.export import TTSModelExport

# Set the paths for the model file and configuration file
model_path = "~/.cache/espnet/6bcf613d7d73d2ba1ec6508e6b9f1177/exp/tts_train_fastspeech2_raw_phn_jaconv_pyopenjtalk/train.loss.ave_5best.pth"  # Model weights file
config_path = "~/.cache/espnet/6bcf613d7d73d2ba1ec6508e6b9f1177/exp/tts_train_fastspeech2_raw_phn_jaconv_pyopenjtalk/config.yaml"  # Configuration file

# Load the model
model = Text2Speech.from_pretrained(model_tag=None, train_config=config_path, model_file=model_path)

ex = TTSModelExport()
ex.export(model=model)

ONNX inference method

from espnet_onnx import Text2Speech
PROVIDERS = ['CUDAExecutionProvider']
text2speech = Text2Speech(model_dir='${path_to_onnx_model}', providers=PROVIDERS)
text = "私はあなたに好意を持っていますが、あなたが少し控えめなのが感じられます。それが社交的な不安からなのか、それとも私に対する興味がないからなのか、わかりません。もし可能であれば、もう少し積極的になってくれませんか?"
output_dict = text2speech(text)
wav = output_dict['wav']
hexisyztem commented 6 months ago

By the way, I found that the results of espnet2.tts.fastspeech2.fastspeech2.FastSpeech2 and espnet_onnx.export.tts.models.tts_models.fastspeech2.OnnxFastSpeech2 are inconsistent. Logically, these two should be completely equivalent, with the only difference being that the former does not support torch.onnx.export, while the latter does.

hexisyztem commented 6 months ago

For example:

import torch
import numpy as np
from espnet2.bin.tts_inference import Text2Speech

# Set the paths for the model file and configuration file
model_path = "/workspace/6bcf613d7d73d2ba1ec6508e6b9f1177/exp/tts_train_fastspeech2_raw_phn_jaconv_pyopenjtalk/train.loss.ave_5best.pth"  # Model weights file
config_path = "/workspace/6bcf613d7d73d2ba1ec6508e6b9f1177/exp/tts_train_fastspeech2_raw_phn_jaconv_pyopenjtalk/config.yaml"  # Configuration file

# Load the model
model = Text2Speech.from_pretrained(model_tag=None,train_config=config_path, model_file=model_path, device='cpu')

# Text to be synthesized
text = "私はあなたに好意を持っていますが、あなたが少し控えめなのが感じられます。それが社交的な不安からなのか、それとも私に対する興味がないからなのか、わかりません。もし可能であれば、もう少し積極的になってくれませんか?"

# Perform speech synthesis
with torch.no_grad():
    wav = model(text)["wav"]
# This gives the Python (PyTorch) inference result

from espnet_onnx.export.tts.models.tts_models.fastspeech2 import OnnxFastSpeech2
export_config = dict(
    opset_version=12,
    max_seq_len=2048,
)
onnx_pymodel = OnnxFastSpeech2(model.model.tts, **export_config)

text_tensor = torch.tensor([25, 41, 34, 41, 28, 39, 25, 41, 41, 35, 41, 34, 41, 35, 39, 36, 40, 40,
                   39, 40, 30, 40, 22, 34, 37, 39, 30, 41, 32, 24, 26, 41, 29, 41, 35, 41,
                   34, 41, 26, 41, 32, 24, 36, 40, 28, 23, 21, 39, 36, 41, 37, 30, 37, 35,
                   41, 35, 40, 26, 41, 36, 41, 31, 18, 39, 33, 41, 33, 37, 30, 41, 32, 24,
                   29, 32, 40, 33, 37, 26, 41, 28, 41, 36, 40, 40, 34, 37, 36, 39, 35, 41,
                   13, 38, 41, 31, 36, 41, 33, 41, 35, 41, 35, 40, 36, 41, 29, 32, 40, 33,
                   37, 34, 40, 30, 40, 25, 41, 34, 41, 28, 39, 35, 39, 34, 41, 39, 32, 38,
                   33, 38, 12, 40, 40, 30, 39, 26, 41, 35, 41, 39, 36, 41, 33, 41, 35, 41,
                   35, 40, 36, 41, 29, 25, 41, 36, 41, 33, 39, 30, 41, 32, 37, 31, 29, 30,
                   40, 28, 23, 36, 41, 35, 40, 40, 27, 37, 41, 33, 37, 19, 41, 29, 30, 40,
                   40, 32, 24, 36, 40, 28, 23, 32, 37, 22, 12, 40, 36, 38, 34, 37, 36, 39,
                   35, 39, 35, 41, 22, 34, 37, 36, 38, 33, 37, 30, 41, 32, 37, 31, 36, 41], dtype=torch.int32, device='cpu')

output = onnx_pymodel.forward(text_tensor)
print(output['feat_gen'].shape)
np.savetxt('onnx_py_feat_gen.txt', output['feat_gen'].cpu().numpy(), fmt='%.2f', delimiter=',')
Masao-Someki commented 5 months ago

Thank you for reporting this issue. I will look into it this weekend...

hexisyztem commented 5 months ago

https://github.com/espnet/espnet_onnx/blob/master/espnet_onnx/export/tts/models/tts_models/fastspeech2.py#L203 I've checked the intermediate values and found that the mismatch appears only at the LengthRegulator module. The computations in the rest of the FastSpeech2 model, both before and after this module, are consistent.

hexisyztem commented 5 months ago

https://github.com/espnet/espnet_onnx/blob/master/espnet_onnx/export/asr/models/layers/embed.py#L184 This line should be:

self.alpha = model.alpha
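
To illustrate why a missing alpha could change the output, here is a minimal sketch. It assumes the linked line belongs to a scaled positional encoding of the form x + alpha * pe, where alpha is a learned scalar. The function names and the value 1.8 are hypothetical and chosen only for illustration; this is not the espnet_onnx code.

```python
import numpy as np


def sinusoidal_pe(length: int, dim: int) -> np.ndarray:
    """Standard sinusoidal positional encoding, shape (length, dim); dim must be even."""
    position = np.arange(length, dtype=np.float64)[:, None]
    div_term = np.exp(np.arange(0, dim, 2, dtype=np.float64) * -(np.log(10000.0) / dim))
    pe = np.zeros((length, dim))
    pe[:, 0::2] = np.sin(position * div_term)
    pe[:, 1::2] = np.cos(position * div_term)
    return pe


def scaled_positional_encoding(x: np.ndarray, alpha: float) -> np.ndarray:
    """Add the positional encoding to x, scaled by alpha."""
    return x + alpha * sinusoidal_pe(*x.shape)


x = np.zeros((4, 8))
trained_alpha = 1.8  # hypothetical learned value, not taken from the real checkpoint

with_trained = scaled_positional_encoding(x, trained_alpha)
with_default = scaled_positional_encoding(x, 1.0)  # alpha left at its initial value

# The two outputs differ by (trained_alpha - 1.0) * pe.
max_diff = np.abs(with_trained - with_default).max()
print(max_diff)
```

Under this assumption, a wrapper that keeps a default alpha instead of copying the trained one would shift every frame that passes through the layer, which would be consistent with a mismatch that appears only after a particular module.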

Masao-Someki commented 5 months ago

I ran your code above, and feat_gen seems to match between Torch and ONNX, so I think OnnxFastSpeech2 does not have any issues. But if you got a different result for wav, the vocoder side may have an issue. I will investigate further.

BTW, this repository assumes batch_size=1. If you run LengthRegulator with multiple batches, you will get different results for the additional batches.
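
For readers unfamiliar with length regulation, the idea for a single utterance can be sketched as follows: each encoder frame is repeated according to its predicted duration. This is an illustrative NumPy version with hypothetical names, not the espnet_onnx implementation.

```python
import numpy as np


def length_regulate(hs: np.ndarray, durations: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    """Repeat each row of hs (T, D) by its duration; alpha scales the durations."""
    scaled = np.round(durations * alpha).astype(np.int64)
    return np.repeat(hs, scaled, axis=0)


hs = np.array([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])  # three encoder frames
durations = np.array([2, 0, 3])                      # frames to emit per input

expanded = length_regulate(hs, durations)
print(expanded.shape)  # (5, 2)
```

With several utterances in a batch, each one expands to a different length and the results have to be padded to a common size, which is one place where code written for batch_size=1 can behave differently.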

Masao-Someki commented 5 months ago

I noticed that the feat_gen from the converted model is denormalized, unlike in the Torch version where the denormalized features are stored in feat_gen_denorm. I also checked the wav, and it sounds very similar.

Could you please check again to see which results do not match?
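
The sketch below shows why comparing a normalized array with a denormalized one reports a mismatch even when the models agree. It assumes global mean-variance normalization, so that denormalization is x * std + mean; the statistics and feature values are made up for illustration.

```python
import numpy as np


def denormalize(feat: np.ndarray, mean: np.ndarray, std: np.ndarray) -> np.ndarray:
    """Invert global mean-variance normalization: x * std + mean."""
    return feat * std + mean


# Hypothetical statistics and features, for illustration only.
mean = np.array([-5.0, -4.0])
std = np.array([2.0, 1.5])
feat_norm = np.array([[0.0, 1.0], [1.0, -1.0]])  # stands in for a normalized output
feat_denorm = denormalize(feat_norm, mean, std)  # stands in for a denormalized output

# Comparing across the two representations gives a spurious mismatch.
print(np.allclose(feat_norm, feat_denorm))                          # False
# Comparing like with like shows the values agree.
print(np.allclose(denormalize(feat_norm, mean, std), feat_denorm))  # True
```

In the setting described above, this would mean comparing the converted model's feat_gen against the Torch feat_gen_denorm rather than against the Torch feat_gen.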

hexisyztem commented 5 months ago

You haven't made any changes? Including the issue with self.alpha = model.alpha? I will verify again shortly and reply to you in about 20 minutes. I will keep in touch via email.

hexisyztem commented 5 months ago
import torch
import numpy as np
from espnet2.bin.tts_inference import Text2Speech
# Set the paths for the model file and configuration file
model_path = "/workspace/6bcf613d7d73d2ba1ec6508e6b9f1177/exp/tts_train_fastspeech2_raw_phn_jaconv_pyopenjtalk/train.loss.ave_5best.pth" # File path
config_path = "/workspace/6bcf613d7d73d2ba1ec6508e6b9f1177/exp/tts_train_fastspeech2_raw_phn_jaconv_pyopenjtalk/config.yaml" # Configuration file

model = Text2Speech.from_pretrained(model_tag=None,train_config=config_path, model_file=model_path, device='cpu')

text_tensor = torch.tensor([25, 41, 34, 41, 28, 39, 25, 41, 41, 35, 41, 34, 41, 35, 39, 36, 40, 40,
                   39, 40, 30, 40, 22, 34, 37, 39, 30, 41, 32, 24, 26, 41, 29, 41, 35, 41,
                   34, 41, 26, 41, 32, 24, 36, 40, 28, 23, 21, 39, 36, 41, 37, 30, 37, 35,
                   41, 35, 40, 26, 41, 36, 41, 31, 18, 39, 33, 41, 33, 37, 30, 41, 32, 24,
                   29, 32, 40, 33, 37, 26, 41, 28, 41, 36, 40, 40, 34, 37, 36, 39, 35, 41,
                   13, 38, 41, 31, 36, 41, 33, 41, 35, 41, 35, 40, 36, 41, 29, 32, 40, 33,
                   37, 34, 40, 30, 40, 25, 41, 34, 41, 28, 39, 35, 39, 34, 41, 39, 32, 38,
                   33, 38, 12, 40, 40, 30, 39, 26, 41, 35, 41, 39, 36, 41, 33, 41, 35, 41,
                   35, 40, 36, 41, 29, 25, 41, 36, 41, 33, 39, 30, 41, 32, 37, 31, 29, 30,
                   40, 28, 23, 36, 41, 35, 40, 40, 27, 37, 41, 33, 37, 19, 41, 29, 30, 40,
                   40, 32, 24, 36, 40, 28, 23, 32, 37, 22, 12, 40, 36, 38, 34, 37, 36, 39,
                   35, 39, 35, 41, 22, 34, 37, 36, 38, 33, 37, 30, 41, 32, 37, 31, 36, 41], dtype=torch.int32, device='cpu')

from espnet_onnx.export.tts.models.tts_models.fastspeech2 import OnnxFastSpeech2
export_config = dict(
    opset_version=13,
    max_seq_len=2048,
)
model2 = Text2Speech.from_pretrained(model_tag=None,train_config=config_path, model_file=model_path, device='cpu')
onnx_pymodel = OnnxFastSpeech2(model2.model.tts, **export_config)

################ 

onnx_output = onnx_pymodel.forward(text_tensor)
espnet_output = model.model.tts.inference(text_tensor)
onnx_output['feat_gen']
espnet_output['feat_gen']

After changing that line to self.alpha = model.alpha, feat_gen now matches. I will convert the model to .onnx again for verification.

hexisyztem commented 5 months ago
import onnxruntime as ort
sess = ort.InferenceSession("/root/.cache/espnet_onnx/20240519_191620/full/fast_speech2.onnx")
input_name = sess.get_inputs()[0].name
output_name1 = "feat_gen"
output_name2 = "out_duration"
import numpy as np
input_data = np.array([25, 41, 34, 41, 28, 39, 25, 41, 41, 35, 41, 34, 41, 35, 39, 36, 40, 40,
                   39, 40, 30, 40, 22, 34, 37, 39, 30, 41, 32, 24, 26, 41, 29, 41, 35, 41,
                   34, 41, 26, 41, 32, 24, 36, 40, 28, 23, 21, 39, 36, 41, 37, 30, 37, 35,
                   41, 35, 40, 26, 41, 36, 41, 31, 18, 39, 33, 41, 33, 37, 30, 41, 32, 24,
                   29, 32, 40, 33, 37, 26, 41, 28, 41, 36, 40, 40, 34, 37, 36, 39, 35, 41,
                   13, 38, 41, 31, 36, 41, 33, 41, 35, 41, 35, 40, 36, 41, 29, 32, 40, 33,
                   37, 34, 40, 30, 40, 25, 41, 34, 41, 28, 39, 35, 39, 34, 41, 39, 32, 38,
                   33, 38, 12, 40, 40, 30, 39, 26, 41, 35, 41, 39, 36, 41, 33, 41, 35, 41,
                   35, 40, 36, 41, 29, 25, 41, 36, 41, 33, 39, 30, 41, 32, 37, 31, 29, 30,
                   40, 28, 23, 36, 41, 35, 40, 40, 27, 37, 41, 33, 37, 19, 41, 29, 30, 40,
                   40, 32, 24, 36, 40, 28, 23, 32, 37, 22, 12, 40, 36, 38, 34, 37, 36, 39,
                   35, 39, 35, 41, 22, 34, 37, 36, 38, 33, 37, 30, 41, 32, 37, 31, 36, 41], dtype=np.int64)
onnx_outputs = sess.run(None, {input_name: input_data})
################################ 

import torch
import numpy as np
from espnet2.bin.tts_inference import Text2Speech

model_path = "/workspace/6bcf613d7d73d2ba1ec6508e6b9f1177/exp/tts_train_fastspeech2_raw_phn_jaconv_pyopenjtalk/train.loss.ave_5best.pth"  
config_path = "/workspace/6bcf613d7d73d2ba1ec6508e6b9f1177/exp/tts_train_fastspeech2_raw_phn_jaconv_pyopenjtalk/config.yaml"  

model = Text2Speech.from_pretrained(model_tag=None,train_config=config_path, model_file=model_path, device='cpu')
text_tensor = torch.tensor(input_data, dtype=torch.int32, device='cpu')

espnet_output = model.model.tts.inference(text_tensor)
espnet_output['feat_gen']
onnx_outputs[0]

The values indeed align now.

hexisyztem commented 5 months ago
image

There are small numerical fluctuations, but I believe they are within the normal range.
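
One way to quantify such fluctuations is to report the maximum and mean absolute difference together with a tolerance check. The sketch below uses synthetic arrays in place of the real outputs, and the tolerances are assumptions rather than values recommended by either project.

```python
import numpy as np


def compare_outputs(reference: np.ndarray, candidate: np.ndarray,
                    rtol: float = 1e-3, atol: float = 1e-4) -> dict:
    """Summarize how closely two arrays of the same shape agree."""
    diff = np.abs(reference - candidate)
    return {
        "max_abs_diff": float(diff.max()),
        "mean_abs_diff": float(diff.mean()),
        "allclose": bool(np.allclose(reference, candidate, rtol=rtol, atol=atol)),
    }


# Synthetic stand-ins for espnet_output['feat_gen'] and onnx_outputs[0].
rng = np.random.default_rng(0)
reference = rng.standard_normal((100, 80)).astype(np.float32)
candidate = reference + np.float32(1e-5)  # simulated float32-level drift

report = compare_outputs(reference, candidate)
print(report)
```

With the real outputs, the Torch tensor would first need converting, for example with espnet_output['feat_gen'].detach().numpy(), before being passed in alongside onnx_outputs[0].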

hexisyztem commented 5 months ago

https://github.com/espnet/espnet_onnx/pull/112

Masao-Someki commented 4 months ago

@hexisyztem Sorry for the late reply! Thanks for the confirmation, I will merge #112!