By the way, I found that the results of espnet2.tts.fastspeech2.fastspeech2.FastSpeech2 and espnet_onnx.export.tts.models.tts_models.fastspeech2.OnnxFastSpeech2 are inconsistent. Logically, the two should be exactly equivalent; the only difference should be that the former does not support torch.onnx.export, while the latter does.
For example:
import torch
import numpy as np
from espnet2.bin.tts_inference import Text2Speech

# Set the paths to the model weights and the configuration file
model_path = "/workspace/6bcf613d7d73d2ba1ec6508e6b9f1177/exp/tts_train_fastspeech2_raw_phn_jaconv_pyopenjtalk/train.loss.ave_5best.pth"  # model weights file
config_path = "/workspace/6bcf613d7d73d2ba1ec6508e6b9f1177/exp/tts_train_fastspeech2_raw_phn_jaconv_pyopenjtalk/config.yaml"  # configuration file

# Load the model
model = Text2Speech.from_pretrained(model_tag=None, train_config=config_path, model_file=model_path, device='cpu')

# Text to synthesize
text = "私はあなたに好意を持っていますが、あなたが少し控えめなのが感じられます。それが社交的な不安からなのか、それとも私に対する興味がないからなのか、わかりません。もし可能であれば、もう少し積極的になってくれませんか?"

# Run speech synthesis
with torch.no_grad():
    wav = model(text)["wav"]
# wav is the Torch (Python) inference result
from espnet_onnx.export.tts.models.tts_models.fastspeech2 import OnnxFastSpeech2
export_config = dict(
    opset_version=12,
    max_seq_len=2048,
)
onnx_pymodel = OnnxFastSpeech2(model.model.tts, **export_config)
text_tensor = torch.tensor([25, 41, 34, 41, 28, 39, 25, 41, 41, 35, 41, 34, 41, 35, 39, 36, 40, 40,
39, 40, 30, 40, 22, 34, 37, 39, 30, 41, 32, 24, 26, 41, 29, 41, 35, 41,
34, 41, 26, 41, 32, 24, 36, 40, 28, 23, 21, 39, 36, 41, 37, 30, 37, 35,
41, 35, 40, 26, 41, 36, 41, 31, 18, 39, 33, 41, 33, 37, 30, 41, 32, 24,
29, 32, 40, 33, 37, 26, 41, 28, 41, 36, 40, 40, 34, 37, 36, 39, 35, 41,
13, 38, 41, 31, 36, 41, 33, 41, 35, 41, 35, 40, 36, 41, 29, 32, 40, 33,
37, 34, 40, 30, 40, 25, 41, 34, 41, 28, 39, 35, 39, 34, 41, 39, 32, 38,
33, 38, 12, 40, 40, 30, 39, 26, 41, 35, 41, 39, 36, 41, 33, 41, 35, 41,
35, 40, 36, 41, 29, 25, 41, 36, 41, 33, 39, 30, 41, 32, 37, 31, 29, 30,
40, 28, 23, 36, 41, 35, 40, 40, 27, 37, 41, 33, 37, 19, 41, 29, 30, 40,
40, 32, 24, 36, 40, 28, 23, 32, 37, 22, 12, 40, 36, 38, 34, 37, 36, 39,
35, 39, 35, 41, 22, 34, 37, 36, 38, 33, 37, 30, 41, 32, 37, 31, 36, 41], dtype=torch.int32, device='cpu')
output = onnx_pymodel.forward(text_tensor)
print(output['feat_gen'].shape)
np.savetxt('onnx_py_feat_gen.txt', output['feat_gen'].cpu().numpy(), fmt='%.2f', delimiter=',')
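A quick way to quantify the difference, assuming the snippet above has been run in the same session, is to diff the two feature matrices directly instead of round-tripping through text files (this comparison is my addition, not part of the original report):

# Quantify the mismatch between the Torch and ONNX-wrapper features.
# Assumes `model`, `text_tensor`, and `output` from the snippet above.
# Note: depending on the version, the wrapper's feat_gen may already be
# denormalized (see the discussion below); in that case compare against
# the Torch feat_gen_denorm instead.
with torch.no_grad():
    espnet_output = model.model.tts.inference(text_tensor)
diff = (output['feat_gen'] - espnet_output['feat_gen']).abs()
print("max abs diff: ", diff.max().item())
print("mean abs diff:", diff.mean().item())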
Thank you for reporting this issue. I will look into it this weekend...
https://github.com/espnet/espnet_onnx/blob/master/espnet_onnx/export/tts/models/tts_models/fastspeech2.py#L203 Okay, I've checked the intermediate values and found that the error exists only in the LengthRegulator module; the computations in the rest of the FastSpeech2 model, both before and after this module, are consistent.
https://github.com/espnet/espnet_onnx/blob/master/espnet_onnx/export/asr/models/layers/embed.py#L184 This line should be:
self.alpha = model.alpha
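For context, ScaledPositionalEncoding learns the scale alpha during training, so an export wrapper must copy the trained value rather than re-initialize it. A minimal sketch of the pattern follows; the class name and internals are assumptions for illustration, not the actual espnet_onnx code at the link above:

import torch

class OnnxScaledPositionalEncodingSketch(torch.nn.Module):
    """Illustrative only; names and internals are assumptions."""

    def __init__(self, model, max_seq_len=512):
        super().__init__()
        self.pe = model.pe[:, :max_seq_len]
        # Buggy version: a fresh parameter silently discards the trained scale.
        # self.alpha = torch.nn.Parameter(torch.tensor(1.0))
        # Fixed version: reuse the trained parameter from the source module.
        self.alpha = model.alpha

    def forward(self, x):
        # Matches espnet's ScaledPositionalEncoding: x + alpha * pe
        return x + self.alpha * self.pe[:, : x.size(1)]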
I ran your code above, and it seems feat_gen matches between Torch and ONNX, so I think OnnxFastSpeech2 does not have any issues. But if you got a different result for wav, the vocoder side may have an issue; I will investigate more.
BTW, this repository assumes batch_size=1. If you run LengthRegulator with multiple batches, you will get different results for the additional batches.
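For reference, the core of a length regulator is just a duration-driven repeat of the encoder states; a simplified single-utterance sketch (not the espnet implementation) shows why batched, padded inputs would diverge:

import torch

def length_regulate(hs: torch.Tensor, durations: torch.Tensor) -> torch.Tensor:
    # hs: (1, T_text, D) encoder states; durations: (1, T_text) integer frame counts.
    # With batch_size > 1, padded positions carry state and any nonzero
    # padded durations get expanded too, so extra batch entries differ.
    expanded = torch.repeat_interleave(hs[0], durations[0], dim=0)
    return expanded.unsqueeze(0)  # (1, T_feats, D)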
I noticed that the feat_gen from the converted model is denormalized, unlike the Torch version, where the denormalized features are stored separately in feat_gen_denorm. I also checked the wav, and it sounds very similar.
Could you please check again which results do not match?
You haven't made any changes yet, right? Including the self.alpha = model.alpha issue? I will verify again shortly and reply in about 20 minutes; I will keep in touch via email.
import torch
import numpy as np
from espnet2.bin.tts_inference import Text2Speech
# Set the paths to the model weights and the configuration file
model_path = "/workspace/6bcf613d7d73d2ba1ec6508e6b9f1177/exp/tts_train_fastspeech2_raw_phn_jaconv_pyopenjtalk/train.loss.ave_5best.pth"  # model weights file
config_path = "/workspace/6bcf613d7d73d2ba1ec6508e6b9f1177/exp/tts_train_fastspeech2_raw_phn_jaconv_pyopenjtalk/config.yaml"  # configuration file
model = Text2Speech.from_pretrained(model_tag=None, train_config=config_path, model_file=model_path, device='cpu')
text_tensor = torch.tensor([25, 41, 34, 41, 28, 39, 25, 41, 41, 35, 41, 34, 41, 35, 39, 36, 40, 40,
39, 40, 30, 40, 22, 34, 37, 39, 30, 41, 32, 24, 26, 41, 29, 41, 35, 41,
34, 41, 26, 41, 32, 24, 36, 40, 28, 23, 21, 39, 36, 41, 37, 30, 37, 35,
41, 35, 40, 26, 41, 36, 41, 31, 18, 39, 33, 41, 33, 37, 30, 41, 32, 24,
29, 32, 40, 33, 37, 26, 41, 28, 41, 36, 40, 40, 34, 37, 36, 39, 35, 41,
13, 38, 41, 31, 36, 41, 33, 41, 35, 41, 35, 40, 36, 41, 29, 32, 40, 33,
37, 34, 40, 30, 40, 25, 41, 34, 41, 28, 39, 35, 39, 34, 41, 39, 32, 38,
33, 38, 12, 40, 40, 30, 39, 26, 41, 35, 41, 39, 36, 41, 33, 41, 35, 41,
35, 40, 36, 41, 29, 25, 41, 36, 41, 33, 39, 30, 41, 32, 37, 31, 29, 30,
40, 28, 23, 36, 41, 35, 40, 40, 27, 37, 41, 33, 37, 19, 41, 29, 30, 40,
40, 32, 24, 36, 40, 28, 23, 32, 37, 22, 12, 40, 36, 38, 34, 37, 36, 39,
35, 39, 35, 41, 22, 34, 37, 36, 38, 33, 37, 30, 41, 32, 37, 31, 36, 41], dtype=torch.int32, device='cpu')
from espnet_onnx.export.tts.models.tts_models.fastspeech2 import OnnxFastSpeech2
export_config = dict(
    opset_version=13,
    max_seq_len=2048,
)
model2 = Text2Speech.from_pretrained(model_tag=None, train_config=config_path, model_file=model_path, device='cpu')
onnx_pymodel = OnnxFastSpeech2(model2.model.tts, **export_config)
################
onnx_output = onnx_pymodel.forward(text_tensor)
espnet_output = model.model.tts.inference(text_tensor)
print(onnx_output['feat_gen'])
print(espnet_output['feat_gen'])
After modifying self.alpha = model.alpha, the feat_gen indeed got fixed. I will convert it to .onnx again for verification.
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("/root/.cache/espnet_onnx/20240519_191620/full/fast_speech2.onnx")
input_name = sess.get_inputs()[0].name
output_name1 = "feat_gen"
output_name2 = "out_duration"
input_data = np.array([25, 41, 34, 41, 28, 39, 25, 41, 41, 35, 41, 34, 41, 35, 39, 36, 40, 40,
39, 40, 30, 40, 22, 34, 37, 39, 30, 41, 32, 24, 26, 41, 29, 41, 35, 41,
34, 41, 26, 41, 32, 24, 36, 40, 28, 23, 21, 39, 36, 41, 37, 30, 37, 35,
41, 35, 40, 26, 41, 36, 41, 31, 18, 39, 33, 41, 33, 37, 30, 41, 32, 24,
29, 32, 40, 33, 37, 26, 41, 28, 41, 36, 40, 40, 34, 37, 36, 39, 35, 41,
13, 38, 41, 31, 36, 41, 33, 41, 35, 41, 35, 40, 36, 41, 29, 32, 40, 33,
37, 34, 40, 30, 40, 25, 41, 34, 41, 28, 39, 35, 39, 34, 41, 39, 32, 38,
33, 38, 12, 40, 40, 30, 39, 26, 41, 35, 41, 39, 36, 41, 33, 41, 35, 41,
35, 40, 36, 41, 29, 25, 41, 36, 41, 33, 39, 30, 41, 32, 37, 31, 29, 30,
40, 28, 23, 36, 41, 35, 40, 40, 27, 37, 41, 33, 37, 19, 41, 29, 30, 40,
40, 32, 24, 36, 40, 28, 23, 32, 37, 22, 12, 40, 36, 38, 34, 37, 36, 39,
35, 39, 35, 41, 22, 34, 37, 36, 38, 33, 37, 30, 41, 32, 37, 31, 36, 41], dtype=np.int64)
onnx_outputs = sess.run(None, {input_name: input_data})
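# Optional sanity check (my addition): sess.run(None, ...) returns every graph
# output in order, so confirm which index holds "feat_gen" before relying on
# onnx_outputs[0].
for i, out in enumerate(sess.get_outputs()):
    print(i, out.name)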
################################
import torch
import numpy as np
from espnet2.bin.tts_inference import Text2Speech
model_path = "/workspace/6bcf613d7d73d2ba1ec6508e6b9f1177/exp/tts_train_fastspeech2_raw_phn_jaconv_pyopenjtalk/train.loss.ave_5best.pth"
config_path = "/workspace/6bcf613d7d73d2ba1ec6508e6b9f1177/exp/tts_train_fastspeech2_raw_phn_jaconv_pyopenjtalk/config.yaml"
model = Text2Speech.from_pretrained(model_tag=None, train_config=config_path, model_file=model_path, device='cpu')
text_tensor = torch.tensor(input_data, dtype=torch.int32, device='cpu')
espnet_output = model.model.tts.inference(text_tensor)
print(espnet_output['feat_gen'])
print(onnx_outputs[0])
The values indeed align now.
There are small numerical fluctuations, but I believe they are within the normal range.
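One way to make "within the normal range" concrete is an explicit tolerance check; the tolerances below are my assumption for a float32 export, not a project standard:

import numpy as np

torch_feats = espnet_output['feat_gen'].cpu().numpy()
onnx_feats = onnx_outputs[0]
# Shapes are assumed to match, as in the comparison above.
print(np.allclose(torch_feats, onnx_feats, rtol=1e-3, atol=1e-4))
print("max abs diff:", np.abs(torch_feats - onnx_feats).max())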
@hexisyztem Sorry for the late reply!! Thanks for the confirmation, I will merge #112!
After converting the FastSpeech2 model with espnet_onnx, the audio generated by the model is distorted.
Using the model: kan-bayashi/jsut_fastspeech2
Download method:
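Presumably via espnet_model_zoo; a minimal sketch, assuming the standard downloader API:

from espnet_model_zoo.downloader import ModelDownloader

d = ModelDownloader()
# Downloads and unpacks the model, returning paths such as
# "train_config" and "model_file" that Text2Speech can consume.
info = d.download_and_unpack("kan-bayashi/jsut_fastspeech2")
print(info["train_config"], info["model_file"])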
Python inference method
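A minimal sketch of Torch-side inference with espnet2, assuming the pretrained tag resolves through espnet_model_zoo:

import soundfile as sf
from espnet2.bin.tts_inference import Text2Speech

text2speech = Text2Speech.from_pretrained("kan-bayashi/jsut_fastspeech2")
output = text2speech("こんにちは、世界。")  # any Japanese text works here
# text2speech.fs is the model's sampling rate
sf.write("torch_out.wav", output["wav"].cpu().numpy(), text2speech.fs)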
Convert to ONNX format
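With espnet_onnx's export helper; a sketch assuming the export_from_pretrained entry point:

from espnet_onnx.export import TTSModelExport

m = TTSModelExport()
# Downloads the pretrained model and writes the ONNX files under
# the espnet_onnx cache directory (e.g. ~/.cache/espnet_onnx/<tag>).
m.export_from_pretrained("kan-bayashi/jsut_fastspeech2")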
ONNX inference method
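And ONNX-side inference with espnet_onnx, loading the exported model by tag:

from espnet_onnx import Text2Speech

text2speech = Text2Speech("kan-bayashi/jsut_fastspeech2")
output = text2speech("こんにちは、世界。")
wav = output["wav"]  # numpy array; listen to this to reproduce the distortion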