facebookresearch / fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
MIT License
30.26k stars 6.38k forks

Text-to-Speech problem #4175

Open kormoczi opened 2 years ago

kormoczi commented 2 years ago

Hi, I am trying to use this model with fairseq: https://huggingface.co/facebook/tts_transformer-zh-cv7_css10. I am using the following code snippet for model download and initialization:

from fairseq.checkpoint_utils import load_model_ensemble_and_task_from_hf_hub
from fairseq.models.text_to_speech.hub_interface import TTSHubInterface
models, cfg, task = load_model_ensemble_and_task_from_hf_hub(
    "facebook/tts_transformer-zh-cv7_css10",
    arg_overrides={"vocoder": "hifigan", "fp16": False}
)
model = models[0]
TTSHubInterface.update_cfg_with_data_cfg(cfg, task.data_cfg)
generator = task.build_generator(model, cfg)

But when I run this code, I get an error: "'TTSTransformerModel' object is not subscriptable". If I replace model with models in build_generator, like this:

generator = task.build_generator(models, cfg)

then the initialization goes through, but I get an error during the text-to-speech step. The code snippet for this part is the following:

text = "您好,这是试运行。"
sample = TTSHubInterface.get_model_input(task, text)
wav, rate = TTSHubInterface.get_prediction(task, model, generator, sample)

And within get_prediction I get the following error: "Input type (torch.FloatTensor) and weight type (torch.cuda.FloatTensor) should be the same".

Any suggestion or idea on how I can get this to work? Or where can I find an example? Thanks!

The environment is the following:

kahne commented 2 years ago

@kormoczi Thanks very much for reporting the issue! We will take a look shortly.

kahne commented 2 years ago

Hi @kormoczi , can you add

import torch
from fairseq.utils import move_to_cuda

sample = move_to_cuda(sample) if torch.cuda.is_available() else sample

to move input tensors to GPU?

generator = task.build_generator(models, cfg) is the right one. Sorry for the confusion. We will update the code snippet accordingly.
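For context, move_to_cuda just walks the nested sample dict and moves every tensor it finds to the GPU. A framework-free sketch of that pattern (move_to_device is a hypothetical name; fairseq's real implementation dispatches on torch.Tensor rather than taking an arbitrary callable):

```python
def move_to_device(obj, fn):
    """Recursively apply fn to every leaf of nested dicts, lists, and tuples.

    In fairseq, fn would be something like `lambda t: t.cuda()` applied only
    to torch tensors; it is generalized here so the traversal pattern is
    visible without torch installed.
    """
    if isinstance(obj, dict):
        return {k: move_to_device(v, fn) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return type(obj)(move_to_device(v, fn) for v in obj)
    return fn(obj)
```

This is why calling it on the sample works: the sample is a plain nested dict of tensors, whereas the model and generator are objects that need their own .cuda() calls.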

kormoczi commented 2 years ago

Hi @kahne,

I have added this move_to_cuda() line, but now I get a new error:

RuntimeError: Input, output and indices must be on the current device

on the line TTSHubInterface.get_prediction().

I have tried to use this move_to_cuda() function on the other components as well (models, generator, etc.), but unfortunately it did not help.

Please advise what I should do. Thanks!

kormoczi commented 2 years ago

May I ask for some help on this?

xegulon commented 2 years ago

Using your code snippet (with the above corrections), I get the following error: RuntimeError: Input type (torch.FloatTensor) and weight type (torch.cuda.FloatTensor) should be the same or input should be a MKLDNN tensor and weight is a dense tensor

The code used

from fairseq.checkpoint_utils import load_model_ensemble_and_task_from_hf_hub
from fairseq.models.text_to_speech.hub_interface import TTSHubInterface
import IPython.display as ipd

models, cfg, task = load_model_ensemble_and_task_from_hf_hub(
    "facebook/tts_transformer-fr-cv7_css10",
    arg_overrides={"vocoder": "hifigan", "fp16": False}
)

model = models[0]
TTSHubInterface.update_cfg_with_data_cfg(cfg, task.data_cfg)
generator = task.build_generator(models, cfg)

text = "Bonjour, ceci est un test."
sample = TTSHubInterface.get_model_input(task, text)
wav, rate = TTSHubInterface.get_prediction(task, model, generator, sample)

ipd.Audio(wav, rate=rate)

@kahne

xegulon commented 2 years ago

@kahne now I get this bug: (screenshot)

xegulon commented 2 years ago

And now this: (screenshot)

xegulon commented 2 years ago

Making some progress, but not too much: @kahne (screenshot)

xegulon commented 2 years ago

Finally this worked (the problem was that IPython.display.Audio did not like the wav variable being on device cuda:0): (screenshot)
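That fix can be captured in a small helper (to_playable is a hypothetical name; it duck-types the torch calls so the same code also accepts plain sequences):

```python
def to_playable(wav):
    """Return host-memory audio data that IPython.display.Audio can consume.

    A torch tensor on cuda:0 must be moved to the CPU and converted to a
    NumPy array before playback; plain Python sequences pass through.
    """
    if hasattr(wav, "cpu"):    # torch.Tensor, possibly on a CUDA device
        wav = wav.cpu()
    if hasattr(wav, "numpy"):  # torch.Tensor -> numpy.ndarray
        wav = wav.numpy()
    return wav
```

Usage would then be ipd.Audio(to_playable(wav), rate=rate).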

baiwei0703 commented 1 year ago

I ran into a similar problem when running the example code: (screenshot)

gswyhq commented 1 year ago

I ran into a similar problem when running the example code: (screenshot)

I solved it like this:

models, cfg, task = load_model_ensemble_and_task_from_hf_hub(
    "facebook/tts_transformer-zh-cv7_css10",
    arg_overrides={"vocoder": "hifigan", "fp16": False}
)
model = models  # first change: removed the [0]
TTSHubInterface.update_cfg_with_data_cfg(cfg, task.data_cfg)
generator = task.build_generator(model, cfg)

text = "您好,这是试运行。"

sample = TTSHubInterface.get_model_input(task, text)
wav, rate = TTSHubInterface.get_prediction(task, models[0], generator, sample)  # second change: model -> models[0]

xcg340122 commented 1 year ago

It runs without throwing an exception, but there is no sound.

gswyhq commented 1 year ago

There can be several reasons for no sound: for example, the audio player is muted during playback, the synthesized text is too short (one or two characters), or something else entirely. With the "facebook/tts_transformer-zh-cv7_css10" pretrained model, testing shows that 20-30 characters per synthesis works best; if the text is too short, there is a lot of noise, and if it is too long, the end gets cut off and only the first part is synthesized.
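Based on that observation (roughly 20-30 characters per call works best for this model), longer passages could be split at punctuation and synthesized chunk by chunk, concatenating the resulting waveforms. A sketch of the splitter (chunk_text is a hypothetical helper, not part of fairseq):

```python
import re

def chunk_text(text, max_len=30):
    """Split text after Chinese sentence punctuation, then pack the pieces
    into chunks of at most max_len characters for separate synthesis calls.
    A single piece longer than max_len is kept whole rather than cut mid-word.
    """
    pieces = [p for p in re.split(r"(?<=[。!?,;])", text) if p]
    chunks, current = [], ""
    for piece in pieces:
        if current and len(current) + len(piece) > max_len:
            chunks.append(current)
            current = ""
        current += piece
    if current:
        chunks.append(current)
    return chunks
```

Each chunk would then go through get_model_input and get_prediction separately, with the output waveforms concatenated before playback.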

marcraft2 commented 1 year ago

Replace generator = task.build_generator(model, cfg) with generator = task.build_generator([model], cfg).

Tomhegon commented 1 year ago

The following Python code works after some modifications:

from fairseq.checkpoint_utils import load_model_ensemble_and_task_from_hf_hub
from fairseq.models.text_to_speech.hub_interface import TTSHubInterface
import IPython.display as ipd
from scipy.io.wavfile import write as write_wav
import torch

models, cfg, task = load_model_ensemble_and_task_from_hf_hub(
    "facebook/tts_transformer-zh-cv7_css10",
    arg_overrides={"vocoder": "hifigan", "fp16": False}
)
model = models[0].cuda()  # move the model to the GPU

TTSHubInterface.update_cfg_with_data_cfg(cfg, task.data_cfg)

generator = task.build_generator([model], cfg)

text = '''新华社北京8月16日电(记者董雪、马卓言)针对近期少数西方政客和媒体称,中国经济增长放缓可能对全球经济发展构成风险。外交部发言人汪文斌16日在例行记者会上答问时说,这种论调有悖事实,中国经济持续恢复,总体回升向好,依然是世界经济增长的重要引擎。'''

sample = TTSHubInterface.get_model_input(task, text)
# move the input tensors to the GPU as well
sample["net_input"]["src_tokens"] = sample["net_input"]["src_tokens"].cuda()
sample["net_input"]["src_lengths"] = sample["net_input"]["src_lengths"].cuda()
sample["speaker"] = sample["speaker"].cuda()

wav, rate = TTSHubInterface.get_prediction(task, model, generator, sample)

wav_cpu = wav.to('cpu')

ipd.Audio(wav_cpu.numpy(), rate=rate)

write_wav("audio1.wav", rate, wav_cpu.numpy())