facebookresearch / fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
MIT License

Textless NLP / GSLM: Speech resynthesis produces silent .wav output #4758

Open nonmetal opened 2 years ago

nonmetal commented 2 years ago

### ❓ Questions and Help

#### What is your question?

Hello, I'm currently following the tutorial and struggling with a problem where examples/textless_nlp/gslm/tools/resynthesize_speech.py produces a completely silent .wav file.

I don't think the problem is happening in the WaveGlow (vocoder) step, because the mel-spectrogram from Tacotron2 (the `mel` variable in /examples/textless_nlp/gslm/unit2speech/utils.py) already shows no content. The k-means model (km.bin) also seems fine, since the length of the generated file varies with the length of the input file.

I wasn't sure whether I had a dependency or package issue (such as CUDA), so I reproduced these steps in several environments. However, both a fresh Anaconda environment (torch 1.12.1 + CUDA 11.3) and Google Colab (torch 1.12.1 + CUDA 10.1) showed the same result.

I'm attaching the input file, the output file, and the corresponding mel-spectrogram plot below. Do you have any idea why this is happening?

Thanks a lot!

#### Code

  1. Downloaded pre-trained models from repo (HuBERT-km200 in this example)

    • acoustic model
    • k-means model
    • tts checkpoint model
    • code dict
    • vocoder (waveglow)
  2. Got a sample voice file (LJSpeech for this example): 84-121123-0005.flac

  3. In resynthesize_speech.py, added code to plot the mel-spectrogram:

```python
import matplotlib.pyplot as plt
import librosa


def plot_spectrogram(specgram, title=None, ylabel="freq_bin"):
    """Plot a mel-spectrogram and save it to disk for inspection."""
    fig, axs = plt.subplots(1, 1)
    axs.set_title(title or "Spectrogram (db)")
    axs.set_ylabel(ylabel)
    axs.set_xlabel("frame")
    im = axs.imshow(librosa.power_to_db(specgram), origin="lower", aspect="auto")
    fig.colorbar(im, ax=axs)
    plt.savefig("figure01.jpg")


while True:
    # ~~~ existing resynthesis loop ~~~
    plot_spectrogram(
        mel[0].cpu().float().numpy(),
        title="MelSpectrogram - torchaudio",
        ylabel="mel freq",
    )
    # ~~~
```
  4. Ran the resynthesis script:

```bash
export FAIRSEQ_ROOT=/home/ubuntu/fairseq
export DATA=/home/my/path/models

PYTHONPATH=${FAIRSEQ_ROOT}:${FAIRSEQ_ROOT}/examples/textless_nlp/gslm/unit2speech \
python ${FAIRSEQ_ROOT}/examples/textless_nlp/gslm/tools/resynthesize_speech.py \
    --feature_type 'hubert' \
    --layer 6 \
    --acoustic_model_path $DATA/hubert_base_ls960.pt \
    --kmeans_model_path $DATA/km.bin \
    --tts_model_path $DATA/tts_checkpoint_best.pt \
    --code_dict_path $DATA/code_dict.txt \
    --waveglow_path $DATA/waveglow_256channels_new.pt \
    --max_decoder_steps 2000
```



  5. Checked the mel-spectrogram: it is empty, showing no content (a quick programmatic check for this is sketched after the list).

  ![mel](https://github.com/nonmetal/fairseq/raw/main/mel_fail.jpg)
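
For anyone hitting the same symptom, here is a minimal diagnostic sketch (illustrative only; it assumes `mel` is the Tacotron2 output tensor available at the point where the plot above is generated) that distinguishes an all-NaN or all-zero spectrogram from one that is merely plotted incorrectly:

```python
import torch

def describe_mel(mel: torch.Tensor) -> None:
    """Print basic statistics of a mel-spectrogram tensor to spot NaN or empty output."""
    m = mel.detach().float().cpu()
    print("shape:", tuple(m.shape))
    print("any NaN:", bool(torch.isnan(m).any()))
    print("any Inf:", bool(torch.isinf(m).any()))
    finite = m[torch.isfinite(m)]
    if finite.numel() > 0:
        print("finite min/max:", finite.min().item(), finite.max().item())
    else:
        print("no finite values at all")

# Usage inside the resynthesis loop, next to plot_spectrogram(...):
# describe_mel(mel[0])
```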

#### What have you tried?
I also tried other acoustic models (CPC and wav2vec so far) and different k-means models.
I also (and originally) tried to produce a wav file using `gslm/unit2speech/synthesize_audio_from_units.py`. It shows the same result: a silent output file with no sound.
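
As a side note, a silent-looking file can also be checked programmatically. The sketch below is illustrative (it assumes the `soundfile` package is installed and uses a placeholder path):

```python
import numpy as np
import soundfile as sf

def is_silent(path: str, threshold: float = 1e-4) -> bool:
    """Return True if the peak absolute amplitude in the file is below the threshold."""
    audio, sr = sf.read(path)
    peak = float(np.max(np.abs(audio))) if audio.size else 0.0
    print(f"{path}: sr={sr}, samples={audio.size}, peak={peak:.6f}")
    return peak < threshold

# Example with a placeholder filename:
# print(is_silent("resynthesized_output.wav"))
```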

#### What's your environment?
(main environment)
- fairseq Version (e.g., 1.0 or main): main (0.12.2)
- PyTorch Version (e.g., 1.0): 1.12.1
- OS (e.g., Linux): Ubuntu 20.04.5 LTS (GNU/Linux 5.4.0-125-generic x86_64)
- How you installed fairseq (pip, source): `git clone https://github.com/pytorch/fairseq`
- Build command you used (if compiling from source): `pip install --editable ./`
- Python version: 3.9.13
- CUDA/cuDNN version: Build cuda_11.3.r11.3/compiler.29745058_0
- GPU models and configuration: GeForce RTX 3090 (NVIDIA Corporation Device 2204 (rev a1))
- Any other relevant information: - 
cywang97 commented 1 year ago

I have met the same problem. Have you solved it?

cywang97 commented 1 year ago

Hi, I found that the WaveGlow model generates NaN tensors, which leads to the silent output. I fixed the issue by using fp32: try removing the `.half()` calls in the `load_waveglow` and `load_tacotron` functions. Hope this helps.
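
For reference, a minimal sketch of the kind of change described above (illustrative only, not the exact fairseq source; the `["model"]` checkpoint key is an assumption based on the NVIDIA WaveGlow release format):

```python
import torch

def load_waveglow(waveglow_path):
    # Load the pretrained WaveGlow vocoder checkpoint (key layout assumed
    # from the NVIDIA WaveGlow release).
    waveglow = torch.load(waveglow_path, map_location="cuda")["model"]
    waveglow.cuda().eval()
    # Fix: keep the model in fp32. The original code converted it to fp16 via
    # `waveglow = waveglow.half()`, which produced NaN audio (heard as silence)
    # on some GPU/driver combinations, so that call is removed here.
    return waveglow
```

The same idea applies to `load_tacotron`: drop the `.half()` conversion there as well, and make sure the tensors fed to both models stay in fp32 so the dtypes match.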

nonmetal commented 1 year ago

> Hi, I found that the WaveGlow model generates NaN tensors, which leads to the silent output. I fixed the issue by using fp32: try removing the `.half()` calls in the `load_waveglow` and `load_tacotron` functions. Hope this helps.

That method completely works! Thanks a lot for solving my problem 👍👍
