facebookresearch / fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.

Textless NLP / GSLM: Speech resynthesis produces something unrelated to source speech #3970

Closed osyvokon closed 3 years ago

osyvokon commented 3 years ago

#### What is your question?

As far as I understand, examples/textless_nlp/gslm/tools/resynthesize_speech.py should take a speech sample (audio), encode it to units, and generate output speech from these units. The output speech should resemble the input sample.
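
For context, here is a minimal sketch of that pipeline as I understand it; the function names are placeholders for illustration, not the actual fairseq API:

    def resynthesize(wav, acoustic_model, kmeans, tts_model, vocoder):
        # Placeholder pipeline mirroring the description above.
        feats = acoustic_model(wav)    # dense features (e.g. HuBERT, one layer)
        units = kmeans.predict(feats)  # quantize features to discrete unit IDs
        mel = tts_model(units)         # units -> mel spectrogram (Tacotron2-style)
        return vocoder(mel)            # mel -> waveform (WaveGlow)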

However, when I do this with the released pre-trained models, the output is gibberish that doesn't sound like the input at all.

I've attached the samples and the steps I took. Am I doing anything wrong?

Thank you!

#### Code

  1. Download pre-trained models (HuBERT-km200 in this example):

    mkdir -p /content/speech/hubert200
    cd /content/speech/hubert200
    wget https://dl.fbaipublicfiles.com/hubert/hubert_base_ls960.pt -nc 
    wget https://dl.fbaipublicfiles.com/textless_nlp/gslm/hubert/km200/km.bin -nc
    wget https://dl.fbaipublicfiles.com/textless_nlp/gslm/hubert/tts_km200/tts_checkpoint_best.pt -nc 
    wget https://dl.fbaipublicfiles.com/textless_nlp/gslm/waveglow_256channels_new.pt -nc
  2. Generate the code_dict.txt file. I didn't find an "official" description of how to do it, so I used this comment. Note that if I use a dict of size 199 or 200, the models fail:

    with open("code_dict.txt", "wt") as f:
        for i in range(1, 199):   # effectively 198 items
            f.write(str(i) + "\n")
  3. Download and convert the source audio sample from the speech resynthesis example site:

    wget https://speechbot.github.io/resynthesis/audio/teaser/p269_182.mp3 -nc
    ffmpeg -y -i p269_182.mp3 sample.input.wav
  4. Run resynthesis:

    
    export FAIRSEQ_ROOT=/home/ubuntu/fairseq
    export DATA=/content/speech/hubert200
    export TYPE=hubert

    echo sample.input.wav > input.txt
    echo sample.out.layer5.wav >> input.txt

    PYTHONPATH=${FAIRSEQ_ROOT}:${FAIRSEQ_ROOT}/examples/textless_nlp/gslm/unit2speech python ${FAIRSEQ_ROOT}/examples/textless_nlp/gslm/tools/resynthesize_speech.py \
        --feature_type $TYPE \
        --layer 5 \
        --acoustic_model_path $DATA/hubert_base_ls960.pt \
        --kmeans_model_path $DATA/km.bin \
        --tts_model_path $DATA/tts_checkpoint_best.pt \
        --code_dict_path $DATA/code_dict.txt \
        --waveglow_path $DATA/waveglow_256channels_new.pt \
        --max_decoder_steps 1000 < input.txt



  5. Check the result (in the [attachment](https://github.com/pytorch/fairseq/files/7390879/samples.zip)). It doesn't sound like the original audio at all.

#### What have you tried?

I tried running resynthesis with different numbers of units, taking different HuBERT layers for features, using different audio files, and using different offsets for `code_dict.txt`.

In addition to the steps outlined above, I tried generating speech with `unit2speech` directly from units in the dev set. It still produces gibberish, which makes me think the problem may lie in a bad pre-trained TTS checkpoint.

#### What's your environment?

 - fairseq Version (e.g., 1.0 or main): main
 - PyTorch Version (e.g., 1.0): 1.9.1
 - OS (e.g., Linux): Ubuntu 18.04
 - How you installed fairseq (`pip`, source): source
 - Build command you used (if compiling from source): `pip install -e .`
 - Python version: 3.7.0
 - CUDA/cuDNN version: cuda_11.1.TC455_06.29190527_0
 - GPU models and configuration: Tesla V100-SXM2
 - Any other relevant information:

[samples.zip](https://github.com/pytorch/fairseq/files/7390879/samples.zip) contains the generated samples, both audio and units.
orcsun commented 3 years ago

Thank you @asivokon for posting this issue. I ran into the same thing. It seems like the missing code_dict.txt file does some magic mapping, and it should not just be a sequence of indices from 1 to N.

Uncomfy commented 3 years ago

Most of the TTS models I've tried have a smaller embedding layer than the number of units in the corresponding K-means model (which is strange, since TTS adds one symbol for padding and one for EOS, so it should be bigger). One model where the K-means and TTS embedding sizes match is HuBERT + KM50, but it still produces gibberish for me whether I use a dictionary with numbers from 1 to 50 or from 0 to 49.
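
A quick way to see the size mismatch is to compare the embedding table in the TTS checkpoint with the K-means model's cluster count. This is only a sketch: the exact checkpoint layout is an assumption and may differ between files.

    import joblib
    import torch

    # Assumptions: km.bin is a joblib-dumped sklearn KMeans, and the TTS
    # checkpoint is a dict whose weights sit under a "model"-like key.
    km = joblib.load("km.bin")
    ckpt = torch.load("tts_checkpoint_best.pt", map_location="cpu")
    state = ckpt["model"] if isinstance(ckpt, dict) and "model" in ckpt else ckpt

    for name, tensor in state.items():
        if "embedding" in name and tensor.dim() == 2:
            print(name, tuple(tensor.shape))  # rows = n_symbols

    print("k-means clusters:", km.n_clusters)
    # Expect n_symbols == n_clusters + 2 (padding + EOS); anything smaller
    # means some unit IDs cannot be embedded.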

osyvokon commented 3 years ago

@hikushalhere, could you please help with this?

eugene-kharitonov commented 3 years ago

Hello @asivokon, thanks for your interest in our work!

  1. Please use HuBERT layer 6; I believe the pre-trained checkpoints assume this.
  2. I can confirm that the code_dict for e.g. HuBERT-100 is just a text file with the numbers [0...99] inclusive.
  3. [Here] I've included a code file and a resynthesis output for your input example, obtained using the vocab-100 model.

Please let me know if this solves your issue.
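
For reference, generating such a code_dict (one number per line, 0 through 99) is a one-liner:

    # Write 0..99, ending with "99\n" but no empty line after it;
    # a trailing blank line would add a spurious symbol.
    with open("code_dict.txt", "wt") as f:
        f.write("\n".join(str(i) for i in range(100)) + "\n")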

bradgrimm commented 3 years ago

@eugene-kharitonov No luck for me. If I use your code_dict it crashes.

RuntimeError: cuDNN error: CUDNN_STATUS_MAPPING_ERROR

From what I can tell, the HuBERT KM100 file has n_symbols=101, which means 99 numbers (0-98) plus the pad token and the end-of-sentence token. If I remove one number (from either the beginning or the end) it no longer crashes, but I get gibberish.

To make it even more confusing, the CPC file has n_symbols=102, so it has one more symbol than HuBERT (and should work with your file), but your file also has an empty line at the end, causing it to crash (not sure if that is intentional). If I remove the newline it stops crashing, but I get gibberish with the CPC models too.
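
To check a downloaded code_dict for that trailing blank line, something like this works (the filename is just an example):

    # Count entries and blank lines in a code_dict file.
    with open("code_dict_100") as f:
        lines = f.read().splitlines()
    print("total lines:", len(lines))
    print("blank lines:", sum(1 for line in lines if not line.strip()))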

Are you certain the uploaded files are correct? I'm downloading the following files:

 - Acoustic: https://dl.fbaipublicfiles.com/hubert/hubert_base_ls960.pt
 - SPU: https://dl.fbaipublicfiles.com/textless_nlp/gslm/hubert/km100/km.bin
 - UTS: https://dl.fbaipublicfiles.com/textless_nlp/gslm/hubert/tts_km100/tts_checkpoint_best.pt
 - Waveglow: https://dl.fbaipublicfiles.com/textless_nlp/gslm/waveglow_256channels_new.pt

Here's the exact command I'm using:

PYTHONPATH=${FAIRSEQ_ROOT}:${FAIRSEQ_ROOT}/examples/textless_nlp/gslm/unit2speech python ${FAIRSEQ_ROOT}/examples/textless_nlp/gslm/tools/resynthesize_speech.py \
    --feature_type hubert \
    --acoustic_model_path /mnt/large/data/pretrained/hubert_base_ls960.pt \
    --layer 6 \
    --kmeans_model_path /mnt/large/data/pretrained/km_100.bin \
    --tts_model_path /mnt/large/data/pretrained/hubert_base_km100.pt \
    --code_dict_path /mnt/large/data/pretrained/code_dict_100 \
    --waveglow_path /mnt/large/data/pretrained/waveglow_256channels_new.pt \
    --max_decoder_steps 2000
osyvokon commented 3 years ago

@bradgrimm beat me to it :)

@eugene-kharitonov, thank you for your input. Unfortunately, the code either fails (when using the code_100 file you provided) or produces gibberish (when code_100 is reduced by one element from either end).

The comment above nicely describes the same steps and struggles I have. I'll just add a few details to it.

Here are the hashes of the downloaded HuBERT-100 models:

    $ md5sum *.pt *.bin
    badba6cc1805a58422232d7b2859d8a1  hubert_base_ls960.pt
    5c0ee2869b4f483d17f37f1a41a548e0  tts_km100.pt
    b507a2716bb5174904762012a5df385d  waveglow_256channels_new.pt
    6580f2afcddac54e8233e3ba0eba0677  km.bin

Also, when using the original code_100 file, the model clearly crashes with an index error:

    /pytorch/aten/src/ATen/native/cuda/Indexing.cu:702: indexSelectLargeIndex: block: [307,0,0], thread: [127,0,0] Assertion `srcIndex < srcSelectDimSize` failed.

That makes sense, since after adding the padding and EOS symbols, the units for the encoded sound look like this, and the final EOS index 101 is out of range for an embedding table with n_symbols=101 (valid indices are 0..100):

    tensor([ 72,  13,  64,   ......,  21, 101])
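
A minimal CPU reproduction of that out-of-range lookup, with the sizes taken from this thread:

    import torch

    emb = torch.nn.Embedding(num_embeddings=101, embedding_dim=8)  # n_symbols=101
    ids = torch.tensor([72, 13, 64, 21, 101])  # final EOS ID is 101
    try:
        emb(ids)  # valid indices are 0..100, so 101 fails
    except IndexError as err:
        print("embedding lookup failed:", err)
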
eugene-kharitonov commented 3 years ago

@asivokon & @bradgrimm - thanks for helping me debug this issue. I was hoping this was due to using a different layer, but it does seem to be something else.

Can you please try this TTS checkpoint (corresponding to HuBERT, km100, layer 6, md5 df4a9c6ffd1bb00c91405432c234aba3)? I can confirm that the other files have the same checksums as those I use. The code-dict is 0..99 with no empty line/newline at the end. I use fairseq main at commit c203097814ed1d1dcbc5cadcc7fa89072c96f361.

For your input file, the tts_input tensor is:

    tensor([ 72, 13, 64, 9, 64, 9, 64, 9, 35, 68, 55, 41, 64, 94, 87, 10, 66, 100, 28, 48, 12, 46, 65, 81, 86, 54, 45, 19, 28, 64, 32, 37, 60, 46, 70, 66, 4, 25, 14, 59, 45, 81, 27, 43, 20, 30, 29, 42, 11, 38, 87, 45, 86, 10, 17, 4, 100, 93, 90, 79, 88, 25, 62, 38, 74, 17, 51, 88, 59, 10, 86, 20, 82, 84, 11, 21, 101])

osyvokon commented 3 years ago

@eugene-kharitonov, the updated checkpoint df4a9c6f works great!

Having spent over a week in futile attempts to reproduce the results, this newly generated sample sounds like heavenly music to my ears -- just can't stop listening to it! :)

Now, all the other checkpoints I tried (hubert200, hubert500, logmel-100) had the same problem of generating gibberish. Could you please double-check whether those files (likely all the other TTS checkpoints) need to be re-released as well?

Thanks a lot for an impressive piece of work!

eugene-kharitonov commented 3 years ago

Oh, great that it works, thanks for checking! I'll verify the rest of the checkpoints.

bradgrimm commented 3 years ago

@eugene-kharitonov It works for me too. Thank you! Let me know when you get the HuBERT-500 model fixed; that's the one I'm most interested in.

eugene-kharitonov commented 3 years ago

@asivokon I've updated the TTS checkpoints, provided code_dict files, and manually verified that a few of the checkpoints work. @bradgrimm, unfortunately it seems we don't have a good HuBERT-500 model. As those were not used in the paper, we decided not to support the 500-unit case. Sorry about the confusion.

Thanks for your help!

KaushalNaresh commented 1 year ago

Hi,

I am also resynthesizing audio using the same checkpoints, but I am not getting correct results: my generated audio seems truncated.

I have attached a Google Drive link with the original and generated audio. If someone knows what the issue is, please help me out.

Thanks

Link