osyvokon closed this issue 3 years ago.
Thank you, @asivokon, for posting this issue. I ran into the same problem. It seems the missing code_dict.txt file does some magic mapping, so it should not just be a sequence of indices from 1 to N.
Most of the TTS models I've tried have a smaller embedding layer than the number of units in the corresponding K-means model (which is strange, since TTS adds one symbol for padding and one for EOS, so it should be bigger). One model whose K-means and TTS embedding sizes match is HuBERT + KM50, but it still produces gibberish for me whether I use a dictionary with numbers from 1 to 50 or from 0 to 49.
@hikushalhere, could you please help with this?
Hello @asivokon, thanks for your interest in our work!
The code_dict file for, e.g., HuBERT-100 is just a text file with the numbers [0...99] inclusive. Please let me know if this solves your issue.
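A minimal sketch of generating such a file (assuming one unit id per line, 0 through 99, as the comment describes; the filename is illustrative):

```python
# Write a code_dict file with unit ids 0..99, one per line.
# Joining with "\n" avoids a trailing empty line, which (per the
# discussion in this thread) can crash the loader.
n_units = 100

with open("code_dict_100.txt", "w") as f:
    f.write("\n".join(str(i) for i in range(n_units)))
```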
@eugene-kharitonov No luck for me. If I use your code_dict it crashes.
RuntimeError: cuDNN error: CUDNN_STATUS_MAPPING_ERROR
From what I can tell, the HuBERT km100 TTS file has n_symbols=101, which means 99 numbers (0-98) plus the pad token and the end-of-sentence token. If I remove one number (from either the beginning or the end), it no longer crashes, but I get gibberish.
To make it even more confusing, the CPC file has n_symbols=102, so it has one more symbol than HuBERT (and should work with your file). However, your file also has an empty newline at the end that causes a crash (not sure if that is intentional). If I remove the newline, it stops crashing, but I still get gibberish with the CPC models.
Are you certain the uploaded files are correct? I'm downloading the following files:
Acoustic: https://dl.fbaipublicfiles.com/hubert/hubert_base_ls960.pt
SPU: https://dl.fbaipublicfiles.com/textless_nlp/gslm/hubert/km100/km.bin
UTS: https://dl.fbaipublicfiles.com/textless_nlp/gslm/hubert/tts_km100/tts_checkpoint_best.pt
Waveglow: https://dl.fbaipublicfiles.com/textless_nlp/gslm/waveglow_256channels_new.pt
Here's the exact command I'm using:
PYTHONPATH=${FAIRSEQ_ROOT}:${FAIRSEQ_ROOT}/examples/textless_nlp/gslm/unit2speech python ${FAIRSEQ_ROOT}/examples/textless_nlp/gslm/tools/resynthesize_speech.py \
--feature_type hubert \
--acoustic_model_path /mnt/large/data/pretrained/hubert_base_ls960.pt \
--layer 6 \
--kmeans_model_path /mnt/large/data/pretrained/km_100.bin \
--tts_model_path /mnt/large/data/pretrained/hubert_base_km100.pt \
--code_dict_path /mnt/large/data/pretrained/code_dict_100 \
--waveglow_path /mnt/large/data/pretrained/waveglow_256channels_new.pt \
--max_decoder_steps 2000
@bradgrimm beat me to it :)
@eugene-kharitonov, thank you for your input. Unfortunately, the code either fails (when using the code_100 file you provided) or produces gibberish (when code_100 is reduced by one entry from either end).
The comment above nicely describes the same steps and struggles I had. I'll just add a few details.
Here are hashes of the downloaded hubert-100 models:
$ md5sum *.pt *.bin
badba6cc1805a58422232d7b2859d8a1 hubert_base_ls960.pt
5c0ee2869b4f483d17f37f1a41a548e0 tts_km100.pt
b507a2716bb5174904762012a5df385d waveglow_256channels_new.pt
6580f2afcddac54e8233e3ba0eba0677 km.bin
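If it helps others compare, here is a small sketch for checking the checksums locally (expected digests are the ones listed above; paths are illustrative):

```python
import hashlib

def md5sum(path, chunk_size=1 << 20):
    """Compute an md5 hex digest without loading the whole file into memory."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Expected digests from the listing above.
expected = {
    "hubert_base_ls960.pt": "badba6cc1805a58422232d7b2859d8a1",
    "tts_km100.pt": "5c0ee2869b4f483d17f37f1a41a548e0",
    "waveglow_256channels_new.pt": "b507a2716bb5174904762012a5df385d",
    "km.bin": "6580f2afcddac54e8233e3ba0eba0677",
}
```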
Also, when using the original code_100, the model clearly crashes with an index error:
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:702: indexSelectLargeIndex: block: [307,0,0], thread: [127,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
That makes sense, since after adding the padding and EOS symbols, units for the encoded sound look like this:
tensor([ 72, 13, 64, ......, 21, 101])
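That off-by-one is easy to see in isolation. A minimal sketch, with plain Python standing in for the embedding lookup (n_symbols=101 is the value reported above for the HuBERT-km100 TTS checkpoint):

```python
n_symbols = 101  # embedding rows reported for the HuBERT-km100 TTS checkpoint
units = [72, 13, 64, 21, 101]  # tail of the encoded input; EOS mapped to id 101

# Any id >= n_symbols is out of range for the embedding lookup, which is
# exactly what the CUDA `srcIndex < srcSelectDimSize` assertion guards against.
out_of_range = [u for u in units if u >= n_symbols]
print(out_of_range)  # -> [101]
```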
@asivokon & @bradgrimm - thanks for helping me debug this issue. I was hoping this was due to using a different layer, but it does seem to be something else.
Can you please try this TTS checkpoint (corresponding to HuBERT, km100, layer 6, md5 df4a9c6ffd1bb00c91405432c234aba3)? I confirm that the other files have the same checksums as those I use. The code-dict is 0..99 with no empty newline at the end. I use fairseq main at commit c203097814ed1d1dcbc5cadcc7fa89072c96f361.
For your input file the tts_input tensor is tensor([ 72, 13, 64, 9, 64, 9, 64, 9, 35, 68, 55, 41, 64, 94, 87, 10, 66, 100, 28, 48, 12, 46, 65, 81, 86, 54, 45, 19, 28, 64, 32, 37, 60, 46, 70, 66, 4, 25, 14, 59, 45, 81, 27, 43, 20, 30, 29, 42, 11, 38, 87, 45, 86, 10, 17, 4, 100, 93, 90, 79, 88, 25, 62, 38, 74, 17, 51, 88, 59, 10, 86, 20, 82, 84, 11, 21, 101])
@eugene-kharitonov, the updated checkpoint df4a9c6f works great!
Having spent over a week in futile attempts to reproduce the results, this newly generated sample sounds like heavenly music to my ears -- just can't stop listening to it! :)
Now, all the other checkpoints I tried (hubert200, hubert500, logmel-100) had the same gibberish problem. Could you please double-check whether those files (likely all the other TTS checkpoints) should be re-released as well?
Thanks a lot for an impressive piece of work!
Oh, great that it works, thanks for checking! I'll verify the rest of the checkpoints.
@eugene-kharitonov It works for me too. Thank you! Let me know when you get the Hubert 500 fixed, that's the one I'm most interested in.
@asivokon I've updated TTS checkpoints + provided code_dict files and manually verified that a few of the checkpoints work. @bradgrimm unfortunately, it seems we don't have a good Hubert500 model. As those were not used in the paper, we decided not to support the case of 500 unit models. Sorry about the confusion.
Thanks for your help!
Hi,
I am also resynthesizing audio using the same checkpoints, but I am not getting correct results; my generated audio seems truncated.
I have attached a Google Drive link to the original and generated audio. If someone knows what the issue is, please help me out.
Thanks
What is your question?
As far as I understand, examples/textless_nlp/gslm/tools/resynthesize_speech.py should take a speech sample (audio), encode it to units, and generate output speech from these units. The output speech should resemble the input sample. However, when I do this with the released pre-trained models, the output is gibberish that doesn't sound like the input at all.
I attach the samples and the steps I took. Am I doing anything wrong?
Thank you!
Code
Download pre-trained models (HuBERT-km200 in this example):
Generate the code_dict.txt file. I didn't find an "official" description of how to do this, so I used this comment. Note that with a dict of size 199 or 200, the models fail.
Download and convert a source audio sample from the speech resynthesis example site:
Run resynthesis:
echo sample.input.wav > input.txt
echo sample.out.layer5.wav >> input.txt
PYTHONPATH=${FAIRSEQ_ROOT}:${FAIRSEQ_ROOT}/examples/textless_nlp/gslm/unit2speech python ${FAIRSEQ_ROOT}/examples/textless_nlp/gslm/tools/resynthesize_speech.py \
--feature_type $TYPE \
--layer 5 \
--acoustic_model_path $DATA/hubert_base_ls960.pt \
--kmeans_model_path $DATA/km.bin \
--tts_model_path $DATA/tts_checkpoint_best.pt \
--code_dict_path $DATA/code_dict.txt \
--waveglow_path $DATA/waveglow_256channels_new.pt \
--max_decoder_steps 1000 < input.txt