facebookresearch / speech-resynthesis

An official reimplementation of the method described in the INTERSPEECH 2021 paper - Speech Resynthesis from Discrete Disentangled Self-Supervised Representations.

Any pretrained models available? #1

Open MaxGodTier opened 3 years ago

MaxGodTier commented 3 years ago

Is there any chance you could release your pretrained models for evaluation purposes? I'd like to make a few comparisons before training. Thank you!

adiyoss commented 3 years ago

Hi @MaxGodTier, we are working on it! We will post an update on this issue soon.

ahazeemi commented 3 years ago

Hi @adiyoss, Thank you for working on it. Can we get an update on this? Thanks!

apanasyuk commented 3 years ago

I recently read the article, and this direction is very promising. I'd like to listen to my own examples, especially with HuBERT. I would also be grateful if a pretrained model were made available.

adiyoss commented 2 years ago

Hi, unfortunately we are unable to release pre-trained models. We would be happy to assist with any issue regarding the published code or training the models. It should be pretty straightforward; please let us know if you run into any issues.
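In short, there are two training stages followed by inference. Roughly speaking, it looks like the commands below; the exact flags and the config name for each unit type are listed in the README, and the paths here are only examples, so please double-check them there:

```bash
# Stage 1: train the F0 VQ-VAE quantizer (config name is illustrative; see the README)
python train_f0_vq.py --checkpoint_path checkpoints/lj_f0_vq --config configs/LJSpeech/f0_vqvae.json

# Stage 2: train the resynthesis model for your unit type (CPC-100 units shown as an example)
python train.py --checkpoint_path checkpoints/lj_cpc100 --config configs/LJSpeech/cpc100_lut.json

# Generate samples from a trained checkpoint
python inference.py --checkpoint_file checkpoints/lj_cpc100 -n 10 --output_dir generations
```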

aereobert commented 2 years ago

Hi Yossi,

Thanks for your reply! We will try to reproduce the results with the released code.

While trying to reproduce the results, we did encounter some problems and have some questions about training CPC100+LJSpeech. We would be grateful for any suggestions or insights.

  1. The paper proposes that a speaker embedding is fed into the HiFi-GAN while synthesising the audio, but in the dataset folder we only see the discrete codes for CPC/HuBERT/VQ-VAE. Is this embedding added somewhere else? (Our current understanding is sketched after this list.)

  2. We followed the guide in the README and tried to train the model with CPC100+LJSpeech. The F0 model converged around epoch 9 and we stopped training at epoch 29 (Gen Loss Total: 0.043, s/b: 0.1409); the resynthesis model converged around epoch 6 and we stopped training at epoch 33 (Gen Loss Total: 37.948, Mel-Spec. Error: 0.601, s/b: 1.359). We then used inference.py to generate audio samples, and they sound very bad (attached in the mail). Do you have any suggestions?

  3. We are still curious about how the codes are generated. If we would like to generate discrete CPC codes for a specific utterance, should we directly use the output of fairseq/examples/textless_nlp/gslm/speech2unit (https://github.com/pytorch/fairseq/tree/main/examples/textless_nlp/gslm/speech2unit)? (The command we are looking at is sketched below.)
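For question 1, to make our current understanding concrete, here is our own minimal, hypothetical sketch of how a look-up-table speaker embedding could be concatenated with the unit embeddings before the generator. This is not code from this repo, and the class and parameter names are made up for illustration:

```python
import torch
import torch.nn as nn

class CodeToFeatures(nn.Module):
    """Hypothetical sketch: embed discrete units and append a speaker embedding."""

    def __init__(self, num_units=100, unit_dim=128, num_speakers=1, spk_dim=128):
        super().__init__()
        self.unit_emb = nn.Embedding(num_units, unit_dim)   # discrete CPC/HuBERT/VQ-VAE codes
        self.spk_emb = nn.Embedding(num_speakers, spk_dim)  # look-up-table speaker embedding

    def forward(self, codes, speaker_id):
        # codes: (batch, time) integer unit IDs; speaker_id: (batch,) integer speaker IDs
        units = self.unit_emb(codes)                 # (batch, time, unit_dim)
        spk = self.spk_emb(speaker_id).unsqueeze(1)  # (batch, 1, spk_dim)
        spk = spk.expand(-1, units.size(1), -1)      # repeat along the time axis
        return torch.cat([units, spk], dim=-1)       # input features for the generator
```

If this is how it works, then for single-speaker LJSpeech the speaker embedding would just be one constant vector, which might explain why only the discrete codes appear in the dataset folder.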
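For question 3, the command we are considering from the fairseq speech2unit example looks roughly like the following; the flag names are from our reading of that README and should be double-checked there, and all paths are placeholders:

```bash
# Quantize speech into discrete units with a pretrained acoustic model + k-means codebook
# (paths are placeholders; --layer mainly matters for HuBERT features)
python examples/textless_nlp/gslm/speech2unit/clustering/quantize_with_kmeans.py \
    --feature_type cpc \
    --acoustic_model_path /path/to/cpc_checkpoint.pt \
    --layer 2 \
    --kmeans_model_path /path/to/km100.bin \
    --manifest_path /path/to/manifest.tsv \
    --out_quantized_file_path /path/to/quantized_units.txt \
    --extension ".wav"
```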

Thank you!

Best,

yy524 commented 2 years ago

@adiyoss, I tried training the F0 model, but the loss is abnormal and barely decreases. Would you please share the loss values of your F0 model? Thank you very much.

stayforapple commented 2 years ago

> Hi, unfortunately we are unable to release pre-trained models. We would be happy to assist with any issue regarding the published code or training the models. It should be pretty straightforward; please let us know if you run into any issues.

Why can't you release the pre-trained models? What a pity!