NVIDIA / DeepLearningExamples

State-of-the-Art Deep Learning scripts organized by models - easy to train and deploy with reproducible accuracy and performance on enterprise-grade infrastructure.

fastpitch notebooks not producing audible speech #956

Closed sciai-ai closed 3 years ago

sciai-ai commented 3 years ago

Hi, the notebook demo for FastPitch is not working; the audio generated is pure noise. Can you please check?

Thank you

alancucki commented 3 years ago

Hi @sciai-ai ,

thanks for reporting this! I've just checked both notebooks, and the paths to the checkpoints need to be updated:

fastp = '../pretrained_models/nvidia_fastpitch_200518.pt'
waveg = '../pretrained_models/nvidia_waveglow256pyt_fp16.pt'

Other than that, the code works well.

Could you please check again with those checkpoints? If you're getting noise, chances are that an unconverged model is being loaded instead.
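To make the stale-path failure mode fail fast, a small check before loading helps. This is a minimal sketch; the `resolve_checkpoint` helper is hypothetical and not part of the notebooks or the repo:

```python
from pathlib import Path

def resolve_checkpoint(path_str):
    """Validate a checkpoint path before loading, so a wrong path raises
    immediately instead of silently falling back to an unconverged model."""
    path = Path(path_str)
    if not path.is_file():
        raise FileNotFoundError(f"Checkpoint not found: {path.resolve()}")
    return path

# Corrected paths from the comment above.
fastp = '../pretrained_models/nvidia_fastpitch_200518.pt'
waveg = '../pretrained_models/nvidia_waveglow256pyt_fp16.pt'
```

One would then call `resolve_checkpoint(fastp)` before handing the path to `torch.load`.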

sciai-ai commented 3 years ago

Thanks @alancucki I managed to get it working with the pretrained models.

I also trained a Tacotron 2 model followed by a FastPitch model on my own dataset (>50 hrs). The speech from both Tacotron 2 and FastPitch (1500 epochs each) is audible; however, several words in each sentence have poor quality. Any ideas on how I can improve this?

On a side note, I used ESPnet on the same custom dataset before, and its Tacotron model gives much better results. However, I want the customizations offered by FastPitch :)

alancucki commented 3 years ago

Great to hear that!

What kind of quality issues are you experiencing? Slurred speech points to poor alignments, mispronunciations to poor grapheme generalization, and small artifacts might come from WaveGlow.

We're about to update FastPitch with an alignment mechanism that removes the dependency on Tacotron 2 and delivers better quality. Stay tuned!

sciai-ai commented 3 years ago

I think it's the case of poor alignments.

Any idea when we can expect the new feature? Also, would phoneme-based training be possible too?

alancucki commented 3 years ago

Yes, we plan to support both phonemes and graphemes. This should be online by the end of June.

sciai-ai commented 3 years ago

That's great @alancucki. Have you checked inference performance on longer texts, such as paragraphs with more than 3K characters? And is CPU inference supported?

alancucki commented 3 years ago

CPU inference is supported by inference.py (just omit the --cuda flag).

For longer paragraphs, it's better to split them by sentence -- training samples are limited in duration due to memory constraints, so the model is unlikely to generalize to much longer inputs.
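The split-by-sentence advice above can be sketched as follows. This is a minimal illustration using a naive regex split; for production text, a proper sentence tokenizer (e.g. NLTK's `sent_tokenize`) is more robust:

```python
import re

def split_into_sentences(paragraph):
    """Split a long paragraph into sentences before synthesis, so each
    input stays within the duration range seen during training."""
    # Split on sentence-ending punctuation followed by whitespace.
    sentences = re.split(r'(?<=[.!?])\s+', paragraph.strip())
    return [s for s in sentences if s]

text = ("FastPitch generalizes poorly to very long inputs. "
        "Splitting by sentence keeps each input short. "
        "The synthesized audio can be concatenated afterwards.")
for sent in split_into_sentences(text):
    print(sent)
```

Each sentence would then be synthesized separately and the resulting waveforms concatenated.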

DanRuta commented 3 years ago

Yes, we plan to support both phonemes and graphemes. This should be on-line by the end of June.

Hello, do you have any ETA for the update? I'm eager and very excited to take this for a spin when it's ready! The built-in alignment mechanism sounds amazing.

alancucki commented 3 years ago

Hey @DanRuta , the model and the recipe for LJ are ready - now we're just updating the performance measurements. Stay tuned :)

EmElleE commented 3 years ago

@alancucki did you ever get around to finishing the phoneme and grapheme work? If so, when can we expect it to come out? And will it be considered FastPitch 2?