NVIDIA / flowtron

Flowtron is an auto-regressive flow-based generative network for text to speech synthesis with control over speech variation and style transfer
https://nv-adlr.github.io/Flowtron
Apache License 2.0

How would one keep the model loaded for immediate synthesis? #143

Closed · Jcwscience closed this issue 2 years ago

Jcwscience commented 2 years ago

I am trying to use the inference script as a base for my own script. It should load the model and act as a server, so that when I send it a text item the audio is generated and played in real time. The problem is that I am getting confused about where to split the script between one-time startup and the while loop.

In the snippet below, where can I split things so that the text input handling moves into a while loop?

# Load Flowtron
# Load Waveglow

ignore_keys = ['training_files', 'validation_files']
trainset = Data(
    data_config['training_files'],
    **dict((k, v) for k, v in data_config.items() if k not in ignore_keys))
speaker_vecs = trainset.get_speaker_id(speaker_id).cuda()
text = trainset.get_text(text).cuda()
speaker_vecs = speaker_vecs[None]
text = text[None]

# Do the actual inference
# Do something with the audio data

I know inference needs to run each time new text is received, and obviously the audio has to be played as well, while the Flowtron and Waveglow models only need to be loaded once. So what is the code in the snippet doing, and how can I divide it up?

Bahm9919 commented 2 years ago

Did you solve it? How is it going?

Jcwscience commented 2 years ago

@Bahm9919 More or less. I realized the speaker vecs were what was taking most of the time, so I just moved the two lines that handle the text into the while loop, and inference is now almost instantaneous.
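
For reference, a minimal sketch of that split, assuming the snippet above comes from the repo's inference script: the model loading and the Data/speaker setup run once at startup, and only the text-dependent lines run inside the loop. load_flowtron, load_waveglow, receive_text, and play_audio are placeholder names, n_frames and sigma come from the script's arguments, and the infer() calls follow the pattern in the inference demo, so check them against your copy.

import torch

# ----- startup: runs once -----
flowtron = load_flowtron(flowtron_checkpoint)   # placeholder: load the Flowtron checkpoint
waveglow = load_waveglow(waveglow_checkpoint)   # placeholder: load the WaveGlow checkpoint

ignore_keys = ['training_files', 'validation_files']
trainset = Data(
    data_config['training_files'],
    **dict((k, v) for k, v in data_config.items() if k not in ignore_keys))
# fixed speaker, so the speaker vector only needs to be computed once
speaker_vecs = trainset.get_speaker_id(speaker_id).cuda()[None]

# ----- serving loop: runs for every incoming text item -----
while True:
    raw_text = receive_text()                   # placeholder: however the server receives input
    text = trainset.get_text(raw_text).cuda()[None]
    with torch.no_grad():
        residual = torch.cuda.FloatTensor(1, 80, n_frames).normal_() * sigma
        mels, _ = flowtron.infer(residual, speaker_vecs, text)
        audio = waveglow.infer(mels, sigma=0.8)
    play_audio(audio[0].cpu().numpy())          # placeholder: play or stream the waveform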

Jcwscience commented 2 years ago

@Bahm9919 I realized the answer only a few minutes after I made the post, although now I am having a different difficulty. I am trying to play the audio live using Python sounddevice, but the output array only plays correctly at 11 kHz, and even then it is badly distorted. I thought the standard sample rate was 22050 Hz? Writing the array to a WAV file works fine, though.
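
For the playback issue, a minimal sketch of feeding the WaveGlow output to sounddevice at 22050 Hz; it assumes the distortion comes from handing sd.play a half-precision, batched GPU tensor rather than a 1-D float32 array (the tensor name is a placeholder):

import numpy as np
import sounddevice as sd

SAMPLE_RATE = 22050  # the pretrained Flowtron/WaveGlow models use 22050 Hz audio

def play(audio_tensor):
    # audio_tensor: WaveGlow output, shape (1, n_samples), possibly half precision on the GPU
    audio = audio_tensor.squeeze().float().cpu().numpy()  # 1-D float32 array on the host
    peak = np.abs(audio).max()
    if peak > 1.0:
        audio = audio / peak                               # keep samples in [-1, 1] for sounddevice
    sd.play(audio, samplerate=SAMPLE_RATE, blocking=True)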

Bahm9919 commented 2 years ago

> @Bahm9919 More or less. I realized the speaker vecs were what was taking most of the time, so I just moved the two lines that handle the text into the while loop, and inference is now almost instantaneous.

How about synthesis with your voice? Did you get good results?

Jcwscience commented 2 years ago

Not exactly. Every attempt I made with the LJS model had a stuttering problem, so I tried the LibriTTS model with a speaker ID of 0. At about 10 or so epochs it starts to sound like me, but letting it continue training results in a degradation into static or screaming within another 10 or 15 epochs. It's a little baffling.

Bahm9919 commented 2 years ago

> @Bahm9919 I realized the answer only a few minutes after I made the post, although now I am having a different difficulty. I am trying to play the audio live using Python sounddevice, but the output array only plays correctly at 11 kHz, and even then it is badly distorted. I thought the standard sample rate was 22050 Hz? Writing the array to a WAV file works fine, though.

Yes. But I'm not doing that yet. I've been working with this project for three months, training and trying to get good results, and I only got them today. Now I will try to do something like what you are doing.

Bahm9919 commented 2 years ago

> Not exactly. Every attempt I made with the LJS model had a stuttering problem, so I tried the LibriTTS model with a speaker ID of 0. At about 10 or so epochs it starts to sound like me, but letting it continue training results in a degradation into static or screaming within another 10 or 15 epochs. It's a little baffling.

How many flows did you use? Did you train the n_flows=2 model? Did you train with cumulative attention?

Jcwscience commented 2 years ago

@Bahm9919

server2.zip

This is the continuously running script, if it helps.

As for the training... to be honest, I'm just winging it based on the README file. I still barely understand how the whole system works, let alone the specific parameters.

Bahm9919 commented 2 years ago

I've got good results with these steps.

Jcwscience commented 2 years ago

Ok I'll give that a try!

Bahm9919 commented 2 years ago

> Ok I'll give that a try!

The fine-tuning described in the README didn't work for me. So after trying everything, I trained the voice I wanted from scratch, and it gave me better results. However, you may need more data; I had 1.5 hours.

Jcwscience commented 2 years ago

What are the specs of your machine? I am rather limited in processing power at the moment.

Bahm9919 commented 2 years ago

> What are the specs of your machine? I am rather limited in processing power at the moment.

I'm using Google Colab Pro :)

Jcwscience commented 2 years ago

Ahhhhh, I had forgotten about the Pro account. If I can put together enough training data, that might just work! Would you happen to have any Colab files you could share?

Bahm9919 commented 2 years ago

> Ahhhhh, I had forgotten about the Pro account. If I can put together enough training data, that might just work! Would you happen to have any Colab files you could share?

I would like to share, but I don't have any. You mentioned before that you don't understand how these scripts work, but I'll tell you more: I don't understand programming at all. I'm just using the inference demo script for the output.

Jcwscience commented 2 years ago

OK, thanks for the advice anyway! I'm sure I can work it out if I spend enough time on it. And if I get the live inference script cleaned up and working, I'll let you know.

Bahm9919 commented 2 years ago

> OK, thanks for the advice anyway! I'm sure I can work it out if I spend enough time on it. And if I get the live inference script cleaned up and working, I'll let you know.

Thanks for the script, I understand your passion. You can write to me on Telegram (@harmonicas) if you need help.