NVIDIA / flowtron

Flowtron is an auto-regressive flow-based generative network for text-to-speech synthesis with control over speech variation and style transfer
https://nv-adlr.github.io/Flowtron
Apache License 2.0

Low resource language setup. #118

Open michael-conrad opened 3 years ago

michael-conrad commented 3 years ago

I want to set up a TTS system for the Cherokee language.

I'm doing this to try to preserve an endangered language by leveraging TTS to create audio lesson materials. I am not a machine learning expert or researcher; my main focus is language preservation.

The audio I have is largely single word utterances from multiple speakers with various degrees of quality.

Is there a good step-by-step guide somewhere for fine-tuning one of the published models on a low-resource language?

How would I need to arrange the data I have to fit within the existing data loader setup?
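Judging from the filelists shipped in the repo's filelists/ directory, the data loader expects pipe-separated `audio_path|text|speaker_id` lines, one utterance per line, with integer speaker ids. The Cherokee paths and words below are hypothetical placeholders:

```
/data/chr/wav/spkr01_0001.wav|ᎣᏏᏲ|0
/data/chr/wav/spkr01_0002.wav|ᏩᏯ|0
/data/chr/wav/spkr03_0001.wav|ᎠᎹ|1
```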

I have previously been trying to get https://github.com/Tomiinek/Multilingual_Text_to_Speech working but have only had partial success to date.

Any help would be greatly appreciated.

FYI: I don't have permission to release my training data publicly at this time.

deepglugs commented 3 years ago

in my experience, single-word data is hard to train on. If there is a way to combine your data into longer utterances, so much the better.
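One rough way to do that, assuming mono clips at a matching sample rate; the file names, words, and gap length here are made up:

```python
import numpy as np
import soundfile as sf

def combine_clips(paths, words, out_path, gap_s=0.3):
    """Splice several single-word clips from one speaker into a longer utterance."""
    pieces, sr = [], None
    for p in paths:
        audio, rate = sf.read(p)
        sr = sr or rate
        assert rate == sr, "all clips must share one sample rate"
        pieces.extend([audio, np.zeros(int(gap_s * sr))])  # word + short pause
    sf.write(out_path, np.concatenate(pieces[:-1]), sr)    # drop trailing pause
    return " ".join(words)  # combined transcript for the filelist line

text = combine_clips(["wav/osiyo.wav", "wav/waya.wav"], ["ᎣᏏᏲ", "ᏩᏯ"],
                     "wav/combined_0001.wav")
```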

rafaelvalle commented 3 years ago

michael, thank you for doing this for this language.

even though it's possible to train flowtron using single-word sentences, it's very unlikely that you'll be able to generate sentences with more than a single word.

michael-conrad commented 3 years ago

in my experience, single-word data is hard to train on. If there is a way to combine your data into longer utterances, so much the better.

I do have some longer utterances.

michael-conrad commented 3 years ago

michael, thank you for doing this for this language.

even though it's possible to train flowtron using single-word sentences, it's very unlikely that you'll be able to generate sentences with more than a single word.

Right now my primary goal is to create audio which matches dictionary entries and to create challenge/response audio lesson materials.

rafaelvalle commented 3 years ago

how many hours of data do you have?

deepglugs commented 3 years ago

michael, thank you for doing this for this language.

even though it's possible to train flowtron using single-word sentences, it's very unlikely that you'll be able to generate sentences with more than a single word.

I may be wrong, but I seem to get an error here whenever I have batch size > 1 and what I assume to be a single-word utterance: flowtron.py:436

curr_x = F.dropout(
                        F.relu(conv(curr_x)),
                        0.5,
                        self.training)
  File "/home/kev/ai/src/flowtron_vanilla/flowtron.py", line 437, in forward
    F.relu(conv(curr_x)),
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/container.py", line 117, in forward
    input = module(input)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/instancenorm.py", line 55, in forward
    return F.instance_norm(
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/functional.py", line 2077, in instance_norm
    _verify_batch_size(input.size())
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/functional.py", line 2037, in _verify_batch_size
    raise ValueError('Expected more than 1 value per channel when training, got input size {}'.format(size))
ValueError: Expected more than 1 value per channel when training, got input size torch.Size([1, 512, 1])

batch_size = 1 seems to be fine, but it looks like there's a separate code path for that.
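For what it's worth, the failing check can be reproduced outside Flowtron. My reading (an assumption, not a confirmed diagnosis) is that a very short utterance reaches the instance-norm layers as a tensor with a single time step:

```python
import torch
import torch.nn as nn

norm = nn.InstanceNorm1d(512)
norm.train()
# A length-1 sequence, as in the traceback's torch.Size([1, 512, 1]):
norm(torch.randn(1, 512, 1))
# ValueError: Expected more than 1 value per channel when training, ...
# (newer PyTorch versions word this as "more than 1 spatial element")
```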

michael-conrad commented 3 years ago

how many hours of data do you have?

Not really sure at the moment. I've neglected to have my processing scripts save that out as a stats file for review. I'm in the middle of a training cycle, and I'm also reprocessing the audio I have for noise reduction (using the "denoiser" Python package from PyPI) along with transcription updates, so I can't give an accurate guess beyond maybe 2 to 3 hours. I've recently obtained additional audio, but I haven't processed it yet for inclusion in the training data.

Something else I'm doing as part of the reprocessing is identifying where some of my Cherokee audio contains English from the same speaker, and adding that material to the English side with the same voice tag. My understanding is that this additional data should dramatically help the model disentangle voice vs. language and result in a more robust "voice cloning" model.

One mandatory goal is being able to have voices that aren't the original voices in the data speaking the language; public use rights for likeness and all that.
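Once the filelists exist, a quick way to tally total hours is to sum clip durations over them; the filelist path here is hypothetical:

```python
import soundfile as sf

total_s = 0.0
with open("filelists/chr_train_filelist.txt", encoding="utf-8") as f:
    for line in f:
        wav_path = line.split("|")[0]  # Flowtron-style path|text|speaker lines
        info = sf.info(wav_path)
        total_s += info.frames / info.samplerate
print(f"{total_s / 3600:.2f} hours")
```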

Right now I'm training a two-language model (https://github.com/CherokeeLanguage/Cherokee-TTS/).

This is being done on a GTX 1070. I've put myself on several waiting lists for an RTX 3090, but no luck so far.

Sample results

The best model I've gotten working so far has actually produced very useful audio, though each utterance it generates has to be checked manually for audio quality and correctness. See https://www.cherokeelessons.com/content/cherokee-animal-names-tts-demo-audio-1/.

If I understand things right, the flowtron setup is less likely to produce a model with unwanted skips and repeats?

Some basic stats

Max utterance length: 10 seconds.

Utterance counts:

en: 8,694
chr: 2,589

Utterance counts by speaker:

Speaker 01-chr: 267
Speaker 01-m-wwacc: 147
Speaker 02-chr: 272
Speaker 03-chr: 829
Speaker 04-chr: 124
Speaker 05-chr: 53
Speaker 08-chr: 49
Speaker 09-chr: 197
Speaker 294-en-f: 422
Speaker 297-en-f: 416
Speaker 299-en-f: 405
Speaker 300-en-f: 398
Speaker 301-en-f: 411
Speaker 305-en-f: 421
Speaker 306-en-f: 349
Speaker 308-en-f: 423
Speaker 310-en-f: 422
Speaker 311-en-m: 423
Speaker 318-en-f: 421
Speaker 329-en-f: 423
Speaker 330-en-f: 422
Speaker 333-en-f: 421
Speaker 334-en-m: 423
Speaker 339-en-f: 422
Speaker 341-en-f: 407
Speaker 345-en-m: 397
Speaker 360-en-m: 423
Speaker 361-en-f: 423
Speaker 362-en-f: 422
Speaker cno-f-chr_1: 6
Speaker cno-f-chr_2: 325
Speaker cno-f-chr_3: 58
Speaker cno-f-chr_5: 85
Speaker cno-m-chr_1: 88
Speaker cno-m-chr_2: 208

The English audio is from American speakers in the CSTR VCTK dataset.

rafaelvalle commented 3 years ago

@michael-conrad you can fine-tune your model with the recently added alignment framework. it should improve the attention mechanism.
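Fine-tuning here typically starts from one of the published checkpoints, roughly as in the repo README; the checkpoint path below is a placeholder:

```
python train.py -c config.json -p train_config.checkpoint_path="models/flowtron_ljs.pt"
```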

michael-conrad commented 3 years ago

@michael-conrad you can fine-tune your model with the recently added alignment framework. it should improve the attention mechanism.

alignment framework?