DigitalPhonetics / IMS-Toucan

Controllable and fast Text-to-Speech for over 7000 languages!
Apache License 2.0

Batch Size #8

Closed: michael-conrad closed this issue 2 years ago

michael-conrad commented 2 years ago

Is it feasible to train the vocoder with a batch size of 6? I have a laptop with a GPU that has 8 GB of RAM, and batch size 6 currently shows about 7 GB in use.

Flux9665 commented 2 years ago

I honestly have no idea; I never tried such small batch sizes. Considering that spectrogram inversion is a pretty well-defined task, the gradient estimate should be ok with few samples. I'm mostly worried about the super-resolution part, since that adds some complexity.

You should be able to see pretty quickly whether it works, though; there's no need to train until full convergence. Once the Mel loss goes below 10, the perceptual quality is about 95% of the way there, so you can usually stop after around 80k steps.
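For what it's worth, if a batch of 6 does turn out to be too noisy, gradient accumulation is a generic PyTorch workaround that trades extra steps for a larger effective batch without using more GPU memory. This is only a minimal sketch of the pattern with dummy stand-ins for the model, optimizer and data, not this toolkit's train loop; a real GAN setup with a discriminator needs more care:

```python
import torch

# Dummy stand-ins so the sketch runs on its own; not the HiFiGAN generator.
model = torch.nn.Linear(80, 80)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4)
accumulation_steps = 4          # effective batch = 6 * 4 = 24

optimizer.zero_grad()
for step in range(100):
    batch = torch.randn(6, 80)  # a "batch size 6" worth of dummy features
    loss = (model(batch) - batch).abs().mean() / accumulation_steps
    loss.backward()             # gradients add up across the small batches
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()        # one optimizer step per accumulated macro-batch
        optimizer.zero_grad()
```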

michael-conrad commented 2 years ago

Well... everything was going well until I got this:

RuntimeError: stack expects each tensor to be equal size, but got [24576] at entry 0 and [24569] at entry 3

My modified dataloader is here: https://github.com/CherokeeLanguage/IMS-Toucan/blob/85996090ee729f32fbc1372f6a5eb268df4b8da6/TrainingInterfaces/Spectrogram_to_Wave/HiFIGAN/HiFiGANDataset_disk.py

Any suggestions as to what to hunt for?

Flux9665 commented 2 years ago

24576 is the size that one spectrogram frame gets expanded to. I suspect that when you downsample and upsample again to get the noisy data, you sometimes don't end up with the original number of frames.

Flux9665 commented 2 years ago

Adding noise to the spectrogram and then removing it during vocoding is a cool idea; I've had it on my list of things to try for a while. There are also some speech enhancement losses that I planned on adding, but I haven't had the time to properly test them and check for improvements.

If you want to try it out, here are some suggestions:

Let me know if you try any of this and find good settings.
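For reference, here is a rough sketch of one way the corrupted input could be produced, assuming the noisy version is made by resampling the waveform down and back up before computing its spectrogram. torchaudio is assumed, and the sampling rate and the 4 kHz / 8 kHz choices are placeholders; this is not the code in the repository:

```python
import random
import torch
import torchaudio

def degrade(wave: torch.Tensor, sr: int = 24000) -> torch.Tensor:
    """Band-limit the waveform by resampling it down and back up again."""
    low_sr = random.choice([4000, 8000])                       # telephone/tape-like bandwidth
    down = torchaudio.functional.resample(wave, sr, low_sr)
    restored = torchaudio.functional.resample(down, low_sr, sr)
    # The round trip can change the sample count slightly, which is exactly the
    # kind of length mismatch behind the stack error above, so trim to be safe.
    return restored[..., : wave.shape[-1]]
```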

Flux9665 commented 2 years ago

Actually, I just pushed some changes to the HiFiGAN train loop. It should run faster now and the losses should be less unbalanced. I also added the additional losses I mentioned; they can be used optionally and are turned off by default for now. Adding distortion in the dataset's getitem is not done yet, but I think I might as well do that quickly. So if you want to try it all out, please tell me how it goes.

Flux9665 commented 2 years ago

Random corruption is added to the dataset; it's off by default.

The code is completely untested and might even throw errors; I will test it later today, but I wanted to make this available to you quickly. If you end up trying any of this, let me know how it goes for you.

michael-conrad commented 2 years ago

24576 is the size that one spectrogram frame gets expanded to. I suspect that when you downsample and upsample again to get the noisy data, you sometimes don't end up with the original number of frames.

Ah. That makes sense. I'll need to add a check for that and adjust the max segment offset based on the smaller of the two mel spectrogram arrays.
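Something along these lines, perhaps (a minimal sketch of that kind of check, not the actual dataset code; the segment length in frames is a placeholder, and both mels are assumed to have at least that many frames):

```python
import random
import torch

def crop_pair(clean_mel: torch.Tensor, noisy_mel: torch.Tensor, segment_frames: int = 96):
    """Pick one random training segment, bounded by the shorter of the two mels,
    and crop both to the same number of frames so torch.stack sees equal sizes."""
    usable = min(clean_mel.shape[-1], noisy_mel.shape[-1])
    start = random.randint(0, max(0, usable - segment_frames))
    return (clean_mel[..., start:start + segment_frames],
            noisy_mel[..., start:start + segment_frames])
```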

michael-conrad commented 2 years ago

Actually, I just pushed some changes to the HiFiGAN train loop. It should run faster now and the losses should be less unbalanced. I also added the additional losses I mentioned; they can be used optionally and are turned off by default for now. Adding distortion in the dataset's getitem is not done yet, but I think I might as well do that quickly. So if you want to try it all out, please tell me how it goes.

Haven't had time to look it over yet. My noise is currently domain-specific and is extracted from my source audio. My Cherokee data is generally of low quality; some of it comes from tape, which is why I'm using the 4 kHz and 8 kHz resampling as a source of noise. And I'm hoping that I can get something at least acceptably decent from simply inverting some wax cylinder audio that contains Wyandot from one of the last native speakers. There is an effort to revitalize the language.
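In case it's useful, one generic way to mix extracted noise into clean audio at a random signal-to-noise ratio looks roughly like this (a sketch only, not code from the linked dataset; it assumes the noise clip is at least as long as the clean one, and the SNR range is a placeholder):

```python
import torch

def add_noise(clean: torch.Tensor, noise: torch.Tensor, snr_db_range=(5.0, 20.0)) -> torch.Tensor:
    """Scale a recorded noise snippet to a random SNR and add it to the clean wave."""
    noise = noise[..., : clean.shape[-1]]              # trim the noise to the clean length
    snr_db = torch.empty(1).uniform_(*snr_db_range)
    clean_power = clean.pow(2).mean()
    noise_power = noise.pow(2).mean().clamp(min=1e-8)
    scale = torch.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * noise
```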

Flux9665 commented 2 years ago

Sounds like a really cool project. https://github.com/haoheliu/voicefixer could maybe also help with that.

I assume the vocoder training will go relatively smoothly, although I found since yesterday that adding too much noise to the input and then learning to reverse it can cause an already well-trained model to diverge.

FYI, when you want to generate spectrograms, pretraining on larger datasets and then fine-tuning on the low-resource language will probably be the best option. The next version of the toolkit will contain a way to extract durations for FastSpeech that should work on very little tuning data, so there's no more need for knowledge distillation with Tacotron to get to a FastSpeech model. FastSpeech handles cross-lingual fine-tuning much better in my experience. The input to the models will be incompatible with your feature representation, but that doesn't matter, since the new duration extraction has the same interface as the Tacotron duration extraction. So you can use the new method to get durations, and once the dataset cache is written, you can use that cache with the old FastSpeech model.

michael-conrad commented 2 years ago

Have you looked at unsupervised duration modeling, as in One TTS Alignment To Rule Them All (Badlani et al., 2021)?

https://github.com/keonlee9420/Comprehensive-Transformer-TTS#unsupervised-duration-modelings

Flux9665 commented 2 years ago

Yes, I tried it extensively, but no matter what I did, I couldn't get it to work. I tried it with the double prior they have in the paper, with the guided attention loss that is in the toolkit anyway, and with a bunch of schedules for the binarization loss and the alignment loss. I still use their monotonic alignment search to get cleaner durations, but either I did something completely wrong, or it only works on super clean data.
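For reference, the dynamic program behind monotonic alignment search is small; a plain-NumPy sketch following the Glow-TTS formulation (not the toolkit's implementation) looks roughly like this:

```python
import numpy as np

def monotonic_alignment_search(log_p: np.ndarray) -> np.ndarray:
    """Given a [text_len, mel_len] log-likelihood matrix, return per-symbol
    durations for the monotonic path with the highest total score."""
    t_text, t_mel = log_p.shape
    Q = np.full((t_text, t_mel), -np.inf)
    Q[0, 0] = log_p[0, 0]
    for j in range(1, t_mel):
        for i in range(min(j + 1, t_text)):               # symbol i needs at least i earlier frames
            stay = Q[i, j - 1]                             # keep the current symbol
            move = Q[i - 1, j - 1] if i > 0 else -np.inf   # advance to the next symbol
            Q[i, j] = log_p[i, j] + max(stay, move)
    # Backtrack from the bottom-right corner to recover the best path.
    path = np.zeros(t_mel, dtype=np.int64)
    i = t_text - 1
    for j in range(t_mel - 1, -1, -1):
        path[j] = i
        if j > 0 and i > 0 and Q[i - 1, j - 1] >= Q[i, j - 1]:
            i -= 1
    return np.bincount(path, minlength=t_text)             # frames assigned to each symbol
```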

Anyway, I found a different solution that should be better suited to fine-tuning models on little data, which is a key point in my research agenda. It's basically a tiny aligner model that's trained with CTC and integrates super easily into the workflow, inspired by this paper: https://arxiv.org/abs/2110.15792
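Sketched very roughly, such a CTC-trained aligner could look like the following. The layer sizes, phoneme inventory and dummy batch are placeholders and this is not the actual toolkit code; it only shows the general recipe of frame-wise phoneme posteriors trained with PyTorch's built-in CTC loss:

```python
import torch

class TinyAligner(torch.nn.Module):
    """Small convolutional encoder that predicts per-frame phoneme posteriors."""
    def __init__(self, n_mels: int = 80, n_phones: int = 100, hidden: int = 256):
        super().__init__()
        self.encoder = torch.nn.Sequential(
            torch.nn.Conv1d(n_mels, hidden, kernel_size=3, padding=1),
            torch.nn.ReLU(),
            torch.nn.Conv1d(hidden, hidden, kernel_size=3, padding=1),
            torch.nn.ReLU(),
        )
        self.proj = torch.nn.Linear(hidden, n_phones + 1)   # +1 class for the CTC blank

    def forward(self, mel):                                  # mel: [batch, n_mels, frames]
        h = self.encoder(mel).transpose(1, 2)                # [batch, frames, hidden]
        return self.proj(h).log_softmax(dim=-1)              # per-frame log-probs

aligner = TinyAligner()
ctc = torch.nn.CTCLoss(blank=100, zero_infinity=True)

mel = torch.randn(2, 80, 500)                                # dummy spectrogram batch
phones = torch.randint(0, 100, (2, 40))                      # dummy phoneme ID sequences
log_probs = aligner(mel).transpose(0, 1)                     # CTC expects [frames, batch, classes]
loss = ctc(log_probs, phones,
           input_lengths=torch.full((2,), 500),
           target_lengths=torch.full((2,), 40))
loss.backward()
```

Durations per phoneme can then be read off the posteriors, e.g. via a monotonic forced alignment over the per-frame log-probabilities.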