Alright, thank you, this will help a lot. I will look into pyloudnorm for normalizing everything.
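In case it helps anyone else who finds this thread, here is the kind of minimal pyloudnorm sketch I'm planning to start from. The file paths and the -23 LUFS target are just placeholders I'd adjust for my own setup:

```python
import soundfile as sf
import pyloudnorm as pyln

# hypothetical paths; replace with your own corpus layout
in_path = "speaker1/clip_001.wav"
out_path = "speaker1_norm/clip_001.wav"

data, rate = sf.read(in_path)

# measure integrated loudness (ITU-R BS.1770) of the clip
meter = pyln.Meter(rate)
loudness = meter.integrated_loudness(data)

# normalize every clip to a common target, e.g. -23 LUFS (placeholder value)
normalized = pyln.normalize.loudness(data, loudness, -23.0)

sf.write(out_path, normalized, rate)
```

I'd run this over every clip for both speakers so the whole corpus ends up at the same loudness; pyloudnorm will warn if a clip would clip after normalization, which is a sign the target is too hot for that recording.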
One more quick question I have is, is the network able to convert breathing sounds, coughs, voice cracks and other oddities that come up when talking? Would it be worth experimenting with that in the training data, or should I keep things as clean as possible?
I have not tested these situations, but I am worried that these non-speech parts may cause unnatural-sounding voices, so I would suggest keeping the training samples clean.
Alright, I will do my best to get clean takes then. Thank you for all your help.
I've gotten DYGANVC to train and run inference properly on the VCC2020 dataset, and I'm now recording audio to put together my own two-speaker dataset (myself and a friend) to train on.
I think I understand the dataset specification I need to follow (and I've read #6), but I'm not 100% sure what should be in the audio files themselves to produce the best results:
1) Do both speakers need to say the same transcript for the training to work properly? If it's not necessary, does it still help or does it not matter?
2) Does it matter how much silence is in the audio files? If a person stops speaking for one second or so in the middle of the WAV file, will that confuse the training?
3) Should the lengths of the audio files be relatively consistent? If most of the WAVs in my corpus end up being 1 to 5 seconds long, but I have one rambling 15-second sentence, should I chop it into multiple clips or leave it as is? (I've put a rough splitting sketch at the end of this post in case chopping is the way to go.)
4) Is there any benefit to expanding the corpus to more speakers even though I only need to convert between two of them? Or does that just add confounding variables?
5) If some of the WAV files have audible background noise while the speakers talk, does that interfere with training? (i.e., would it train the algorithm to be more resilient to background sounds, or would it just start mistaking those sounds for speech?)
6) What do you think is the minimum number of minutes per speaker that could still produce mostly passable results? And at what point do diminishing returns set in? (i.e., do you think there would be a significant quality improvement from having 30 minutes per speaker over 10 or 15 minutes?)
7) Do I need to normalize my WAV training files to the same volume or does the algorithm handle that well?
Sorry for all the questions. Even if you can only answer some, it would be a great help, and hopefully other people will find it useful as well. Thank you.
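For questions 2 and 3, this is roughly the librosa-based trimming/splitting I'd try if silence and long takes turn out to be a problem. The sample rate, the top_db threshold, and the file names are just guesses I'd tune, not anything DYGANVC requires:

```python
import librosa
import soundfile as sf

# hypothetical input: one long rambling take
# sr=24000 is an assumption; use whatever rate the rest of your corpus is in
y, sr = librosa.load("long_take.wav", sr=24000)

# drop leading/trailing silence (top_db=30 is a guess; raise/lower to taste)
y_trimmed, _ = librosa.effects.trim(y, top_db=30)

# find the non-silent regions, i.e. split on internal pauses
intervals = librosa.effects.split(y_trimmed, top_db=30)

# write each non-silent chunk out as its own clip
for i, (start, end) in enumerate(intervals):
    sf.write(f"long_take_part{i:02d}.wav", y_trimmed[start:end], sr)
```

I'd spot-check the resulting chunks by ear before adding them to the corpus, since an aggressive top_db can clip word onsets.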