Alright, thank you, this will help a lot. I will look into pyloudnorm for normalizing everything.
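In case it helps anyone else who finds this thread, here is the kind of minimal pyloudnorm sketch I'm planning to start from. The file paths and the -23 LUFS target are just placeholders I'd adjust for my own setup:

```python
import soundfile as sf
import pyloudnorm as pyln

# hypothetical paths; replace with your own corpus layout
in_path = "speaker1/clip_001.wav"
out_path = "speaker1_norm/clip_001.wav"

data, rate = sf.read(in_path)

# measure integrated loudness (ITU-R BS.1770) of the clip
meter = pyln.Meter(rate)
loudness = meter.integrated_loudness(data)

# normalize every clip to a common target, e.g. -23 LUFS (placeholder value)
normalized = pyln.normalize.loudness(data, loudness, -23.0)

sf.write(out_path, normalized, rate)
```

I'd run this over every clip for both speakers so the whole corpus ends up at the same loudness; pyloudnorm will warn if a clip would clip after normalization, which is a sign the target is too hot for that recording.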
One more quick question I have is, is the network able to convert breathing sounds, coughs, voice cracks and other oddities that come up when talking? Would it be worth experimenting with that in the training data, or should I keep things as clean as possible?
I have not tested these situations, but I am worried that these non-speech parts may cause unnatural-sounding voices, so I would suggest keeping the training samples clean.
Alright, I will do my best to get clean takes then. Thank you for all your help.
I've gotten DYGANVC to train and run inference properly on the VCC2020 dataset, and I'm now recording audio to put together my own two-speaker dataset (myself and a friend) to train on.
I think I understand the dataset specification I need to follow (and I've read #6), but I'm not 100% sure what should be in the audio files themselves to produce the best results:
1) Do both speakers need to say the same transcript for the training to work properly? If it's not necessary, does it still help or does it not matter?
2) Does it matter how much silence is in the audio files? If a person stops speaking for one second or so in the middle of the WAV file, will that confuse the training?
3) Should the lengths of the audio files be relatively consistent? If most of the WAVs in my corpus end up being 1 to 5 seconds long, but I have one rambling 15-second sentence, should I chop it into multiple clips or leave it as is? (I've put a rough splitting sketch at the end of this post in case chopping is the way to go.)
4) Is there any benefit to expanding the corpus to more speakers even though I only need to convert between two of them? Or does that just add confounding variables?
5) If some of the WAV files have audible background noise while the speakers talk, does that interfere with training? (i.e., would it train the algorithm to be more resilient to background sounds, or would it just start mistaking those sounds for speech?)
6) What do you think is the minimum number of minutes per speaker that could still produce mostly passable results? And at what point do diminishing returns set in? (i.e., do you think there would be a significant quality improvement from having 30 minutes per speaker over 10 or 15 minutes?)
7) Do I need to normalize my WAV training files to the same volume or does the algorithm handle that well?
Sorry for all the questions. Even if you can only answer some, it would be a great help, and hopefully other people will find it useful as well. Thank you.
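For questions 2 and 3, this is roughly the librosa-based trimming/splitting I'd try if silence and long takes turn out to be a problem. The sample rate, the top_db threshold, and the file names are just guesses I'd tune, not anything DYGANVC requires:

```python
import librosa
import soundfile as sf

# hypothetical input: one long rambling take
# sr=24000 is an assumption; use whatever rate the rest of your corpus is in
y, sr = librosa.load("long_take.wav", sr=24000)

# drop leading/trailing silence (top_db=30 is a guess; raise/lower to taste)
y_trimmed, _ = librosa.effects.trim(y, top_db=30)

# find the non-silent regions, i.e. split on internal pauses
intervals = librosa.effects.split(y_trimmed, top_db=30)

# write each non-silent chunk out as its own clip
for i, (start, end) in enumerate(intervals):
    sf.write(f"long_take_part{i:02d}.wav", y_trimmed[start:end], sr)
```

I'd spot-check the resulting chunks by ear before adding them to the corpus, since an aggressive top_db can clip word onsets.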