MTG / WGANSing

Multi-voice singing voice synthesis

Building an extended singing corpus #28

Open ghost opened 3 years ago

ghost commented 3 years ago

I'm building a corpus from various sources to train with your model, and I have some questions:

1) Does having dual read/sung versions of the same words/songs actually benefit the produced audio? What happens if only sung songs and spoken audio (random words from dialogues/interviews) of the same singer are provided?

2) The current corpus is limited to 4 sung / 4 read recordings per singer. How many songs (or better, what total duration) per singer should be provided to improve the output signal, and up to approximately what amount will the model still converge reliably during training?

3) I've noticed that the NUS corpus is not consistent in terms of audio amplitude. Some recordings are quieter than others, and the produced audio can vary greatly. Do you think it's a good idea to normalize the input to a common loudness (see the first sketch after this list)?

4) The NUS corpus is recorded dry (without reverberation). What is the impact of reverb in this kind of model, where the input is in any case decomposed by WORLD and the early reflections (the reverb's decay) mainly affect the F0?

5) Regarding F0, there are some great new algorithms such as CREPE or, more recently, SPICE (Google AI). Do you think it's possible to combine WORLD's aperiodicity and spectral envelope with a third-party F0 analysis, or are the steps too intertwined? As I see it, in the pyworld pipeline the first call is DIO (the F0 estimation), followed by StoneMask, CheapTrick and D4C. WORLD's F0 estimation is clearly not the best, and F0 is crucial in our case (see the second sketch after this list).
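To illustrate question 3, here is a minimal sketch of the kind of loudness normalization I mean, assuming pyloudnorm is acceptable; the file names and the -23 LUFS target are only illustrative:

```python
# Sketch for question 3: bring each recording to a common loudness before
# feature extraction. File names and the -23 LUFS target are illustrative.
import soundfile as sf
import pyloudnorm as pyln

data, rate = sf.read("ADIZ_sing_01.wav")            # hypothetical NUS recording
meter = pyln.Meter(rate)                            # ITU-R BS.1770 loudness meter
loudness = meter.integrated_loudness(data)          # measure integrated loudness
normalized = pyln.normalize.loudness(data, loudness, -23.0)
sf.write("ADIZ_sing_01_norm.wav", normalized, rate)
```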
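And for question 5, a rough sketch of what I have in mind: replace DIO/StoneMask with a CREPE estimate while keeping CheapTrick and D4C. The input file name, the 0.5 confidence threshold, and the interpolation onto WORLD's frame times are my own assumptions, not something from your code:

```python
# Sketch for question 5: feed an external (CREPE) F0 to WORLD's spectral
# envelope and aperiodicity estimators via pyworld.
import numpy as np
import soundfile as sf
import pyworld as pw
import crepe

x, fs = sf.read("voice.wav")                # hypothetical input recording
x = x.astype(np.float64)                    # pyworld expects float64

# WORLD's own estimate, kept only for its frame time axis `t`.
_f0_dio, t = pw.dio(x, fs)

# External F0 from CREPE (10 ms hop by default).
time, freq, conf, _ = crepe.predict(x, fs, viterbi=True)
f0 = np.interp(t, time, freq)               # resample CREPE F0 onto WORLD frames
f0[np.interp(t, time, conf) < 0.5] = 0.0    # low-confidence frames -> unvoiced

# Spectral envelope and aperiodicity conditioned on the external F0.
sp = pw.cheaptrick(x, f0, t, fs)
ap = pw.d4c(x, f0, t, fs)
y = pw.synthesize(f0, sp, ap, fs)           # quick resynthesis sanity check
```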

Thank you so much if you can find time to answer my questions.