souvikg544 opened 1 year ago
A dataset should include audio-text pairs (text, audio), and you should rename each text transcript and audio file the way the repo does in data/test. The most important thing is to customize g2p.py for your data's language, so that you get a phoneme-level (or other unit-level) sequence.
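A minimal sketch of what that preparation step might look like, assuming a flat directory of matched `.wav`/`.txt` files. The exact layout in data/test and the output format expected by the repo's g2p.py may differ, and `my_g2p` below is a hypothetical stand-in for whatever language-specific conversion you add to g2p.py:

```python
from pathlib import Path


def my_g2p(text: str) -> list[str]:
    """Hypothetical grapheme-to-phoneme stand-in.

    Replace this with the language-specific logic you add to the repo's
    g2p.py (e.g. an espeak/phonemizer call or a lexicon lookup).
    """
    return list(text.lower())  # placeholder: character-level "phonemes"


def build_pairs(data_dir: str) -> list[dict]:
    """Collect (audio, transcript, phonemes) entries from a flat folder
    of matched <name>.wav / <name>.txt files."""
    root = Path(data_dir)
    pairs = []
    for wav in sorted(root.glob("*.wav")):
        txt = wav.with_suffix(".txt")
        if not txt.exists():
            print(f"skipping {wav.name}: no matching transcript")
            continue
        text = txt.read_text(encoding="utf-8").strip()
        pairs.append({
            "audio": str(wav),
            "text": text,
            "phonemes": my_g2p(text),
        })
    return pairs


if __name__ == "__main__":
    # "data/my_dataset" is an example path, not a directory the repo ships with.
    for item in build_pairs("data/my_dataset"):
        print(item["audio"], " ".join(item["phonemes"]))
```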
Is there any intuition about what the average duration of the audio samples should be for training? Also, the same question for the prompt audio. Does the prompt audio have to be from the same speaker, or can it be from any other speaker?
In the inference stage, the prompt audio can be from any unseen speaker or from the same speaker, as long as we make sure the model performs robustly at extracting timbre information from any prompt audio.
And in the training stage? Can the prompt be from a different speaker?
Can someone help me understand how I can build a custom dataset for this model to work? Also, if my PC has an NVIDIA GPU and I install DeepSpeed, can it still run?