souvikg544 opened 1 year ago
A dataset should include audio-text pairs (text, audio), and you should rename each text transcript and audio file the way the repo does in data/test. The most important thing is to customize g2p.py for your data's language, so that you get a phoneme-level (or other unit-level) sequence.
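A minimal sketch of what that preparation step might look like, assuming a flat directory of matched `.wav`/`.txt` files. The exact layout in data/test and the output format expected by the repo's g2p.py may differ, and `my_g2p` below is a hypothetical stand-in for whatever language-specific conversion you add to g2p.py:

```python
from pathlib import Path


def my_g2p(text: str) -> list[str]:
    """Hypothetical grapheme-to-phoneme stand-in.

    Replace this with the language-specific logic you add to the repo's
    g2p.py (e.g. an espeak/phonemizer call or a lexicon lookup).
    """
    return list(text.lower())  # placeholder: character-level "phonemes"


def build_pairs(data_dir: str) -> list[dict]:
    """Collect (audio, transcript, phonemes) entries from a flat folder
    of matched <name>.wav / <name>.txt files."""
    root = Path(data_dir)
    pairs = []
    for wav in sorted(root.glob("*.wav")):
        txt = wav.with_suffix(".txt")
        if not txt.exists():
            print(f"skipping {wav.name}: no matching transcript")
            continue
        text = txt.read_text(encoding="utf-8").strip()
        pairs.append({
            "audio": str(wav),
            "text": text,
            "phonemes": my_g2p(text),
        })
    return pairs


if __name__ == "__main__":
    # "data/my_dataset" is an example path, not a directory the repo ships with.
    for item in build_pairs("data/my_dataset"):
        print(item["audio"], " ".join(item["phonemes"]))
```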
Is there any intuition about what the average duration of the audio samples should be for training? Also, the same question for the prompt audio. Does the prompt audio have to be from the same speaker, or can it be from any other speaker?
In the inference stage, the prompt audio can be from any unseen speaker or from the same speaker, as long as we make sure the model performs robustly at extracting timbre information from any prompt audio.
And in the training stage? Can the prompt be from a different speaker?
Can someone help me understand how I can build a custom dataset for this model to work? Also, if my PC has an NVIDIA GPU and I install DeepSpeed, can it still run?