enhuiz / vall-e

An unofficial PyTorch implementation of the audio LM VALL-E

Has anyone ever tried this repo on other languages and gotten good performance? #38

Open MisakaMikoto96 opened 1 year ago

MisakaMikoto96 commented 1 year ago

Has anyone ever tried this repo on other languages and gotten good performance? 50 hours of toy data did not seem to produce intelligible speech.

airpdev commented 1 year ago

Hi @MisakaMikoto96, sorry to ask, but were you able to get good results with English? I cannot generate audio with a model trained on some of my data; the result is just noise, not voice. Looking forward to your reply. Regards, Petar

MisakaMikoto96 commented 1 year ago

> Hi @MisakaMikoto96, sorry to ask, but were you able to get good results with English? I cannot generate audio with a model trained on some of my data; the result is just noise, not voice. Looking forward to your reply. Regards, Petar

I only tried it on a tiny (~1 hour) Mandarin dataset, and I set the input prompt to be the utterance itself (i.e. not using self.sample_prompts in data.py). I got a human voice because the model overfit my dataset (tested by inputting a transcription and its related audio; it can reconstruct that audio). The reproduction of @enhuiz's work seems somewhat different from the paper. May I ask why the prompt sampling (sample_prompts) happens in data processing and only keeps the qnt, not the <phn, qnt> pair?
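For reference, a minimal sketch of the overfitting setup I mean (the function and key names below are illustrative assumptions, not the exact code in data.py):

```python
import torch

# Sketch of the "prompt is the utterance itself" training example used for the
# overfitting sanity check. Names and shapes are assumptions, not the repo's API.

def make_training_example(phoneme_ids: torch.Tensor, quant_codes: torch.Tensor) -> dict:
    """phoneme_ids: (T_text,) phoneme token ids of the transcription.
    quant_codes: (T_audio, n_codebooks) EnCodec codes of the target utterance."""
    return {
        "text": phoneme_ids,   # phn condition
        "proms": quant_codes,  # acoustic prompt = the utterance itself
        "resps": quant_codes,  # target codes the model must predict
    }

# In contrast, the repo's data.py samples `proms` from other utterances of the
# same speaker (self.sample_prompts) and keeps only that prompt's qnt, not the
# <phn, qnt> pair described in the paper.
```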

Also, in the inference stage, the paper prefers to input "text_prompt" + "text_to_be_gen" + "audio_prompt"; is there any explanation of how this is handled in your code?
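To illustrate what I mean by the paper-style input (a sketch only; phonemize, encode_audio, and model.generate are placeholders passed in for illustration, not functions from this repo):

```python
import torch

# Sketch of paper-style VALL-E conditioning at inference time. The callables are
# placeholder parameters; this is not this repo's actual API.

def paper_style_inference(model, phonemize, encode_audio,
                          prompt_text: str, target_text: str,
                          prompt_wav: torch.Tensor) -> torch.Tensor:
    # Text condition: phonemes of the enrolled prompt's transcript concatenated
    # with phonemes of the text to be generated.
    phns = phonemize(prompt_text + " " + target_text)

    # Acoustic condition: EnCodec codes of the enrolled (prompt) audio.
    prompt_codes = encode_audio(prompt_wav)

    # The AR/NAR models then continue the code sequence after the prompt codes,
    # keeping only the part that corresponds to target_text.
    return model.generate(text=phns, prompt=prompt_codes)
```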

Thanks a lot for your work!

airpdev commented 1 year ago

Hi @MisakaMikoto96, thanks for your message. Would you mind sharing your code (data.py, config.py, ar.yml)? Looking forward to hearing from you. Best regards, Petar

skysbird commented 1 year ago

I have the same confusion. But I found that this implementation only lets us infer from the target text, not from the concatenation of the prefix text + target text as described in the paper.

skysbird commented 1 year ago

> Also, in the inference stage, the paper prefers to input "text_prompt" + "text_to_be_gen" + "audio_prompt"; is there any explanation of how this is handled in your code?

This can lead to a problem when your acoustic prompt is not consistent with the prefix text.
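With a paper-style call like the one sketched earlier in this thread, the prefix text has to be the actual transcript of the acoustic prompt (the values and `load_wav` helper below are hypothetical):

```python
# If prompt_text is not what is actually spoken in prompt.wav, the text and
# acoustic conditions disagree and the continuation degrades.
codes = paper_style_inference(
    model, phonemize, encode_audio,             # placeholders from the sketch above
    prompt_text="transcript of prompt.wav",     # must match the prompt audio
    target_text="the sentence we want to synthesize",
    prompt_wav=load_wav("prompt.wav"),          # load_wav is hypothetical
)
```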