Open · sobomax opened 10 months ago
Hey, first of all, thanks for doing and publishing great work!

But coming to the practical side, I am looking at rendering my favourite set of Matrix quotes:
```
Inferring: args.text="As you can see, we've had our eye on you for some time now, Mister Anderson."
generate_audio semantic_tokens: 3.829667329788208
self.featuredir / element_id_prompt=PosixPath('demo/audios/male_voice.wav') speaker_emb.shape=(1, 512)
acoustic_tokens: 1.4472830295562744 vocoder time: 1.052943229675293
Inferring: args.text="It seems that you've been living two lives."
generate_audio semantic_tokens: 1.5347039699554443
acoustic_tokens: 0.33491015434265137 vocoder time: 0.3351314067840576
Inferring: args.text="In one life, you're Thomas A Anderson, program writer for a respectable software company."
generate_audio semantic_tokens: 4.294419050216675
acoustic_tokens: 1.0456955432891846 vocoder time: 1.1635217666625977
Inferring: args.text='You have a Social Security number, you pay your taxes, and you...help your landlady carry out her garbage.'
generate_audio semantic_tokens: 5.344205617904663
acoustic_tokens: 1.0732522010803223 vocoder time: 1.2692646980285645
Inferring: args.text='The other life is lived in computers, where you go by the hacker alias Neo and are guilty of virtually every computer crime we have a law for.'
generate_audio semantic_tokens: 6.878687143325806
acoustic_tokens: 1.191270112991333 vocoder time: 1.4288511276245117
Inferring: args.text='One of these lives has a future, and one of them does not.'
generate_audio semantic_tokens: 3.3958096504211426
acoustic_tokens: 0.30650901794433594 vocoder time: 0.46129584312438965
Inferring: args.text='Have you ever stood and stared at it, marveled at its beauty, its genius? Billions of people just living out their lives, oblivious.'
generate_audio semantic_tokens: 5.401077508926392
acoustic_tokens: 1.2113051414489746 vocoder time: 1.4591665267944336
Inferring: args.text='Did you know that the first Matrix was designed to be a perfect human world.'
generate_audio semantic_tokens: 3.0609078407287598
acoustic_tokens: 0.4051539897918701 vocoder time: 0.49598193168640137
Inferring: args.text='Where none suffered.'
generate_audio semantic_tokens: 1.1007585525512695
acoustic_tokens: 0.275745153427124 vocoder time: 0.3767883777618408
Inferring: args.text='Where everyone would be happy.'
generate_audio semantic_tokens: 1.5478556156158447
acoustic_tokens: 0.3160576820373535 vocoder time: 0.45749568939208984
Inferring: args.text='It was a disaster.'
generate_audio semantic_tokens: 1.159377098083496
acoustic_tokens: 0.2947394847869873 vocoder time: 0.3695690631866455
Inferring: args.text='No one would accept the program.'
generate_audio semantic_tokens: 1.6103193759918213
acoustic_tokens: 0.29689860343933105 vocoder time: 0.3959059715270996
Inferring: args.text='Entire crops were lost.'
generate_audio semantic_tokens: 1.473541021347046
acoustic_tokens: 1.0539555549621582 vocoder time: 1.0956377983093262
Inferring: args.text='Some believed that we lacked the programming language to describe your perfect world.'
generate_audio semantic_tokens: 4.125181674957275
acoustic_tokens: 1.230463981628418 vocoder time: 1.5059840679168701
Inferring: args.text='But I believe that as a species, human beings define their reality through misery and suffering.'
```
The s2a (semantic-to-acoustic) stage is indeed quite fast; the t2s (text-to-semantic) stage, however, is absolutely horrible. With SpeechT5 I can get t2s done in 100-200 ms regardless of the prompt length, and at a batch size of 50-100. Am I missing something here?

P.S. Just for reference, I can render 50 of those prompts in a batch with SpeechT5 in about 15 seconds of wall time on the same HW (nothing too fancy, a consumer-grade Intel A770), with the first audio starting to come out in just 0.6 seconds on all 50 of them. So assuming semantic generation is batchable, by the time it completed I'd already be halfway through, and some of them would already be done playing.

Here is my code if you are curious: https://github.com/sippy/Infernos/blob/main/HelloSippyTTSRT/HelloSippyRTPipe.py
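For concreteness, here is a minimal single-utterance sketch of that SpeechT5 path using the stock transformers API (standard Microsoft checkpoints; the batched, streaming variant is the custom code in the linked HelloSippyRTPipe.py):

```python
# Minimal single-utterance SpeechT5 TTS via the stock transformers API:
# processor (text -> tokens), acoustic model (tokens -> mel), vocoder
# (mel -> waveform). The batched/streaming pipeline is custom code.
import torch
from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan

processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts").eval()
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan").eval()

inputs = processor(text="It seems that you've been living two lives.",
                   return_tensors="pt")
# Placeholder speaker embedding; a real 512-dim x-vector (e.g. from the
# CMU ARCTIC xvectors dataset) is needed for sensible-sounding output.
speaker_emb = torch.zeros(1, 512)
with torch.no_grad():
    speech = model.generate_speech(inputs["input_ids"], speaker_emb,
                                   vocoder=vocoder)  # 1-D 16 kHz waveform
```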
Hi, thanks for reaching out. Reaching that inference speed requires a TensorRT-LLM deployment; we are working to open-source this in the upcoming weeks.

As for the stage gap: t2s is an AR (autoregressive) model, so it is slower than s2a.
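To unpack that: an autoregressive model emits one token per forward pass, so t2s latency grows with the length of the semantic-token sequence, whereas s2a and the vocoder typically run in a small, fixed number of passes. A generic sketch of that cost structure (GPT-2 as a stand-in LM here, not this project's actual t2s model):

```python
# Why AR decoding is slow: one full forward pass per generated token,
# strictly sequential. GPT-2 stands in for any autoregressive model;
# the cost structure, not the weights, is the point here.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

ids = tok("Hello there", return_tensors="pt").input_ids
past = None
with torch.no_grad():
    for _ in range(50):                      # 50 tokens => 50 forward passes
        out = model(ids if past is None else ids[:, -1:],
                    past_key_values=past, use_cache=True)
        past = out.past_key_values           # KV-cache avoids recomputation,
        next_id = out.logits[:, -1].argmax(-1, keepdim=True)
        ids = torch.cat([ids, next_id], dim=-1)  # but steps stay sequential
```

The flip side is that each of those sequential passes can carry a whole batch of prompts at roughly the same per-step latency, which is why batched AR generation amortizes well and why the batch-50 SpeechT5 numbers above look so much better.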