Open · sobomax opened 10 months ago
Hey, first of all, thanks for doing and publishing great work!

But coming to the practical side, I am looking at rendering my favourite set of Matrix quotes:
```
Inferring: args.text="As you can see, we've had our eye on you for some time now, Mister Anderson."
generate_audio semantic_tokens: 3.829667329788208
self.featuredir / element_id_prompt=PosixPath('demo/audios/male_voice.wav') speaker_emb.shape=(1, 512)
acoustic_tokens: 1.4472830295562744 vocoder time: 1.052943229675293
Inferring: args.text="It seems that you've been living two lives."
generate_audio semantic_tokens: 1.5347039699554443
acoustic_tokens: 0.33491015434265137 vocoder time: 0.3351314067840576
Inferring: args.text="In one life, you're Thomas A Anderson, program writer for a respectable software company."
generate_audio semantic_tokens: 4.294419050216675
acoustic_tokens: 1.0456955432891846 vocoder time: 1.1635217666625977
Inferring: args.text='You have a Social Security number, you pay your taxes, and you...help your landlady carry out her garbage.'
generate_audio semantic_tokens: 5.344205617904663
acoustic_tokens: 1.0732522010803223 vocoder time: 1.2692646980285645
Inferring: args.text='The other life is lived in computers, where you go by the hacker alias Neo and are guilty of virtually every computer crime we have a law for.'
generate_audio semantic_tokens: 6.878687143325806
acoustic_tokens: 1.191270112991333 vocoder time: 1.4288511276245117
Inferring: args.text='One of these lives has a future, and one of them does not.'
generate_audio semantic_tokens: 3.3958096504211426
acoustic_tokens: 0.30650901794433594 vocoder time: 0.46129584312438965
Inferring: args.text='Have you ever stood and stared at it, marveled at its beauty, its genius? Billions of people just living out their lives, oblivious.'
generate_audio semantic_tokens: 5.401077508926392
acoustic_tokens: 1.2113051414489746 vocoder time: 1.4591665267944336
Inferring: args.text='Did you know that the first Matrix was designed to be a perfect human world.'
generate_audio semantic_tokens: 3.0609078407287598
acoustic_tokens: 0.4051539897918701 vocoder time: 0.49598193168640137
Inferring: args.text='Where none suffered.'
generate_audio semantic_tokens: 1.1007585525512695
acoustic_tokens: 0.275745153427124 vocoder time: 0.3767883777618408
Inferring: args.text='Where everyone would be happy.'
generate_audio semantic_tokens: 1.5478556156158447
acoustic_tokens: 0.3160576820373535 vocoder time: 0.45749568939208984
Inferring: args.text='It was a disaster.'
generate_audio semantic_tokens: 1.159377098083496
acoustic_tokens: 0.2947394847869873 vocoder time: 0.3695690631866455
Inferring: args.text='No one would accept the program.'
generate_audio semantic_tokens: 1.6103193759918213
acoustic_tokens: 0.29689860343933105 vocoder time: 0.3959059715270996
Inferring: args.text='Entire crops were lost.'
generate_audio semantic_tokens: 1.473541021347046
acoustic_tokens: 1.0539555549621582 vocoder time: 1.0956377983093262
Inferring: args.text='Some believed that we lacked the programming language to describe your perfect world.'
generate_audio semantic_tokens: 4.125181674957275
acoustic_tokens: 1.230463981628418 vocoder time: 1.5059840679168701
Inferring: args.text='But I believe that as a species, human beings define their reality through misery and suffering.'
```
The s2a (semantic-to-acoustic) stage is indeed quite fast; the t2s (text-to-semantic) stage, however, is absolutely horrible. With SpeechT5 I can get t2s done in 100-200 ms regardless of the prompt length, and at a batch size of 50-100. Am I missing something here?

P.S. Just for reference, I can render 50 of those prompts in a batch with SpeechT5 in about 15 seconds of wall time on the same HW (nothing too fancy, a consumer-grade Intel A770), with the first audio starting to come out in just 0.6 seconds on all 50 of them. So assuming semantic generation is batchable, by the time it completed I'd already be halfway through, and some of them would already be done playing.

Here is my code if you are curious: https://github.com/sippy/Infernos/blob/main/HelloSippyTTSRT/HelloSippyRTPipe.py
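For concreteness, here is a minimal single-utterance sketch of that SpeechT5 path using the stock transformers API (standard Microsoft checkpoints; the batched, streaming variant is the custom code in the linked HelloSippyRTPipe.py):

```python
# Minimal single-utterance SpeechT5 TTS via the stock transformers API:
# processor (text -> tokens), acoustic model (tokens -> mel), vocoder
# (mel -> waveform). The batched/streaming pipeline is custom code.
import torch
from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan

processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts").eval()
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan").eval()

inputs = processor(text="It seems that you've been living two lives.",
                   return_tensors="pt")
# Placeholder speaker embedding; a real 512-dim x-vector (e.g. from the
# CMU ARCTIC xvectors dataset) is needed for sensible-sounding output.
speaker_emb = torch.zeros(1, 512)
with torch.no_grad():
    speech = model.generate_speech(inputs["input_ids"], speaker_emb,
                                   vocoder=vocoder)  # 1-D 16 kHz waveform
```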
Hi, thanks for reaching out. Reaching that inference speed requires a TensorRT-LLM deployment; we are working to open-source this in the upcoming weeks.

As for the stage gap: t2s is an AR (autoregressive) model, so it is slower than s2a.
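To unpack that: an autoregressive model emits one token per forward pass, so t2s latency grows with the length of the semantic-token sequence, whereas s2a and the vocoder typically run in a small, fixed number of passes. A generic sketch of that cost structure (GPT-2 as a stand-in LM here, not this project's actual t2s model):

```python
# Why AR decoding is slow: one full forward pass per generated token,
# strictly sequential. GPT-2 stands in for any autoregressive model;
# the cost structure, not the weights, is the point here.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

ids = tok("Hello there", return_tensors="pt").input_ids
past = None
with torch.no_grad():
    for _ in range(50):                      # 50 tokens => 50 forward passes
        out = model(ids if past is None else ids[:, -1:],
                    past_key_values=past, use_cache=True)
        past = out.past_key_values           # KV-cache avoids recomputation,
        next_id = out.logits[:, -1].argmax(-1, keepdim=True)
        ids = torch.cat([ids, next_id], dim=-1)  # but steps stay sequential
```

The flip side is that each of those sequential passes can carry a whole batch of prompts at roughly the same per-step latency, which is why batched AR generation amortizes well and why the batch-50 SpeechT5 numbers above look so much better.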