NVIDIA / DeepLearningExamples

State-of-the-Art Deep Learning scripts organized by models - easy to train and deploy with reproducible accuracy and performance on enterprise-grade infrastructure.

Tacotron2 gets slower using PyTorch TRT #545

Open · terryyizhong opened this issue 4 years ago

terryyizhong commented 4 years ago

Hi, I followed your guide and set up the Docker environment correctly. I ran the scripts successfully, but the results I got are weird.

My GPU is a V100, and I tested three types of inference: no AMP, AMP (FP16), and PyTorch TRT. The total latency decreases as expected, and the WaveGlow latency also decreases as expected, but the Tacotron2 latency with TRT actually gets longer:

| Inference mode | Tacotron2 (s) | WaveGlow (s) | Total (s) |
|---|---|---|---|
| No AMP | 1.08 | 0.46065 | 1.54 |
| AMP (FP16) | 1.053 | 0.3097 | 1.3627 |
| PyTorch TRT | 1.208 | 0.1374 | 1.34 |
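For reference, here is a minimal sketch of how I measure the per-component latencies, assuming `tacotron2` and `waveglow` are already loaded (e.g. as in the repo's inference.py) and `sequences`/`lengths` are a prepared, padded input batch; the exact `infer()` signatures may differ between repo versions:

```python
import time
import torch

def timed(fn, *args):
    # Synchronize around the call so asynchronous CUDA kernels are included in the timing
    torch.cuda.synchronize()
    start = time.perf_counter()
    out = fn(*args)
    torch.cuda.synchronize()
    return out, time.perf_counter() - start

with torch.no_grad():
    # Tacotron2: text -> mel spectrogram (autoregressive decoder loop)
    (mel, mel_lengths, alignments), taco_latency = timed(tacotron2.infer, sequences, lengths)
    # WaveGlow: mel spectrogram -> waveform (single feed-forward pass)
    audio, waveglow_latency = timed(waveglow.infer, mel)

print(f"tacotron2 {taco_latency:.4f} s | waveglow {waveglow_latency:.4f} s | "
      f"total {taco_latency + waveglow_latency:.4f} s")
```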

The Tacotron2 checkpoint I use is: https://ngc.nvidia.com/catalog/models/nvidia:tacotron2pyt_fp16/files?version=3

Since you only provide TRT acceleration results for a T4 GPU, I want to know whether there is anything wrong with my results. Why does TRT make Tacotron2 inference slower?

Thanks for your patience; looking forward to your reply.

terryyizhong commented 4 years ago

One more question: the T4 inference performance results differ significantly between https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/SpeechSynthesis/Tacotron2 and https://github.com/NVIDIA/DeepLearningExamples/blob/master/PyTorch/SpeechSynthesis/Tacotron2/trt/README.md

I would like to know what causes this difference for the FP16 results. @GrzegorzKarchNV

machineko commented 4 years ago

You can see that the results in the main README haven't been updated for much longer than the results in the TRT README. Even so, you can check https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/SpeechSynthesis/Tacotron2/trtis_cpp for even better results. I've tested trtis_cpp on both an RTX 2080 Ti and a T4 with FP16 runs, and in both cases the results were much faster than PyTorch (on the T4 the results were slower than those presented in the repo, but probably because I used AWS and also a different sequence length).

terryyizhong commented 4 years ago

> I've tested trtis_cpp on both an RTX 2080 Ti and a T4 with FP16 runs, and in both cases the results were much faster than PyTorch (on the T4 the results were slower than those presented in the repo, but probably because I used AWS and also a different sequence length).

Thanks for your reply. I tested trtis_cpp and get better inference speed now. But I cannot test with an input sequence length of 128 as shown in the README results. The engine generates 47.6 s of audio when I use the default text and settings in run_trtis_benchmark_client.sh. Did you encounter this problem?
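For context on the 47.6 s figure: with the default STFT settings in this repo (hop length 256, sampling rate 22050 Hz), audio duration maps directly to the number of mel frames the decoder emits, so an over-long decoder run shows up as over-long audio. A rough check (the 47.6 s value is just the one reported above):

```python
# Assumed defaults from the Tacotron2/WaveGlow configs in this repo
hop_length = 256        # STFT hop length in samples
sampling_rate = 22050   # audio sampling rate in Hz

seconds_per_frame = hop_length / sampling_rate   # ~0.0116 s of audio per mel frame
frames = 47.6 / seconds_per_frame                # ~4100 decoder steps for 47.6 s of audio
print(f"{seconds_per_frame:.4f} s/frame -> ~{frames:.0f} decoder steps")
```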

machineko commented 4 years ago

Nope, I never used the benchmark scripts, and I also used my own trained models.