isaacleeai opened this issue 5 years ago
@isaacleeai What's your batch size? Also, which GPU are you using? Seems like one of the 10 series from the screenshots you posted.
Before I start, let's consider that there are 2 types of inefficiency: inefficient computation (CUDA activity on the timeline, but poor GPU utilization) and idle time (the GPU not doing anything at all, i.e. a blank timeline). The former should really be referred to as poor utilization within the kernel, so I'll talk about the latter simply as idle time.
Note: the profiler itself adds overhead which widens the gaps between kernels. If there are many small kernels, the gaps can be very large relative to kernel durations. For longer-duration kernels, the profiler overhead is negligible. For kernels lasting ~10 us, the profiler exaggerates the gaps by about 1.5x in many cases. That's a number obtained with nvprof; Nsight Systems records more data, so it may have even higher overhead, depending on whether you're profiling the CPU or not. I'd recommend not profiling the CPU when you're just looking at the GPU situation.
There are several issues at play here:
For small batch sizes, especially 1, a compute-bound problem becomes a bandwidth-bound problem (e.g. a GEMM becomes a GEMV), so the GPU is used inefficiently. For example, for Tesla V100 the guideline is that to make the best use of tensor cores one should have 125 TFLOPS (total tensor core compute) / 950 GB/s (DRAM bandwidth), or ~131.6 FLOPs per byte moved. That is almost never the case with small batch sizes. Also, even in fp32, small batch sizes, and in particular batch size 1, result in no on-chip data reuse, which is also why it's a bandwidth-bound problem rather than a compute-bound one at that point. This falls under poor GPU utilization during kernel execution (see the back-of-the-envelope sketch after this list).
WaveGlow can still do a lot of work in parallel for tiny batch sizes because of the nature of that architecture. Tacotron is autoregressive, WaveGlow is not. For Tacotron, the main driver of efficient GPU utilization during kernel execution is a large batch size.
PyTorch doesn't have good scheduling due to eager mode. Between-kernel gaps (idle time) due to launch latencies are about 3 microseconds when the work is pipelined (think launching one CUDA kernel after another in C++), but without pipelining, e.g. when there are extra Python overheads, an exposed launch latency can be up to 10 microseconds. In many cases, PyTorch's ATen linalg backend itself adds another 10 microseconds.
For small batch sizes (e.g. bs=1), kernels take less time since there's less work to do. So, you end up getting hit first by low GPU utilization when the kernel is executing, and then the kernel finishes quickly and the Python and PyTorch (ATen) overheads add up to expose a bigger gap between kernels. I saw many cases with batch size 1 where a Tacotron kernel would take 8 microseconds to execute, and 25 or more to schedule.
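To make the first point concrete, here's a back-of-the-envelope sketch of the FLOPs-per-byte argument. The matrix sizes are made up for illustration and the byte count ignores any cache reuse, but it shows why batch size 1 (a GEMV) sits far below the V100 ratio quoted above:

```python
# Rough arithmetic intensity (FLOPs per byte of DRAM traffic) for an
# (m x k) weight applied to a (k x n) activation batch in fp16.
def flops_per_byte(m, k, n, bytes_per_elem=2):
    flops = 2 * m * k * n                                    # one multiply-add per output element per k
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)   # weights + inputs + outputs, no reuse assumed
    return flops / bytes_moved

target = 125e12 / 950e9  # ~131.6 FLOPs/byte needed to keep V100 tensor cores busy
print(f"target ratio:    {target:.1f}")
print(f"GEMM, batch 512: {flops_per_byte(1024, 1024, 512):.1f}")  # ~256: compute-bound
print(f"GEMV, batch 1:   {flops_per_byte(1024, 1024, 1):.1f}")    # ~1: bandwidth-bound
```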
There are generally 2 things that can be done to improve GPU utilization:
Increase the batch size. This will reduce idle time and increase the number of batches processed per second, but it will increase the latency to the first result. For instance, with a larger batch size kernel durations will be longer, and you may need to wait for incoming requests to queue up to form the larger batch in the first place (unless you have "data at rest," i.e. an on-disk dataset, in which case increasing the batch size is a no-brainer).
Just-in-time (JIT) compile the model. The details about how to use the Python APIs to JIT the model can be found here. The JIT does 2 things: it fuses many pointwise layers, eliminating launch latencies and doing several layers' worth of work in one kernel, and it schedules a bunch of work together in C++ for things it cannot fuse, so the hop between C++ and Python is eliminated, reducing gaps between kernels.
We are planning to share JITted code at some point, but feel free to do some of the work on your own for now if you need it right away. Here is a small example of code that PyTorch would JIT into a single CUDA kernel, eliminating gaps. There's way more that the JIT can do for better scheduling, as I mentioned above. Make sure to check it out.
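The linked example isn't reproduced in this thread, so here's an illustrative stand-in (the function name and shapes are made up): a chain of pointwise ops that, when scripted, PyTorch's fuser can compile into a single CUDA kernel after a couple of warm-up calls, instead of launching one kernel per op as in eager mode.

```python
import torch

@torch.jit.script
def gated_activation(x, y):
    # A handful of pointwise ops: eager mode launches a separate CUDA kernel
    # for each, each paying its own launch latency; the JIT fuser emits one kernel.
    return torch.tanh(x) * torch.sigmoid(y) + 1.0

a = torch.randn(1, 256, 88, device="cuda")
b = torch.randn(1, 256, 88, device="cuda")
out = gated_activation(a, b)   # after warm-up runs, this executes as a single fused kernel
```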
Thanks a lot for your detailed analysis.
For small batch sizes, especially 1, a compute-bound problem becomes a bandwidth-bound problem (e.g. a GEMM becomes a GEMV), so the GPU is used inefficiently. For example, for Tesla V100 the guideline is that to make the best use of tensor cores one should have 125 TFLOPS (total tensor core compute) / 950 GB/s (DRAM bandwidth), or ~131.6 FLOPs per byte moved. That is almost never the case with small batch sizes. Also, even in fp32, small batch sizes, and in particular batch size 1, result in no on-chip data reuse, which is also why it's a bandwidth-bound problem rather than a compute-bound one at that point. This falls under poor GPU utilization during kernel execution.
Okay, so the small amount of computation done on the data makes the kernel bandwidth-bound instead of compute-bound. Makes sense.
PyTorch doesn't have good scheduling due to eager mode. Between-kernel gaps (idle time) due to launch latencies are about 3 microseconds when the work is pipelined (think launching one CUDA kernel after another in C++), but without pipelining, e.g. when there are extra Python overheads, an exposed launch latency can be up to 10 microseconds. In many cases, PyTorch's ATen linalg backend itself adds another 10 microseconds.
I assumed that the gaps between CUDA API calls were overhead added by the PyTorch library, but I never actually profiled it. So it's great that you have confirmed it.
WaveGlow can still do a lot of work in parallel for tiny batch sizes because of the nature of that architecture. Tacotron is autoregressive, WaveGlow is not. For Tacotron, the main driver of efficient GPU utilization during kernel execution is a large batch size.
This was the most puzzling part for me -- why Tacotron2 would yield such high idle time when Waveglow doesn't. You're saying it's simply due to the difference between the two architectures. Am I right?
It fuses many pointwise layers, eliminating launch latencies and doing several layers' worth of work in one kernel. It schedules a bunch of work together in C++ for things it cannot fuse, so the hop between C++ and Python is eliminated, reducing gaps between kernels.
This makes it sound like there wouldn't be much difference between using TensorRT and the PyTorch JIT, as TensorRT's main method of increasing performance is grouping small kernels into a large one (correct me if I'm wrong). Do you have a detailed benchmark comparing the two frameworks?
We are planning to share JITted code at some point, but feel free to do some of the work on your own for now if you need it right away.
Just out of curiosity, since this is published by Nvidia, why not use TensorRT?
@ThisIsIsaac Thanks for the reply. A couple of clarifications:
I assumed that the gaps between CUDA API calls were overhead added by the PyTorch library, but I never actually profiled it. So it's great that you have confirmed it.
That's only partly the case, as I mentioned. It really depends on a lot of things, such as CPU utilization, whether the latency is hidden by pipelined launches, whether Python is running garbage collection, etc. I did find that JITting the code reduces the latency, but every framework does add a little bit of overhead. In general, CUDA itself will add 3 us between kernels, and 10 us if there is no back-to-back scheduling. With CUDA Graphs, scheduling can be even more efficient, but so far they haven't been added to frameworks. I assume that's coming in the foreseeable future.
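To make the pipelined vs. exposed launch latency distinction concrete, here's a rough, illustrative timing sketch (the kernel and tensor sizes are arbitrary, and the exact numbers will vary with GPU, driver, and CPU):

```python
import time
import torch

x = torch.randn(64, 64, device="cuda")   # tiny tensor, so each kernel finishes almost instantly

def time_launches(iters: int, expose_latency: bool) -> float:
    torch.cuda.synchronize()
    start = time.perf_counter()
    y = x
    for _ in range(iters):
        y = y + 1.0                       # one small pointwise kernel per iteration
        if expose_latency:
            torch.cuda.synchronize()      # wait for the kernel, so the next launch cannot be pipelined
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters * 1e6   # microseconds per kernel

print(f"pipelined launches: {time_launches(2000, False):.1f} us/kernel")
print(f"exposed launches:   {time_launches(2000, True):.1f} us/kernel")
```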
This was the most puzzling part for me -- why Tacotron2 would yield such high idle time when Waveglow doesn't. You're saying it's simply due to the difference between the two architectures. Am I right?
In practice, for short utterances, e.g. when generating streaming audio 88 mel frames (~1 second at 22.05 kHz) at a time, Tacotron2 actually takes slightly less time per step than WaveGlow. The 88-mel scenario is one in which you can generate the first second of audio in a small fraction of a second (under 200 ms, possibly less) and start streaming it while simultaneously generating new audio. This is desirable because users have low latency tolerances, so one needs a low time to first audio while still generating faster than real time, e.g. to leave enough of a time budget to cover audio transmission over the internet when using cloud inference.
In the above scenario, Tacotron takes less time than WaveGlow, but Tacotron's GPU utilization is much worse than WaveGlow's. WaveGlow is reversible-flow based, and it's perfectly parallelizable. This is unlike WaveNet, for example, which is auto-regressive. In autoregressive models, parallelism is limited mostly to whatever computation the current time step offers. So, if there isn't a lot of compute per layer in the current time step (e.g. a small batch size), then not all CUDA cores are utilized and each kernel takes little time, often too little to make the launch latency of the next kernel small relative to the total computation time.
I recommend comparing WaveGlow to WaveNet to appreciate this difference. Tacotron, just like WaveNet, RNN-based seq2seq models and some others, is auto-regressive, hence it's more prone to small-batch performance problems.
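A toy sketch of the structural difference (the modules and shapes below are stand-ins, not the real Tacotron2/WaveGlow code): the non-autoregressive path gets one large launch over all time steps, while the autoregressive loop only ever exposes one step's worth of work per launch.

```python
import torch

batch, channels, steps = 1, 80, 88
x = torch.randn(batch, channels, steps, device="cuda")

# Non-autoregressive (WaveGlow-like): all time steps are processed in parallel,
# so even at batch size 1 each kernel sees 88 steps' worth of work.
conv = torch.nn.Conv1d(channels, channels, kernel_size=3, padding=1).cuda()
y_parallel = conv(x)

# Autoregressive (Tacotron-like): step t depends on the output of step t-1,
# so the GPU only ever sees one step's worth of (GEMV-sized) work per launch.
cell = torch.nn.LSTMCell(channels, channels).cuda()
h = torch.zeros(batch, channels, device="cuda")
c = torch.zeros(batch, channels, device="cuda")
inp = torch.zeros(batch, channels, device="cuda")
outputs = []
for t in range(steps):
    h, c = cell(inp, (h, c))
    inp = h                    # the dependency that serializes the loop
    outputs.append(h)
```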
Just out of curiosity, since this is published by Nvidia, why not use TensorRT?
NVIDIA does a lot of things, both software engineering and research. We open-source a lot of things in both areas. Examples of software engineering extensions that boost performance and extend frameworks include DALI for high-performance image-based data pipelines and Apex for easy-to-use fp16 mixed precision and other extra features, such as synchronous BatchNorm. WaveGlow and Tacotron were released by one of NVIDIA's research teams, so the focus was not necessarily on maximum performance. Of course, some care was taken to get good performance, e.g. WaveGlow was designed for maximum performance at the architectural level. The PyTorch port of Tacotron2 from TensorFlow was released in large part because it was necessary to supply WaveGlow with mel spectrograms, but we realize that the architecture is suboptimal and that one could still do more to optimize performance.
It's also important to realize that the primary goal of open-sourcing Tacotron2 and WaveGlow wasn't max perf, but reproducibility. The WaveGlow code accompanied the paper, and it's pretty clear that the community expects the ability to reproduce results when a new publication is released. After publishing, one can iterate on optimizing performance.
Since the Tacotron2 PyTorch port was released to enable users to reproduce training on existing datasets as well as to let users train on their own data, inference was a secondary consideration. Again, WaveGlow has been designed from the ground up to provide extremely fast inference; Tacotron was simply a port of an existing text-to-mel-spectrogram network to enable WaveGlow, not to beat the performance or accuracy state of the art.
Given the primary focus on training reproducibility and WaveGlow's training and inference perf, Tacotron's perf was of secondary concern. We definitely wouldn't have wanted to write Tacotron in TensorRT for those reasons. Of course it's perfectly reasonable to say that one may eventually want a TensorRT port, but IMHO this also wouldn't be the primary mission of a research team, from which this implementation originates. It's possible that NVIDIA might release a TensorRT port some day.
As you might know, TensorRT also has limitations: PyTorch supports layers that TensorRT does not, so while some models can be exported from PyTorch to ONNX and then loaded directly into TensorRT, that's not the case for all models. This sometimes requires writing TensorRT plugins. However, if you're a TensorFlow or MXNet user, those frameworks already have TensorRT integrated into the runtime. You can simply choose to switch to TensorRT inference, the graph partitioner will extract subgraphs capable of being executed by TensorRT, and the framework will execute the remaining nodes that currently don't have a TensorRT implementation. You can find more information about TF-TensorRT here. The documentation for MXNet-TensorRT can be found here.
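For anyone who wants to try the PyTorch route anyway, the usual path starts with an ONNX export, after which the ONNX file is parsed by TensorRT. A minimal sketch, with a placeholder model standing in for a real network:

```python
import torch

# Placeholder model; a real network would go here, and any layer TensorRT
# can't parse would need a plugin or would have to stay in the framework.
model = torch.nn.Sequential(
    torch.nn.Conv1d(80, 256, kernel_size=3, padding=1),
    torch.nn.ReLU(),
).eval().cuda()

dummy_input = torch.randn(1, 80, 88, device="cuda")   # (batch, mel channels, frames)
torch.onnx.export(model, dummy_input, "model.onnx", opset_version=10)
# model.onnx can then be consumed by TensorRT's ONNX parser or the trtexec tool.
```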
Thanks for all the wonderful details. And I really appreciate the time you took to answer all my questions with such precision.
@ThisIsIsaac Glad to help!
Can I run inference in parallel using waveglow, or is the operation done serially (infer one text and then another, i.e., queue-based)? Does waveglow support streaming during inference?
@ajaysg Serially and no.
Tacotron's output, not its input, is streaming by design. WaveGlow does support streaming inference, but the feature has not been implemented.
@rafaelvalle is it going to be released soon?
@ajaysg try using CUDA MPS if you want multiple kernels to run concurrently (actually concurrently instead of timeslicing)
Although both Waveglow and Tacotron2 use the same version of PyTorch, on inference -- NOT training -- Tacotron2 displays really low utilization, while Waveglow shows 100% utilization. I would like to share the results of the profiling I have done with Nsight Systems to show you how, for most of the time, when running
python inference.py
with tacotron2, the GPU is mostly idle. In contrast, waveglow's inference, run with python inference.py
in the waveglow repo, shows nearly 100% utilization. This is an image of tacotron's utilization. The sky-blue bar in the "CUDA" row shows the time when there was ANY computation done on the GPU. Absence of the blue bar indicates NO computation whatsoever on the GPU.
If you zoom in further:
And further:
There are HUGE gaps between the blue segments, indicating that for most of the time, the GPU is idle when running tacotron2.
Contrast that to the image of waveglow's profile result:
Even when I zoom in further, there is hardly any gap in the sky-blue bar.