NVIDIA / nv-wavenet

Reference implementation of real-time autoregressive wavenet inference
BSD 3-Clause "New" or "Revised" License

question about performance #18

Closed suntao2012 closed 6 years ago

suntao2012 commented 6 years ago

I have built for Pascal/Volta with sm_60/sm_70, and it runs well on both P100 and V100. However, when I keep all parameters the same and only change the precision (fp16 vs. fp32), I get similar performance in PERSISTENT mode on both P100 and V100, yet very different performance in the other modes (SINGLE, DUAL). So my first question is: why does this happen?

Second question: the V100 has Tensor Cores. Does this source code use them?

And my last question: the V100 is the newer part, so it should be faster, yet its measured performance is close to, and in some cases slightly below, the P100's. Why?

| | GP100 | V100-PCIE |
| --- | --- | --- |
| Graphics clock | 1556 MHz | 1380 MHz |
| Memory clock | 715 MHz | 877 MHz |
| Medium-Single FP16 | 28.04 | 26.97 |
| Medium-Single FP32 | 10.69 | 11.26 |
| Medium-Dual FP16 | 32.40 | 31.02 |
| Medium-Dual FP32 | 14.68 | 15.28 |
| Medium-Persistent FP16 | 41.85 | 42.83 |
| Medium-Persistent FP32 | 37.50 | 42.02 |

BrianPharris commented 6 years ago

Regarding fp16/fp32 performance, the single- and dual-block implementations are limited by the time to stream weights into the Streaming Multiprocessor -- fp16 weights are half the size of fp32 weights. The persistent variant, on the other hand, loads all weights into the Streaming Multiprocessor registers so the weight bandwidth is not in the performance-critical path. This is discussed more fully on the blog post at https://devblogs.nvidia.com/nv-wavenet-gpu-speech-synthesis/.
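A back-of-envelope sketch of this reasoning (illustrative only; the weight count and bandwidth below are assumed round numbers, not measurements from nv-wavenet):

```python
# In SINGLE/DUAL mode, every generated sample must re-stream the full weight
# set from DRAM into the SM, so per-sample time is roughly
# weight_bytes / memory_bandwidth.  Halving the weight size (fp32 -> fp16)
# then roughly halves the streaming time.  In PERSISTENT mode the weights
# stay resident in registers, so this term drops out of the critical path.

def samples_per_sec_streaming(num_weights, bytes_per_weight, bandwidth_bytes_per_sec):
    """Throughput when each sample must re-read all weights from DRAM."""
    bytes_per_sample = num_weights * bytes_per_weight
    return bandwidth_bytes_per_sec / bytes_per_sample

# Hypothetical figures: 20M weights, 500 GB/s effective weight bandwidth.
NUM_WEIGHTS = 20_000_000
BANDWIDTH = 500e9

fp32 = samples_per_sec_streaming(NUM_WEIGHTS, 4, BANDWIDTH)
fp16 = samples_per_sec_streaming(NUM_WEIGHTS, 2, BANDWIDTH)

print(f"fp32: {fp32:,.0f} samples/s")
print(f"fp16: {fp16:,.0f} samples/s")   # ~2x fp32, matching the SINGLE/DUAL gap
print(f"speedup: {fp16 / fp32:.1f}x")
```

This simple model predicts the ~2x fp16-over-fp32 gap seen in the Single/Dual rows of the table, and predicts no precision-driven gap once weight streaming leaves the critical path, as in the Persistent rows.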

Regarding P100 vs V100: Since this is just batch=1, performance is mostly a function of core clock. Since your P100 clock is higher than V100, it is expected that it will perform better on a model which fits into both GPUs. Since V100 is larger, it can support larger models in the persistent mode. V100 will also provide higher throughput (batch size) than P100 in the single- and dual-cta modes, as it can run more blocks in parallel.
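A first-order estimate of that clock argument (a sketch only; real kernels have other limiters, and architectural differences between Pascal and Volta can offset a raw clock advantage):

```python
# For batch=1 persistent inference the kernel is largely bound by how fast a
# single SM's pipeline ticks, so a rough expected performance ratio between
# two GPUs is the ratio of their core clocks.  Clocks are from the table above.

P100_CLOCK_MHZ = 1556
V100_CLOCK_MHZ = 1380

expected_ratio = P100_CLOCK_MHZ / V100_CLOCK_MHZ
print(f"expected P100/V100 ratio from clocks alone: {expected_ratio:.2f}")  # ~1.13
```

That the measured Persistent rows show V100 at parity or slightly ahead despite its lower clock suggests per-clock architectural improvements in Volta roughly cancel the clock deficit for this workload.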