Closed: suntao2012 closed this issue 6 years ago
Regarding fp16/fp32 performance: the single- and dual-block implementations are limited by the time it takes to stream weights into the Streaming Multiprocessor, and fp16 weights are half the size of fp32 weights. The persistent variant, on the other hand, loads all weights into the Streaming Multiprocessor registers once, so weight bandwidth is not on the performance-critical path. This is discussed more fully in the blog post at https://devblogs.nvidia.com/nv-wavenet-gpu-speech-synthesis/.
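As a rough illustration (this is a toy sketch, not the actual nv-wavenet kernels; the names and sizes are made up), the difference between the two strategies looks like:

```cuda
#include <cstdio>
#include <cuda_fp16.h>

constexpr int N_WEIGHTS = 4096;   // hypothetical per-layer weight count
constexpr int THREADS   = 128;

// SINGLE/DUAL style: weights are re-read from global memory on every
// step, so each step pays the full weight-bandwidth cost -- fp16 moves
// half the bytes of fp32, hence the large fp16/fp32 gap in these modes.
__global__ void streaming_step(const __half* __restrict__ weights,
                               const float* __restrict__ x,
                               float* __restrict__ y) {
    float acc = 0.f;
    for (int i = threadIdx.x; i < N_WEIGHTS; i += blockDim.x)
        acc += __half2float(weights[i]) * x[i];   // global weight traffic
    atomicAdd(y, acc);
}

// PERSISTENT style: each thread copies its slice of the weights into
// registers once, then loops over timesteps with no further weight
// traffic -- so the weight element size barely affects per-step time.
__global__ void persistent_steps(const __half* __restrict__ weights,
                                 const float* __restrict__ x,
                                 float* __restrict__ y, int num_steps) {
    constexpr int PER_THREAD = N_WEIGHTS / THREADS;
    float w[PER_THREAD];                          // register-resident copy
    for (int i = 0; i < PER_THREAD; ++i)
        w[i] = __half2float(weights[threadIdx.x * PER_THREAD + i]);
    for (int t = 0; t < num_steps; ++t) {         // compute-bound loop
        float acc = 0.f;
        for (int i = 0; i < PER_THREAD; ++i)
            acc += w[i] * x[i];
        atomicAdd(&y[t], acc);
    }
}

int main() {
    __half* w; float *x, *y;
    cudaMalloc(&w, N_WEIGHTS * sizeof(__half));
    cudaMalloc(&x, N_WEIGHTS * sizeof(float));
    cudaMalloc(&y, 16 * sizeof(float));
    streaming_step<<<1, THREADS>>>(w, x, y);
    persistent_steps<<<1, THREADS>>>(w, x, y, 16);
    cudaDeviceSynchronize();
    printf("launched both variants\n");
    return 0;
}
```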
Regarding P100 vs. V100: since this is just batch=1, performance is mostly a function of core clock. Because your P100's clock is higher than your V100's, it is expected to perform better on any model that fits into both GPUs. Since V100 is larger, it can support larger models in persistent mode. V100 will also provide higher throughput (batch size) than P100 in the single- and dual-CTA modes, as it can run more blocks in parallel.
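You can check both of the quantities that matter here with `cudaGetDeviceProperties`; a minimal standalone check:

```cuda
// Prints core clock (dominates batch=1 latency) and SM count
// (dominates batched throughput) for each visible GPU.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int n = 0;
    cudaGetDeviceCount(&n);
    for (int d = 0; d < n; ++d) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        printf("%s: %.0f MHz core clock, %d SMs\n",
               prop.name, prop.clockRate / 1000.0,  // clockRate is in kHz
               prop.multiProcessorCount);
    }
    return 0;
}
```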
I have built for Pascal/Volta with sm_60/sm_70, and it runs well on both P100 and V100. But when I keep all parameters the same and change only the precision (fp16 vs. fp32), I get similar performance in PERSISTENT mode on both P100 and V100, while performance differs substantially in the other modes (SINGLE, DUAL). So my first question is: why does this happen?
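For reference, I compile with fatbin flags along these lines (the exact invocation and source list come from the Makefile):

```
nvcc -gencode arch=compute_60,code=sm_60 \
     -gencode arch=compute_70,code=sm_70 \
     <sources>
```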
My second question is: the V100 has Tensor Cores; does this source code use them?
And my last question is: the V100 is the newer GPU, so it is supposed to be better, yet its actual performance is close to, and even a little below, the P100's. Why is that?