ROCm / hcc

HCC is an Open Source, Optimizing C++ Compiler for Heterogeneous Compute currently for the ROCm GPU Computing Platform
https://github.com/RadeonOpenCompute/hcc/wiki

increase KERNARG_BUFFER_SIZE from 512 to 4k #1377

Closed jeffdaily closed 4 years ago

jeffdaily commented 4 years ago

Decrease HCC_ASYNCOPS_SIZE from 16k to 1k. HCC_KERNARG_BUFFER_SIZE and HCC_KERNARG_POOL_SIZE are now environment variables.
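For readers unfamiliar with the pattern, here is a minimal sketch of how a runtime might read such a setting from the environment and fall back to a compiled-in default. The helper name `size_from_env` is hypothetical and this is not HCC's actual code. It also assumes the variable holds a plain decimal byte count, which the PR description does not state.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <cstdlib>

// Hypothetical helper (not taken from the HCC sources): read a size from an
// environment variable, returning `fallback` when the variable is unset,
// empty, not a plain decimal number, or zero.
std::size_t size_from_env(const char* name, std::size_t fallback) {
    assert(name != nullptr);
    const char* raw = std::getenv(name);
    // Require the value to start with a digit so signs and blanks are rejected.
    if (raw == nullptr || !std::isdigit(static_cast<unsigned char>(raw[0]))) {
        return fallback;
    }
    char* end = nullptr;
    const unsigned long long value = std::strtoull(raw, &end, 10);
    // Reject trailing garbage such as "4096abc" and meaningless zero sizes.
    if (*end != '\0' || value == 0) {
        return fallback;
    }
    return static_cast<std::size_t>(value);
}
```

With this sketch, `size_from_env("HCC_KERNARG_BUFFER_SIZE", 4096)` returns 4096 unless the variable is set to another valid decimal value.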

jeffdaily commented 4 years ago

Since this PR also reduces the asyncops size, it could replace #1261.

emankov commented 4 years ago

Justification for all the numbers is needed.

jeffdaily commented 4 years ago

@emankov

  1. CUDA's default kernarg limit is 4 KB. From https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#function-parameters : "__global__ function parameters are passed to the device via constant memory and are limited to 4 KB."
  2. The PyTorch translate model uses a number of kernels with kernargs > 512 bytes, the current default. Raising the default kernarg buffer size yields a 30% performance improvement because those kernargs are no longer allocated on demand.
  3. Since the kernarg buffer size is increased 8x, HCC_ASYNCOPS_SIZE is reduced 16x to keep worst-case memory use roughly the same, assuming two streams fully queuing to the same device.
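
The arithmetic behind point 3 can be sketched as follows. This is a rough model only. It assumes one kernarg buffer is reserved for every in-flight async op slot on every stream, and that "k" means 1024. The thread states neither, and the names below are illustrative rather than taken from the HCC sources.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t KiB = 1024;
constexpr std::size_t MiB = 1024 * KiB;

// Worst-case kernarg memory for one device, ASSUMING every async op slot on
// every stream holds one kernarg buffer (an illustrative model, not HCC code).
constexpr std::size_t worst_case_bytes(std::size_t kernarg_buffer_bytes,
                                       std::size_t asyncops_slots,
                                       std::size_t streams) {
    return kernarg_buffer_bytes * asyncops_slots * streams;
}

// Old defaults: 512-byte buffers, 16k slots, two fully queued streams.
constexpr std::size_t old_bytes = worst_case_bytes(512, 16 * KiB, 2);
// New defaults: 4k buffers, 1k slots, two fully queued streams.
constexpr std::size_t new_bytes = worst_case_bytes(4 * KiB, 1 * KiB, 2);

static_assert(old_bytes == 16 * MiB, "512 B * 16384 slots * 2 streams");
static_assert(new_bytes == 8 * MiB, "4096 B * 1024 slots * 2 streams");
```

Under this model the new worst case is half the old one, so the larger buffers do not raise the memory ceiling.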

jeffdaily commented 4 years ago

@emankov The most important change in this PR is the increase in the default kernarg buffer size. If needed, would such a change be acceptable without the other changes?

emankov commented 4 years ago

@jeffdaily, thank you for the explanation. Could you please add just a few words in the comments?

jeffdaily commented 4 years ago

@emankov comments added in commit https://github.com/RadeonOpenCompute/hcc/pull/1377/commits/f0e2b40f13086f73565de6142a792f889f29b7b9.