Closed matyas-streamhpc closed 1 month ago
BTW, it should be mentioned somewhere that to fully utilize all SIMD lines/possible threads in the block x-block size should be a multiple of warp size.
When being pedantic: the block size (x*y*z
) has to be a multiple of the warp size for full utilization
Most of these sections are very close to the performance guidelines of the cuda programming guide, sometimes almost quoting it directly. I don't think that's a good practice, especially as some parts don't apply to AMDs GPUs at all, and on top of that the cuda programmign guide does not have a permissive license from what I can tell
A better place for inspiration might be gpuopen, that already has some performance guides for e.g. rdna https://gpuopen.com/learn/rdna-performance-guide/
I am not sure that is best strategy either, but the concept was accepted as a first version. It is not quoting directly the mentioned document, but there is overlap in the content. Personally, I would have appreciated every pieces of recommendation, both in format and in content.
Nonetheless, we always have the opportunity to improve it and make the documentation better for the satisfaction of the developers.
Most of this PR changes has been merged in, while the leftover is here: #3483
BTW, it should be mentioned somewhere that to fully utilize all SIMD lines/possible threads in the block x-block size should be a multiple of warp size.