ROCm / rocWMMA
https://rocm.docs.amd.com/projects/rocWMMA/
MIT License

Advantages of Using rocWMMA over Compiler Intrinsics for CUDA to ROCm Transition #202

Closed · xinyi-li7 closed this 1 year ago

xinyi-li7 commented 1 year ago

Hello,

I'm currently in the process of transitioning from CUDA to ROCm. During this transition, I've come to understand that rocWMMA can serve as a mapping library for the warp matrix functions (nvcuda::wmma) that I used to work with in CUDA. Nevertheless, I've also noticed that there are compiler intrinsics that provide functionality quite similar to rocWMMA's (see this link).

Specifically, I've observed that we can substitute the load operations in rocWMMA with specific built-in intrinsics. This realization has led me to question why I might choose to use rocWMMA over these compiler intrinsics.

What advantages does rocWMMA provide over these compiler intrinsics? Is it more efficient in certain situations or does it offer any unique functionalities? Any insight on this would be highly appreciated.

Thank you for your assistance.

cgmillette commented 1 year ago

Hi @xinyi-li7, thanks for your question!

rocWMMA is a library intended to solve matrix multiply-accumulate problems by simplifying their block-wise decomposition into fragments. It wraps several quality-of-life improvements over the raw builtins, which provide only basic multiply-accumulate functionality. At the very lowest level, rocWMMA does indeed call builtins for each multiply-accumulate step in the K direction; however, it adds several more layers of utility.

Firstly, users transitioning from nvcuda::wmma to rocWMMA have little 'hipification' to do in terms of API syntax and functionality. The rocWMMA API goes further by mapping the block-wise decomposition of problems onto fragments of block (M, N, K) sizes. Depending on the data types and block sizes you define, rocWMMA calls the associated __builtin (there are many, one per datatype and block-size combination), unrolling it over the required iterations in the K direction.
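As a rough sketch of how small the porting gap is, a 16x16x16 fp16-to-fp32 kernel body might look like the following. This is a hypothetical example, not code from this thread: the fragment types and the load/mma/store calls mirror their nvcuda::wmma counterparts under the rocwmma namespace, but verify the exact headers and signatures against the rocWMMA documentation for your ROCm release.

```cpp
#include <rocwmma/rocwmma.hpp>

// Sketch only: porting from CUDA is largely s/nvcuda::wmma/rocwmma/ --
// fragment, fill_fragment, load_matrix_sync, mma_sync and store_matrix_sync
// keep the same shape as the nvcuda::wmma API.
__global__ void mma_16x16x16(const __half* a, const __half* b, float* c,
                             int lda, int ldb, int ldc)
{
    rocwmma::fragment<rocwmma::matrix_a, 16, 16, 16, __half, rocwmma::row_major> fragA;
    rocwmma::fragment<rocwmma::matrix_b, 16, 16, 16, __half, rocwmma::col_major> fragB;
    rocwmma::fragment<rocwmma::accumulator, 16, 16, 16, float> fragAcc;

    rocwmma::fill_fragment(fragAcc, 0.0f);
    rocwmma::load_matrix_sync(fragA, a, lda);
    rocwmma::load_matrix_sync(fragB, b, ldb);
    // One call; internally unrolls MFMA/WMMA builtins over the K direction.
    rocwmma::mma_sync(fragAcc, fragA, fragB, fragAcc);
    rocwmma::store_matrix_sync(c, fragAcc, ldc, rocwmma::mem_row_major);
}
```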

Secondly, the MFMA __builtins require specific in-register layouts for inputs and accumulation. rocWMMA guarantees that, no matter which layout your data arrives in (row_major or col_major), fragments with tile sizes of 16 and 32 are always 'MFMA friendly' and correctly mapped to the MFMA input and accumulation requirements. Out of the box, there is no need to rearrange data in LDS to make it MFMA friendly.

Thirdly, MFMA builtin support differs between the gfx908 and gfx90a architectures. Moreover, the WMMA builtins on the gfx11xx architectures offer functionality similar to MFMA but require different layouts and support different datatypes and block sizes. rocWMMA is portable across gfx908, gfx90a, and gfx11xx, with clearly defined differences in supported types and block sizes. The library hides a lot of the nasty portability details if you are targeting your application at multiple GPU architectures.

Lastly, rocWMMA offers a variety of samples, both simplified and performance-oriented, covering different algorithms you can explore. High-performance GEMMs can be achieved with rocWMMA using techniques such as multi-wave collaboration and data sharing in LDS, many of which are exposed through the library API.

Similar to other libraries, rocWMMA is intended to save the user development time and improve their experience by solving many technical and development challenges under the hood. I hope to have shown that rocWMMA is an embellishment on top of the MFMA and WMMA __builtins: it automates, and insulates the user from, many of the otherwise time-consuming details of maintaining portability and consistency across several architectures, all through a single API.

Cheers!

cgmillette commented 1 year ago

And a second note: the __builtin interfaces for MFMA and WMMA are subject to change between releases, whereas the rocWMMA API will remain compatible.

xinyi-li7 commented 1 year ago

Hi @cgmillette, I appreciate the comprehensive explanation you provided—it's been truly helpful!

I'd like to confirm a couple of things: Is rocWMMA only available on ROCm 5.4.0? And does the present version of Hipify support the transition from nvcuda::wmma to rocWMMA?

Many thanks!

cgmillette commented 1 year ago

First public release for rocWMMA was with ROCm 5.2. I recommend the latest ROCm release to keep up to date with bug fixes and performance improvements.

At this time, there is no automation through the 'Hipify' tools; however, the API calls and interfaces are very similar between nvcuda::wmma and rocWMMA. Just note carefully the differences in block size and type support between the two.

cgmillette commented 1 year ago

> Hi @cgmillette, I appreciate the comprehensive explanation you provided—it's been truly helpful!
>
> I'd like to confirm a couple of things: Is rocWMMA only available on ROCm 5.4.0? And does the present version of Hipify support the transition from nvcuda::wmma to rocWMMA?
>
> Many thanks!

You're very welcome!

xinyi-li7 commented 1 year ago

Thank you!