ROCm / rocBLAS

Next generation BLAS implementation for ROCm platform
https://rocm.docs.amd.com/projects/rocBLAS/en/latest/

[Feature]: Support equivalent of cublasCherkEx() #1320

Open torrance opened 1 year ago

torrance commented 1 year ago

Is your feature request related to a problem? Please describe.

It's common to have large, low-precision input matrices that you'd like to multiply at full internal precision using rocblas<t>gemm(), possibly (but not necessarily) with output at full precision.

Describe the solution you'd like

Support the equivalent of cublas<t>gemmEx() as described here: https://docs.nvidia.com/cuda/cublas/#cublas-gemmEx

Describe alternatives you've considered

An alternative is to copy the input matrices to double precision first. If the output is not required at full precision, a further copy must be made and the precision truncated. This alternative doubles memory pressure on the GPU and causes extra copying of memory.
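To illustrate what the request amounts to, here is a minimal CPU sketch of a mixed-precision GEMM: inputs stay in their narrow storage type, each element is widened only inside the inner loop, and the result is converted to the output type once, on the final store. The name `gemm_mixed` and its template parameters are hypothetical and are not part of rocBLAS or cuBLAS; column-major storage is assumed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Column-major C = alpha * A * B + beta * C.
// A is m x k and B is k x n, both stored as In. All products and sums are
// carried out in Acc, and the result is converted to Out on the final store.
// Hypothetical helper for illustration only.
template <typename In, typename Acc, typename Out>
void gemm_mixed(std::size_t m, std::size_t n, std::size_t k,
                Acc alpha, const std::vector<In>& A, const std::vector<In>& B,
                Acc beta, std::vector<Out>& C)
{
    assert(A.size() == m * k && B.size() == k * n && C.size() == m * n);
    for (std::size_t j = 0; j < n; ++j) {
        for (std::size_t i = 0; i < m; ++i) {
            Acc sum = Acc(0);
            for (std::size_t p = 0; p < k; ++p) {
                // Widen on the fly: no widened copy of A or B is ever stored.
                sum += static_cast<Acc>(A[i + p * m]) * static_cast<Acc>(B[p + j * k]);
            }
            const Acc old = static_cast<Acc>(C[i + j * m]);
            C[i + j * m] = static_cast<Out>(alpha * sum + beta * old);
        }
    }
}
```

Instantiating it as `gemm_mixed<float, double, float>` corresponds to the case described above (low-precision storage, full-precision arithmetic, truncated output) without the extra widened copies of the inputs.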

TorreZuk commented 1 year ago

Thanks for your report, @torrance. rocBLAS supports the equivalent of cublas<t>gemmEx() through the function rocblas_gemm_ex, described here: https://rocm.docs.amd.com/projects/rocBLAS/en/latest/API_Reference_Guide.html#rocblas-gemm-ex-batched-strided-batched It implements numerous mixed-precision and high-precision accumulation (HPA) variants, so please review it. If one you require is missing, please provide a list of the specific missing data types for inputs, output, and compute, in order of priority (describing your use case is also helpful). Based on your feedback we can consider adding more, but the most common forms should already be implemented.

torrance commented 1 year ago

@TorreZuk Thank you! HIPIFY complained there was no suitable equivalent and I clearly didn't spend long enough verifying that.

If I can hijack my own issue (!), what about a hipblas/rocblas equivalent to cublasCherkEx()? My searching of the documentation (as well as HIPIFY) seems to suggest there isn't one, and it's a bit of a sticking point in the conversion of this codebase.

TorreZuk commented 1 year ago

Sure, we can recycle this issue as a request for an equivalent to cublasCherkEx(), which is a new feature request. I can ask whether @emankov has any insights into the HIPIFY mapping of cublas<t>gemmEx() to rocblas_gemm_ex; perhaps the argument datatype enums all have to be chosen manually?

amcamd commented 1 year ago

Hello @torrance, cublasCherkEx() supports the CUDA_C_8I datatype for matrix A. This is a complex number made up of two 8-bit signed integers. I have some questions about this datatype:

Thanks, Andrew

torrance commented 1 year ago

Hi @amcamd

> Can you say what application is using this CUDA_C_8I datatype? Real 8-bit integers are used in machine learning; what is the use case for complex 8-bit integers?

Yes, they are needed. Lots of radio astronomy correlators record observations of the sky as simple 8-bit complex integers, which can later be normalised as part of calibration. The 8-bit integer representation has the advantage of constant deltas between adjacent values, unlike a floating-point representation. At the high end, we let the integer representation 'saturate' and later flag these values. The values are also necessarily complex, since radio astronomy works in the Fourier domain.
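As a rough illustration of the "constant deltas" and "saturate, then flag" behaviour described above, here is a small CPU sketch. It assumes a unit quantisation step and that any sample sitting on either end of the int8 range is treated as possibly clipped; both the function names and that flagging rule are assumptions for the example, not a description of any particular correlator.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <limits>

// Clamp to the int8 range, then round to the nearest integer level.
// Levels are evenly spaced (step of 1), unlike floating-point values.
inline std::int8_t quantize_saturating(float v)
{
    const float clamped = std::max(-128.0f, std::min(127.0f, v));
    return static_cast<std::int8_t>(std::lround(clamped));
}

// Assumed flagging rule: a sample on either rail may have been clipped.
inline bool is_saturated(std::int8_t s)
{
    return s == std::numeric_limits<std::int8_t>::min() ||
           s == std::numeric_limits<std::int8_t>::max();
}
```

A complex sample would apply the same treatment to its real and imaginary components independently.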

We want to avoid converting these to higher precision values because these values make up the raw data of our observations and are absolutely massive in size.
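For reference, the operation being asked for is a Hermitian rank-k update, C = alpha * A * A^H + beta * C, with A stored as 8-bit complex integers and C as single-precision complex. The CPU sketch below shows the no-transpose, lower-triangle case only. The `ComplexI8` struct and `herk_i8_reference` function are hypothetical names for illustration and are not rocBLAS, hipBLAS, or cuBLAS APIs; column-major storage is assumed.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for one 8-bit complex element: two signed bytes.
struct ComplexI8 {
    std::int8_t re;
    std::int8_t im;
};

// CPU reference for C = alpha * A * A^H + beta * C, column-major.
// A is n x k (8-bit complex), C is n x n (complex float, Hermitian).
// Only the lower triangle of C is read and written in this sketch.
void herk_i8_reference(std::size_t n, std::size_t k, float alpha,
                       const std::vector<ComplexI8>& A, float beta,
                       std::vector<std::complex<float>>& C)
{
    assert(A.size() == n * k && C.size() == n * n);
    for (std::size_t j = 0; j < n; ++j) {
        for (std::size_t i = j; i < n; ++i) {
            std::complex<float> sum(0.0f, 0.0f);
            for (std::size_t p = 0; p < k; ++p) {
                const ComplexI8 a = A[i + p * n];
                const ComplexI8 b = A[j + p * n];
                // Widen each component on the fly and form a * conj(b).
                const std::complex<float> x(static_cast<float>(a.re),
                                            static_cast<float>(a.im));
                const std::complex<float> y(static_cast<float>(b.re),
                                            -static_cast<float>(b.im));
                sum += x * y;
            }
            std::complex<float> out = alpha * sum + beta * C[i + j * n];
            if (i == j) {
                out.imag(0.0f);  // the diagonal of a Hermitian matrix is real
            }
            C[i + j * n] = out;
        }
    }
}
```

The raw 8-bit data is read in place and never expanded into a float copy, which is the property that matters for very large observation sets.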

Hope this helps give some context.

amcamd commented 1 year ago

Hi @torrance, thank you for the context and the use case. I was guessing this is related to radio astronomy and the installations you have in Western Australia.