FR: generate MUL:2, MUL:4, DIV:2 for VOP3 instructions (OpenCL performance) #1405

ROCm / ROCm

Open preda opened 3 years ago

preda commented 3 years ago

A function such as: double sum2(double x, double y) { return 2 * (x + y); } could be compiled to a single VOP3 GCN instruction such as:

v_add_f64 %0, %1, %2 MUL:2

But this efficient code is not generated because MUL:2 and the like only function correctly with denormals disabled and in non-IEEE mode. Denormals and IEEE mode can be turned off thus:

// turn IEEE mode and denormals off so that mul:2 and div:2 work
#define ENABLE_MUL2() { \
    __asm volatile ("s_setreg_imm32_b32 hwreg(HW_REG_MODE, 9, 1), 0");\
    __asm volatile ("s_setreg_imm32_b32 hwreg(HW_REG_MODE, 4, 4), 7");\
}
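
To make the two register writes above easier to follow, here is a minimal host-side sketch in plain C that models what each `s_setreg_imm32_b32 hwreg(HW_REG_MODE, offset, size), imm` does to a 32-bit MODE value: it overwrites `size` bits starting at bit `offset` with the low bits of `imm`. The helper names `setreg_bits` and `enable_mul2_mode` are hypothetical, and the field meanings (bit 9 as the IEEE bit, bits 4 to 7 as the denormal control field) are assumptions taken from the macro's comment, not verified here against a specific ISA manual.

```c
#include <assert.h>
#include <stdint.h>

/* Model of "s_setreg_imm32_b32 hwreg(reg, offset, size), imm":
 * replace `size` bits of `reg_value`, starting at bit `offset`,
 * with the low `size` bits of `imm`. All other bits are preserved. */
static uint32_t setreg_bits(uint32_t reg_value, unsigned offset,
                            unsigned size, uint32_t imm)
{
    assert(size >= 1 && size <= 32 && offset + size <= 32);
    uint32_t field = (size == 32) ? 0xFFFFFFFFu : ((1u << size) - 1u);
    uint32_t mask = field << offset;
    return (reg_value & ~mask) | ((imm << offset) & mask);
}

/* Apply the two writes from ENABLE_MUL2() to a modelled MODE value. */
static uint32_t enable_mul2_mode(uint32_t mode)
{
    mode = setreg_bits(mode, 9, 1, 0); /* assumed IEEE bit: cleared */
    mode = setreg_bits(mode, 4, 4, 7); /* assumed denormal field: 0b0111 */
    return mode;
}
```

For instance, `enable_mul2_mode(0x3F0)` clears bit 9 and rewrites bits 4 to 7, giving `0x170`, while bits outside those two fields are left untouched.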

Feature request: please provide an OpenCL compilation flag that enables MUL:2 and the like, at the same time disabling denormals and IEEE mode as required. This would allow a developer to choose between the two good things: denormals on one side, and more performance on the other side (by making better use of the power of VOP3 instructions).
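
As a rough illustration of what a developer would give up by choosing the faster mode, the sketch below compares `sum2` under ordinary IEEE arithmetic with a simple flush-to-zero model in which every subnormal input and result is replaced by zero. This is only a host-side emulation in plain C under that stated assumption; `flush_subnormal`, `sum2_flushed`, and `check_sum2_models` are hypothetical names, and the real hardware behaviour may differ in details.

```c
#include <assert.h>
#include <float.h>

/* Ordinary IEEE evaluation: subnormal values are kept. */
static double sum2_ieee(double x, double y)
{
    return 2.0 * (x + y);
}

/* Assumed flush-to-zero model: a nonzero value smaller in magnitude
 * than DBL_MIN (i.e. a subnormal) becomes a zero of the same sign.
 * NaN fails every comparison and is returned unchanged. */
static double flush_subnormal(double v)
{
    if (v != 0.0 && v > -DBL_MIN && v < DBL_MIN)
        return (v < 0.0) ? -0.0 : 0.0;
    return v;
}

/* sum2 with inputs, the intermediate sum, and the result all flushed. */
static double sum2_flushed(double x, double y)
{
    double s = flush_subnormal(flush_subnormal(x) + flush_subnormal(y));
    return flush_subnormal(2.0 * s);
}

/* Small self-check showing where the two models agree and differ. */
static void check_sum2_models(void)
{
    /* Normal-range inputs: both models give the same answer. */
    assert(sum2_ieee(1.5, 2.25) == sum2_flushed(1.5, 2.25));
    /* Subnormal inputs: IEEE keeps a tiny nonzero result, the
     * flush-to-zero model returns zero. */
    assert(sum2_ieee(DBL_MIN / 4, DBL_MIN / 8) != 0.0);
    assert(sum2_flushed(DBL_MIN / 4, DBL_MIN / 8) == 0.0);
}
```

The two versions differ only for values near the bottom of the double range, which is why a per-compilation opt-in flag seems like a reasonable place to make this choice.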

ROCmSupport commented 3 years ago

Thanks @preda for reaching out. I will pass this information to the compiler team and ask for their input. Thank you.

b-sumner commented 3 years ago

clang/LLVM changes to enable this are now upstream. The compiler options -mno-amdgpu-ieee and -fno-honor-nans are both required to enable such folding.

preda commented 3 years ago

> clang/LLVM changes to enable this are now upstream. The compiler options -mno-amdgpu-ieee and -fno-honor-nans are both required to enable such folding.

@b-sumner This is great news, thank you. What is the way to enable this with OpenCL? Will OpenCL accept those flags directly, or are there other OpenCL flags that get translated to clang's -mno-amdgpu-ieee and -fno-honor-nans?

preda commented 3 years ago

See also the related issue #967.

preda commented 1 year ago

I'm coming back with this request:

Offer a way to enable the GCN Output Modifiers ("OMOD") such as MUL:2 in OpenCL.

Now that LLVM supports generating GCN OMODs (as per @b-sumner's comment above), how can we make use of that in OpenCL?

abhimeda commented 8 months ago

@preda Hi, does this issue still persist on the latest version of ROCm? If not, can we close this ticket?

preda commented 8 months ago

I'm using OpenCL, and AFAIK there is still no way to take advantage of VOP3 modifiers such as MUL:2 in the OpenCL compilation.

OTOH, according to @b-sumner's comment above, the issue should be fixed for clang/LLVM by using -mno-amdgpu-ieee and -fno-honor-nans.

But there does not seem to be a way to pass those compiler flags from OpenCL, so as far as this issue is concerned (note "OpenCL performance" in the issue title), this is not fixed.

ppanchad-amd commented 3 months ago

@preda I will check with the internal team and get back to you. Thanks!