ROCm / clr


Why doesn't the mad instruction have a carry-in? #6

Closed: qiji2023 closed this issue 7 months ago

qiji2023 commented 10 months ago

This is very ridiculous!

cjatin commented 10 months ago

Which mad instruction (GCN ISA/fma)? What are you trying to run? Can you provide a sample? What behavior were you expecting?

qiji2023 commented 10 months ago

@cjatin On RDNA, v_mad_u64_u32: why doesn't it have a carry-in? For example:

a1, c1 (carry out) = b1 + d1 + c0 (carry in)
a2, c2 (carry out) = b2 + d2 + c1 (carry in)
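For readers following along, the carry chain in the example above can be sketched in portable C++. This is only an illustrative host-side model: the names `AddResult`, `add_with_carry`, and `add_limbs` are hypothetical and do not correspond to any HIP or ISA API, and the limbs are assumed to be 32-bit, little-endian, with both operands the same length.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: one step of the chain, a, c_out = b + d + c_in.
struct AddResult {
    uint32_t sum;        // low 32 bits of b + d + carry_in
    uint32_t carry_out;  // 0 or 1
};

AddResult add_with_carry(uint32_t b, uint32_t d, uint32_t carry_in) {
    // Widen to 64 bits so the carry ends up in bit 32.
    uint64_t wide = static_cast<uint64_t>(b) + d + carry_in;
    return {static_cast<uint32_t>(wide), static_cast<uint32_t>(wide >> 32)};
}

// Adds two little-endian multi-limb integers of equal length.
// The result has one extra limb to hold the final carry.
std::vector<uint32_t> add_limbs(const std::vector<uint32_t>& b,
                                const std::vector<uint32_t>& d) {
    std::vector<uint32_t> a(b.size() + 1, 0);
    uint32_t carry = 0;  // c0
    for (std::size_t i = 0; i < b.size(); ++i) {
        AddResult r = add_with_carry(b[i], d[i], carry);
        a[i] = r.sum;
        carry = r.carry_out;  // the carry out of limb i is the carry in of limb i+1
    }
    a[b.size()] = carry;
    return a;
}
```

Each limb consumes the carry produced by the previous one, which is the dependency the question asks the mad instruction to support directly.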

cjatin commented 7 months ago

Not a HIP issue.

yxsamliu commented 7 months ago

This issue would be better raised against the AMDGPU backend on the llvm-project GitHub (https://github.com/llvm/llvm-project/issues), in the hope that the hardware limitation gets reported through the right channel.

Based on the ISA manual:

https://www.amd.com/content/dam/amd/en/documents/radeon-tech-docs/instruction-set-architectures/rdna3-shader-instruction-set-architecture-feb-2023_0.pdf

The instruction does not have a carry-in. This may be a design decision driven by the cost/performance balance, since supporting a carry-in on mad would increase its footprint on the chip. You can still use add/multiply to handle cases that need a carry-in.
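A minimal sketch of that workaround, in host-side C++. It assumes the instruction computes a 32x32-bit product plus a 64-bit addend and reports a carry out, which is how the ISA manual's description of v_mad_u64_u32 reads; treat that as an assumption and check the manual. The names `MadResult`, `mad_u64_u32`, and `mad_u64_u32_with_carry_in` are hypothetical models, not HIP intrinsics.

```cpp
#include <cstdint>

struct MadResult {
    uint64_t value;   // low 64 bits of the result
    bool carry_out;   // true if the true result did not fit in 64 bits
};

// Model of a mad without carry-in: value = a * b + c (mod 2^64).
MadResult mad_u64_u32(uint32_t a, uint32_t b, uint64_t c) {
    uint64_t prod = static_cast<uint64_t>(a) * b;  // always fits in 64 bits
    uint64_t sum = prod + c;                       // may wrap around
    return {sum, sum < prod};                      // wrapped iff sum < prod
}

// Emulated carry-in: one mad followed by one extra add.
MadResult mad_u64_u32_with_carry_in(uint32_t a, uint32_t b, uint64_t c,
                                    bool carry_in) {
    MadResult r = mad_u64_u32(a, b, c);
    uint64_t sum = r.value + (carry_in ? 1u : 0u);  // the extra add
    bool wrapped = sum < r.value;
    // a*b + c + 1 is always below 2^65, so at most one of the two
    // steps can wrap; OR-ing the flags gives the combined carry out.
    return {sum, r.carry_out || wrapped};
}
```

The second function is where the extra instruction comes from: the carry from a previous step has to be folded in with a separate add rather than inside the mad itself.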

How much performance could be gained if the mad instruction had a carry-in?

qiji2023 commented 7 months ago

For big-integer multiplication. With the current instruction I need mad -> add -> mad -> ..., which takes 6 clk per operation. If the mad instruction had a carry-in, I would only need mad -> mad -> ..., which takes only 4 clk per operation.

I could gain at least a 50% performance improvement.
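One plausible reading of the mad -> add -> mad pattern is a column-wise (product-scanning) big-integer multiply, sketched below in host-side C++. This is an assumption about the workload, not the poster's actual kernel. The function name `bigint_mul` and the 32-bit little-endian limb layout are hypothetical, and the clock counts quoted above are not modeled here.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Multiplies two little-endian big integers made of 32-bit limbs.
std::vector<uint32_t> bigint_mul(const std::vector<uint32_t>& a,
                                 const std::vector<uint32_t>& b) {
    const std::size_t n = a.size();
    const std::size_t m = b.size();
    std::vector<uint32_t> out(n + m, 0);
    if (n == 0 || m == 0) return out;

    uint64_t acc_lo = 0;  // low 64 bits of the running column sum
    uint32_t acc_hi = 0;  // overflow above bit 63

    for (std::size_t k = 0; k + 1 < n + m; ++k) {
        // All pairs (i, k - i) with both indices in range.
        const std::size_t i_min = (k >= m) ? k - m + 1 : 0;
        const std::size_t i_max = (k < n) ? k : n - 1;
        for (std::size_t i = i_min; i <= i_max; ++i) {
            uint64_t prod = static_cast<uint64_t>(a[i]) * b[k - i];
            uint64_t sum = prod + acc_lo;     // the "mad": a*b + 64-bit addend
            acc_hi += (sum < prod) ? 1u : 0u; // the separate "add" for the carry out
            acc_lo = sum;
        }
        out[k] = static_cast<uint32_t>(acc_lo);
        // Shift the accumulator right by one limb for the next column.
        acc_lo = (acc_lo >> 32) | (static_cast<uint64_t>(acc_hi) << 32);
        acc_hi = 0;
    }
    out[n + m - 1] = static_cast<uint32_t>(acc_lo);
    return out;
}
```

In this sketch every multiply-accumulate is followed by a separate carry add into `acc_hi`. That second step is the one a carry-aware mad chain would aim to remove.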