ROCm / hcc

HCC is an Open Source, Optimizing C++ Compiler for Heterogeneous Compute currently for the ROCm GPU Computing Platform
https://github.com/RadeonOpenCompute/hcc/wiki

Optimization fails to benefit from amdgcn instruction set in some cases #858

Open misos1 opened 6 years ago

misos1 commented 6 years ago

Here is a sample implementation of the sfc32 PRNG from PractRand:

hc::array_view<unsigned int> state(4);
// seed state somehow
parallel_for_each(hc::extent<1>(1), [=](hc::index<1> i) [[hc]]
{
    unsigned int a = state[0], b = state[1], c = state[2], counter = state[3];

    unsigned int tmp = a + b + counter++;
    a = b ^ (b >> 9);
    b = c + (c << 3);
    c = ((c << 21) | (c >> (32 - 21))) + tmp;

    state[0] = a; state[1] = b; state[2] = c; state[3] = counter;
});

The value written to tmp is used as the next "random" 32-bit integer.
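For reference, the same round can be written as a plain host-side C++ sketch (this is my port of the kernel body above into an ordinary struct, not HC code):

```cpp
#include <cstdint>

// Host-side C++ port of the sfc32 step above (a sketch, not the HC kernel).
struct Sfc32 {
    uint32_t a, b, c, counter;

    uint32_t next() {
        uint32_t tmp = a + b + counter++;
        a = b ^ (b >> 9);
        b = c + (c << 3);
        c = ((c << 21) | (c >> (32 - 21))) + tmp;
        return tmp;  // tmp is the next "random" 32-bit output
    }
};
```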

Here is how hcc compiles it:

v_add_u32_e32 v2, v3, v2         ; a + b + counter++
v_mul_lo_i32 v8, v4, 9           ; c + (c << 3)
v_lshrrev_b32_e32 v6, 9, v3      ; b ^ (b >> 9)
v_alignbit_b32 v4, v4, v4, 11    ; ((c << 21) | (c >> (32 - 21))) + tmp
v_add_u32_e32 v2, v2, v5         ; a + b + counter++
v_add_u32_e32 v7, 1, v5          ; a + b + counter++;
v_xor_b32_e32 v3, v6, v3         ; b ^ (b >> 9)
v_add_u32_e32 v2, v2, v4         ; ((c << 21) | (c >> (32 - 21))) + tmp

That is 8 instructions executed in 11 cycles to generate one "random" 32-bit integer.

It is nice that the optimizer knows v_alignbit can be used for bit rotations, although an expression with two distinct operands, like "(a << 21) | (b >> (32 - 21))", is not recognized and compiles into 3 instructions instead of a single v_alignbit.
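For illustration, v_alignbit_b32 can be modeled on the host as a funnel shift, which makes it easy to check that both the single-operand rotate and the two-operand pattern fit the instruction (alignbit and rotl32 below are hypothetical helpers written for this sketch; the semantics follow my reading of the GCN ISA docs):

```cpp
#include <cstdint>

// Model of v_alignbit_b32 d, s0, s1, s2: form the 64-bit value {s0:s1}
// and extract 32 bits starting at bit position s2 (low 5 bits of s2).
uint32_t alignbit(uint32_t s0, uint32_t s1, uint32_t s2) {
    uint64_t wide = (uint64_t(s0) << 32) | s1;
    return uint32_t(wide >> (s2 & 31u));
}

// Rotate left by n (valid for 0 < n < 32).
uint32_t rotl32(uint32_t x, unsigned n) {
    return (x << n) | (x >> (32 - n));
}

// With s0 == s1 the instruction is a rotate:
//   alignbit(x, x, 11) == rotl32(x, 21)
// With distinct operands it is exactly the unrecognized pattern:
//   alignbit(a, b, 11) == (a << 21) | (b >> (32 - 21))
```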

Another nice optimization would be folding "c + (c << 3)" into a single multiplication instruction, except that on AMD GPUs 32-bit multiplication still has 4x lower throughput. A shift plus an addition would be 2x faster, and best of all would be v_lshl_add_u32, which combines a left shift and an add with a third operand, so the whole kernel could run as 8 instructions in 8 cycles.
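The equivalence behind that folding is easy to check on the host; both forms wrap identically under unsigned 32-bit arithmetic (the mul9_* names are hypothetical helpers for this sketch):

```cpp
#include <cstdint>

// "c + (c << 3)" is c * 9; v_lshl_add_u32 would compute the
// shift-and-add form in a single instruction.
uint32_t mul9_mul(uint32_t c)     { return c * 9u; }
uint32_t mul9_shl_add(uint32_t c) { return c + (c << 3); }
```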

What about using v_add3_u32 somewhere? On AMD GCN the expression "a + b + counter" can be computed in a single cycle:

v_add3_u32_e32 v2, v3, v2, v5    ; a + b + counter++
v_lshl_add_u32 v8, v4, 3, v4     ; c + (c << 3)
v_lshrrev_b32_e32 v6, 9, v3      ; b ^ (b >> 9)
v_alignbit_b32 v4, v4, v4, 11    ; ((c << 21) | (c >> (32 - 21))) + tmp
v_add_u32_e32 v7, 1, v5          ; a + b + counter++;
v_xor_b32_e32 v3, v6, v3         ; b ^ (b >> 9)
v_add_u32_e32 v2, v2, v4         ; ((c << 21) | (c >> (32 - 21))) + tmp

Now we have 7 instructions in 7 cycles, which is about 1.57x faster than the first code (although this code is slightly larger, by 4 bytes).

It seems the optimizer does not know about other instructions that combine two operations, such as v_add_lshl, v_lshl_or, v_and_or, v_or3, v_xad, v_bfi, and more (can it use VOP3 input modifiers?).
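The semantics of some of those fused 3-operand ops can be sketched in C++ to show the patterns the optimizer would need to match (operand order here is an assumption based on my reading of the GCN3/Vega ISA docs, not a definitive reference):

```cpp
#include <cstdint>

// Sketches of some GCN fused 3-operand VOP3 ops (assumed semantics):
uint32_t v_add3_u32(uint32_t s0, uint32_t s1, uint32_t s2)     { return s0 + s1 + s2; }
uint32_t v_lshl_add_u32(uint32_t s0, uint32_t s1, uint32_t s2) { return (s0 << (s1 & 31u)) + s2; }
uint32_t v_add_lshl_u32(uint32_t s0, uint32_t s1, uint32_t s2) { return (s0 + s1) << (s2 & 31u); }
uint32_t v_lshl_or_b32(uint32_t s0, uint32_t s1, uint32_t s2)  { return (s0 << (s1 & 31u)) | s2; }
uint32_t v_and_or_b32(uint32_t s0, uint32_t s1, uint32_t s2)   { return (s0 & s1) | s2; }
uint32_t v_or3_b32(uint32_t s0, uint32_t s1, uint32_t s2)      { return s0 | s1 | s2; }
```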

With some modifications this could be done in only 5 instructions, but that would already be a different function.

b-sumner commented 6 years ago

There is work underway on this.