This is a sample implementation of the PRNG sfc32 from PractRand:
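The listing itself appears to be missing here; below is a sketch of sfc32 as it is commonly published. The state names a, b, c, counter and the constants (rotate by 21, right shift by 9, left shift by 3) match the discussion that follows, but this is a reconstruction, not necessarily the exact code the post showed:

```c
#include <stdint.h>

typedef struct { uint32_t a, b, c, counter; } sfc32_state;

/* One step of sfc32: mixes the state and returns the next 32-bit output. */
static uint32_t sfc32(sfc32_state *s) {
    uint32_t tmp = s->a + s->b + s->counter++;   /* the output word */
    s->a = s->b ^ (s->b >> 9);
    s->b = s->c + (s->c << 3);                   /* equivalent to c * 9 */
    s->c = ((s->c << 21) | (s->c >> 11)) + tmp;  /* rotate left by 21, add tmp */
    return tmp;
}
```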
Whatever is assigned to tmp is used as the next "random" 32-bit integer.
Here is how hcc compiles it:
8 instructions executed in 11 cycles to generate one "random" 32-bit integer.
It is nice that the optimizer knows v_alignbit can be used for bit rotations, although something like "(a << 21) | (b >> (32 - 21))" is not recognized and compiles into 3 instructions instead of a single v_alignbit.
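To illustrate the difference (hypothetical function names, not from the post): with a single variable the pattern is a rotate, which the compiler recognizes; with two distinct inputs the same shape is a funnel shift, which v_alignbit_b32 could also compute in one instruction but reportedly does not get matched:

```c
#include <stdint.h>

/* Recognized as a rotate-left by 21 (single v_alignbit on GCN, per the text). */
uint32_t rotl21(uint32_t x) {
    return (x << 21) | (x >> (32 - 21));
}

/* Same shape with two different inputs: a funnel shift. v_alignbit_b32 can
 * do this in one instruction too, but it reportedly compiles to 3. */
uint32_t funnel21(uint32_t a, uint32_t b) {
    return (a << 21) | (b >> (32 - 21));
}
```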
Another nice optimization would be turning "c + (c << 3)" into a single multiplication instruction, if this were not an AMD GPU, where 32-bit multiplication still has 4x lower throughput. Better is a shift plus an addition, which is 2x faster. Best would be v_lshl_add_u32, which combines a left shift and an add with a third operand, so there could be 8 instructions in 8 cycles.
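The equivalence in question, sketched with hypothetical helper names: "c + (c << 3)" is just c * 9, so a compiler could emit a 32-bit multiply (slow on GCN), a shift followed by an add (2 instructions), or one v_lshl_add_u32 computing (c << 3) + c:

```c
#include <stdint.h>

/* Shift-and-add form: maps to one v_lshl_add_u32 on GCN, per the text. */
uint32_t times9_shift(uint32_t c) { return c + (c << 3); }

/* Multiply form: same result, but 32-bit multiply has lower throughput. */
uint32_t times9_mul(uint32_t c) { return c * 9u; }
```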
What about using v_add3_u32 somewhere? On AMD GCN the expression "a + b + counter" can be computed in a single cycle:
Now we have 7 instructions in 7 cycles, which is about 1.57x faster than the first code (although this code is 4 bytes bigger).
It seems the optimizer does not know about the other instructions that combine two operations, such as v_add_lshl, v_lshl_or, v_and_or, v_or3, v_xad, v_bfi and more (and can it use VOP3 input modifiers?).
With some modifications it could be done using only 5 instructions, but that would already be a different function.