andreas-abel / nanoBench

A tool for running small microbenchmarks on recent Intel and AMD x86 CPUs.
http://www.uops.info
GNU Affero General Public License v3.0

cmov missing 1->1 latency #9

Status: Closed (travisdowns closed 4 years ago)

travisdowns commented 4 years ago

I noticed today that cmov is missing latency from operand 1 -> 1.

For cmov, the first operand is read-write, like say add.

andreas-abel commented 4 years ago

I think this one is debatable.

According to Intel's XED (https://github.com/intelxed/xed/blob/master/datafiles/xed-isa.txt), on which the uops.info benchmark generator is based, the first operand is cw (conditional write). This is certainly different from add, which has rw. It is also different from, e.g., shld, which has rcw.

According to the instruction set reference, the operation of cmov is

temp ← SRC
IF condition TRUE
   THEN
      DEST ← temp;
   FI;
ELSE
   IF (OperandSize = 32 and IA-32e mode active)
      THEN
         DEST[63:32] ← 0;
   FI;
FI;

which never reads the first operand (i.e., DEST).

travisdowns commented 4 years ago

The first operand is read because it is not unconditionally overwritten (see the ELSE case). This is no different from, say, setcc, which implicitly reads the higher bytes, or any of the many SSE instructions that may leave the destination value unchanged for some inputs.
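To make the "not unconditionally overwritten" point concrete, here is a minimal C sketch of the architectural result of a 32-bit cmov in 64-bit mode, following the pseudocode quoted above. The function name `cmov32_model` and its parameters are made up for illustration; this models only the resulting register value, not how any CPU implements the instruction.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of "cmovcc r32, r/m32" in 64-bit mode.
 * dest_old: full 64-bit destination register before the instruction
 * src:      32-bit source operand
 * cond:     whether the cmov condition is true
 * Returns the full 64-bit destination register after the instruction. */
uint64_t cmov32_model(uint64_t dest_old, uint32_t src, bool cond)
{
    if (cond) {
        /* DEST <- temp; a 32-bit write zero-extends into the upper half. */
        return (uint64_t)src;
    }
    /* ELSE case: DEST[63:32] <- 0, the low 32 bits keep their old value. */
    return dest_old & 0xFFFFFFFFu;
}
```

In the `cond == false` path the result is a function of `dest_old`, so two calls that differ only in `dest_old` can return different values. That is the sense in which the destination is also an input.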

I am not sure what the significance of the cw thing in XED is, but I don't think it changes how the instruction is implemented: in an out of order processor with renaming it will be executed like any other ALU op: the destination will have a new physical register allocated for it, and the uop itself behaves "as if" it writes either the second arg or the first arg back into the destination. I.e., it behaves like a 2-arg op with a destructive output.

In particular, it never behaves as if the output doesn't have a dependency on the first arg (that would be great though). Even if the move always occurs, it behaves (performance-wise) as if there was a dependency on the first arg.

In any case, you don't have to agree: at least you agree that sometimes there is a dependency on the first input (the not-move case), so one could create a test that covers both cases, in case they one day turn out to differ, or just test the case where the move doesn't happen.

This came up because cmovbe has an interesting performance profile: it has 2 uops, but the 2 -> 1 and 1 -> 1 latencies are both only 1, because the first uop only uses flags as input, so it is "out of line" (off the critical path) in a series of repeated cmov.
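The cmovbe observation can be sketched with a toy timing model. Everything here is an assumption for illustration: each uop takes 1 cycle, flags and the source register are ready at cycle 0 and never change, execution resources are unlimited, and the second uop is assumed to read the temporary, the source, and the old destination. The function names are hypothetical, and this is not a description of any specific microarchitecture.

```c
/* Toy model; the 1-cycle uop latency is an assumption. */
enum { UOP_LATENCY = 1 };

static int max2(int a, int b) { return a > b ? a : b; }

/* Cycle at which the destination is ready after n back-to-back
 * "cmovbe r1, r2" instructions, when the first uop reads only flags. */
int chain_ready_split(int n)
{
    int flags_ready = 0, src_ready = 0, dest_ready = 0;
    for (int i = 0; i < n; i++) {
        /* uop A: depends only on flags, so it never waits on the chain. */
        int tmp_ready = flags_ready + UOP_LATENCY;
        /* uop B: depends on uop A, the source, and the old destination. */
        dest_ready = max2(max2(tmp_ready, src_ready), dest_ready) + UOP_LATENCY;
    }
    return dest_ready;
}

/* Hypothetical contrast: both uops sit in series on the destination chain. */
int chain_ready_serial(int n)
{
    int dest_ready = 0;
    for (int i = 0; i < n; i++) {
        dest_ready += 2 * UOP_LATENCY;
    }
    return dest_ready;
}
```

Under these assumptions, `chain_ready_split(n)` grows by one cycle per added instruction, while `chain_ready_serial(n)` grows by two, which matches the "2 uops but latency 1" picture described above.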

See also the discussion on the bottom of this question.

andreas-abel commented 4 years ago

I agree that it would make sense to test this behavior.

> in an out of order processor with renaming it will be executed like any other ALU op:

For cmov, it is probably true that it is implemented like this.

However, I would disagree with the "like any other ALU op" part. On recent AMD CPUs, for example, the SHL r, CL instruction does not read the flags register. If CL=0 (in which case the flags keep their previous value), there is instead a huge penalty (~25 cycles) if an instruction reads the flags afterwards. If CL!=0, it behaves as if the output doesn't have a dependency on the previous value of the flags.
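For reference, the architectural behavior behind the SHL example can be sketched in C, restricted to the result value and the carry flag. The type `shl_result` and the function `shl32_cl_model` are names invented for this sketch. It models only what the ISA specifies (count masked to 5 bits for 32-bit operands, flags unaffected when the masked count is 0) and says nothing about the AMD penalty itself.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical result type: shifted value plus the carry flag afterwards. */
typedef struct {
    uint32_t value;
    bool cf;
} shl_result;

/* Model of "SHL r32, CL", tracking only the value and CF. */
shl_result shl32_cl_model(uint32_t value, uint8_t cl, bool cf_in)
{
    unsigned count = cl & 31u;           /* count is masked to 5 bits */
    shl_result r = { value, cf_in };
    if (count == 0) {
        return r;                        /* flags unaffected: CF out == CF in */
    }
    r.cf = (value >> (32u - count)) & 1u; /* last bit shifted out */
    r.value = value << count;
    return r;
}
```

Only the `count == 0` path carries `cf_in` through to the output. For any nonzero count the new CF is a function of `value` and `count` alone, which is why an implementation can get away with not reading the old flags in the common case.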

travisdowns commented 4 years ago

> However, I would disagree with the "like any other ALU op" part.

You are right, I really should have said "like typical ALU ops", because there are definitely weird-behaving ALU ops, like all the variable shifts and all sorts of weird CISC stuff lurking. I keep a list of weird performance anomalies you might be interested in, although I haven't updated it in a while.