Quuxplusone / LLVMBugzillaTest

[llvm-exegesis] analysis: incorrect analysis for chained(?) instructions? #40245

Open Quuxplusone opened 5 years ago

Quuxplusone commented 5 years ago
Bugzilla Link: PR41275
Status: NEW
Importance: P enhancement
Reported by: Roman Lebedev (lebedev.ri@gmail.com)
Reported on: 2019-03-28 07:31:44 -0700
Last modified on: 2019-03-29 10:54:04 -0700
Version: trunk
Hardware: PC Linux
CC: clement.courbet@gmail.com, gchatelet@google.com, llvm-bugs@lists.llvm.org, llvm-dev@redking.me.uk
Fixed by commit(s):
Attachments:
Blocks:
Blocked by:
See also:

llvm-exegesis sometimes chains instructions so that it can measure their
characteristics.
It happens e.g. for CMP, TEST, BT, SETcc, CVT*, etc.

But in analysis, that chaining does not appear to be accounted for.
Example:

$ ./bin/llvm-exegesis -num-repetitions=10000 -mode=latency -opcode-name=BT32rr
Check generated assembly with: /usr/bin/objdump -d /tmp/snippet-1ce014.o
---
mode:            latency
key:
  instructions:
    - 'BT32rr R12D EDX'
    - 'CMOVA16rr DX DX BP'
  config:          ''
  register_initial_values:
    - 'R12D=0x0'
    - 'EDX=0x0'
    - 'DX=0x0'
    - 'BP=0x0'
cpu_name:        bdver2
llvm_triple:     x86_64-unknown-linux-gnu
num_repetitions: 10000
measurements:
  - { key: latency, value: 1.0454, per_snippet_value: 2.0908 }
error:           ''
info:            Repeating two instructions
assembled_snippet:
55415441BC00000000BA0000000066BA000066BD0000410FA3D4660F47D5410FA3D4660F47D5410FA3D4660F47D5410FA3D4660F47D5410FA3D4660F47D5410FA3D4660F47D5410FA3D4660F47D5410FA3D4660F47D5415C5DC3
...
Check generated assembly with: /usr/bin/objdump -d /tmp/snippet-591007.o
---
mode:            latency
key:
  instructions:
    - 'BT32rr R11D R8D'
    - 'CMOVL32rr R8D R8D R8D'
  config:          ''
  register_initial_values:
    - 'R11D=0x0'
    - 'R8D=0x0'
cpu_name:        bdver2
llvm_triple:     x86_64-unknown-linux-gnu
num_repetitions: 10000
measurements:
  - { key: latency, value: 0.8549, per_snippet_value: 1.7098 }
error:           ''
info:            Repeating two instructions
assembled_snippet:
41BB0000000041B800000000450FA3C3450F4CC0450FA3C3450F4CC0450FA3C3450F4CC0450FA3C3450F4CC0450FA3C3450F4CC0450FA3C3450F4CC0450FA3C3450F4CC0450FA3C3450F4CC0C3
...
Check generated assembly with: /usr/bin/objdump -d /tmp/snippet-fba941.o
---
mode:            latency
key:
  instructions:
    - 'BT32rr R12D ESI'
    - 'SETBEr SIL'
  config:          ''
  register_initial_values:
    - 'R12D=0x0'
    - 'ESI=0x0'
cpu_name:        bdver2
llvm_triple:     x86_64-unknown-linux-gnu
num_repetitions: 10000
measurements:
  - { key: latency, value: 1.0265, per_snippet_value: 2.053 }
error:           ''
info:            Repeating two instructions
assembled_snippet:
415441BC00000000BE00000000410FA3F4400F96C6410FA3F4400F96C6410FA3F4400F96C6410FA3F4400F96C6410FA3F4400F96C6410FA3F4400F96C6410FA3F4400F96C6410FA3F4400F96C6415CC3
...
Check generated assembly with: /usr/bin/objdump -d /tmp/snippet-ad11fe.o
---
mode:            latency
key:
  instructions:
    - 'BT32rr R9D R12D'
    - 'SETB_C32r R12D'
  config:          ''
  register_initial_values:
    - 'R9D=0x0'
    - 'R12D=0x0'
cpu_name:        bdver2
llvm_triple:     x86_64-unknown-linux-gnu
num_repetitions: 10000
measurements:
  - { key: latency, value: 0.3787, per_snippet_value: 0.7574 }
error:           ''
info:            Repeating two instructions
assembled_snippet:
415441B90000000041BC00000000450FA3E1450FA3E1450FA3E1450FA3E1450FA3E1450FA3E1450FA3E1450FA3E1415CC3
...
Check generated assembly with: /usr/bin/objdump -d /tmp/snippet-97701b.o
---
mode:            latency
key:
  instructions:
    - 'BT32rr EBX EBX'
    - 'RCR16ri BX BX i_0x1'
  config:          ''
  register_initial_values:
    - 'EBX=0x0'
    - 'BX=0x0'
cpu_name:        bdver2
llvm_triple:     x86_64-unknown-linux-gnu
num_repetitions: 10000
measurements:
  - { key: latency, value: 5.0276, per_snippet_value: 10.0552 }
error:           ''
info:            Repeating two instructions
assembled_snippet:
53BB0000000066BB00000FA3DB66C1DB010FA3DB66C1DB010FA3DB66C1DB010FA3DB66C1DB010FA3DB66C1DB010FA3DB66C1DB010FA3DB66C1DB010FA3DB66C1DB015BC3
...
Check generated assembly with: /usr/bin/objdump -d /tmp/snippet-d11a67.o
---
mode:            latency
key:
  instructions:
    - 'BT32rr R14D R10D'
    - 'CMOVLE32rr R14D R14D R9D'
  config:          ''
  register_initial_values:
    - 'R14D=0x0'
    - 'R10D=0x0'
    - 'R9D=0x0'
cpu_name:        bdver2
llvm_triple:     x86_64-unknown-linux-gnu
num_repetitions: 10000
measurements:
  - { key: latency, value: 0.9323, per_snippet_value: 1.8646 }
error:           ''
info:            Repeating two instructions
assembled_snippet:
415641BE0000000041BA0000000041B900000000450FA3D6450F4EF1450FA3D6450F4EF1450FA3D6450F4EF1450FA3D6450F4EF1450FA3D6450F4EF1450FA3D6450F4EF1450FA3D6450F4EF1450FA3D6450F4EF1415EC3
...
Check generated assembly with: /usr/bin/objdump -d /tmp/snippet-828f4c.o
---
mode:            latency
key:
  instructions:
    - 'BT32rr ESI EAX'
    - 'SETPr SIL'
  config:          ''
  register_initial_values:
    - 'ESI=0x0'
    - 'EAX=0x0'
cpu_name:        bdver2
llvm_triple:     x86_64-unknown-linux-gnu
num_repetitions: 10000
measurements:
  - { key: latency, value: 0.7112, per_snippet_value: 1.4224 }
error:           ''
info:            Repeating two instructions
assembled_snippet:
BE00000000B8000000000FA3C6400F9AC60FA3C6400F9AC60FA3C6400F9AC60FA3C6400F9AC60FA3C6400F9AC60FA3C6400F9AC60FA3C6400F9AC60FA3C6400F9AC6C3
...
Check generated assembly with: /usr/bin/objdump -d /tmp/snippet-9f6fb0.o
---
mode:            latency
key:
  instructions:
    - 'BT32rr R9D R11D'
    - 'CMOVP64rr R9 R9 R10'
  config:          ''
  register_initial_values:
    - 'R9D=0x0'
    - 'R11D=0x0'
    - 'R9=0x0'
    - 'R10=0x0'
cpu_name:        bdver2
llvm_triple:     x86_64-unknown-linux-gnu
num_repetitions: 10000
measurements:
  - { key: latency, value: 0.8673, per_snippet_value: 1.7346 }
error:           ''
info:            Repeating two instructions
assembled_snippet:
41B90000000041BB0000000049B9000000000000000049BA0000000000000000450FA3D94D0F4ACA450FA3D94D0F4ACA450FA3D94D0F4ACA450FA3D94D0F4ACA450FA3D94D0F4ACA450FA3D94D0F4ACA450FA3D94D0F4ACA450FA3D94D0F4ACAC3
...
Check generated assembly with: /usr/bin/objdump -d /tmp/snippet-3c8869.o
---
mode:            latency
key:
  instructions:
    - 'BT32rr R14D R10D'
    - 'ADC64rr_REV R10 R10 R9'
  config:          ''
  register_initial_values:
    - 'R14D=0x0'
    - 'R10D=0x0'
    - 'R10=0x0'
    - 'R9=0x0'
cpu_name:        bdver2
llvm_triple:     x86_64-unknown-linux-gnu
num_repetitions: 10000
measurements:
  - { key: latency, value: 1.0499, per_snippet_value: 2.0998 }
error:           ''
info:            Repeating two instructions
assembled_snippet:
415641BE0000000041BA0000000049BA000000000000000049B90000000000000000450FA3D64D13D1450FA3D64D13D1450FA3D64D13D1450FA3D64D13D1450FA3D64D13D1450FA3D64D13D1450FA3D64D13D1450FA3D64D13D1415EC3
...
Check generated assembly with: /usr/bin/objdump -d /tmp/snippet-03e41c.o
---
mode:            latency
key:
  instructions:
    - 'BT32rr EDX EDI'
    - 'RCL64rCL RDI RDI'
  config:          ''
  register_initial_values:
    - 'EDX=0x0'
    - 'EDI=0x0'
    - 'RDI=0x0'
    - 'CL=0x0'
cpu_name:        bdver2
llvm_triple:     x86_64-unknown-linux-gnu
num_repetitions: 10000
measurements:
  - { key: latency, value: 3.7896, per_snippet_value: 7.5792 }
error:           ''
info:            Repeating two instructions
assembled_snippet:
BA00000000BF0000000048BF0000000000000000B1000FA3FA48D3D70FA3FA48D3D70FA3FA48D3D70FA3FA48D3D70FA3FA48D3D70FA3FA48D3D70FA3FA48D3D70FA3FA48D3D7C3
...

So for a single opcode (BT32rr), we got several wildly different per-
instruction latency values: 1.0454, 0.8549, 1.0265, 0.3787, 5.0276, 0.9323,
0.7112, 0.8673, 1.0499, 3.7896.
These are the values that analysis mode will use.
It does not appear to account for the second instruction in the snippet.

I'm not sure what it *should* be doing, but that does not seem like the correct
thing to do?
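
To make the numbers above concrete: in every snippet shown, the reported `value` looks like `per_snippet_value` divided by the number of instructions in the snippet, so the second instruction's cost is simply averaged in. A minimal sketch that checks this against three of the measurements above (the helper name `per_instruction_average` is made up for illustration, it is not part of llvm-exegesis):

```python
def per_instruction_average(per_snippet_value, num_instructions):
    """Average cost per instruction, with no attempt to separate the two."""
    return per_snippet_value / num_instructions


# (instructions in the snippet, reported value, reported per_snippet_value),
# copied from the benchmark output in this report.
measurements = [
    (("BT32rr", "CMOVA16rr"), 1.0454, 2.0908),
    (("BT32rr", "SETB_C32r"), 0.3787, 0.7574),
    (("BT32rr", "RCR16ri"), 5.0276, 10.0552),
]

for instructions, value, per_snippet_value in measurements:
    average = per_instruction_average(per_snippet_value, len(instructions))
    # The reported value matches the plain average over both instructions.
    assert abs(average - value) < 1e-4
    print("+".join(instructions), "average:", round(average, 4))
```

If that reading is right, the BT32rr number depends entirely on which instruction it happened to be paired with.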
Quuxplusone commented 5 years ago
---
mode:            latency
key:
  instructions:
    - 'BT32rr R11D R11D'
    - 'RCR8rCL R11B R11B'
  config:          ''
  register_initial_values:
    - 'R11D=0x0'
    - 'R11B=0x0'
    - 'CL=0x0'
cpu_name:        bdver2
llvm_triple:     x86_64-unknown-linux-gnu
num_repetitions: 10000
measurements:
  - { key: latency, value: 5.7795, per_snippet_value: 11.559 }
error:           ''
info:            Repeating two instructions
assembled_snippet:
41BB0000000041B300B100450FA3DB41D2DB450FA3DB41D2DB450FA3DB41D2DB450FA3DB41D2DB450FA3DB41D2DB450FA3DB41D2DB450FA3DB41D2DB450FA3DB41D2DBC3
...
---
mode:            latency
key:
  instructions:
    - 'RCR8rCL R12B R12B'
  config:          ''
  register_initial_values:
    - 'R12B=0x0'
    - 'CL=0x0'
    - 'EFLAGS=0x0'
cpu_name:        bdver2
llvm_triple:     x86_64-unknown-linux-gnu
num_repetitions: 1000
measurements:
  - { key: latency, value: 11.288, per_snippet_value: 11.288 }
error:           ''
info:            Repeating a single implicitly serial instruction
assembled_snippet:
415441B400B1004883EC08C7042400000000C7442404000000009D41D2DC41D2DC41D2DC41D2DC41D2DC41D2DC41D2DC41D2DC41D2DC41D2DC41D2DC41D2DC41D2DC41D2DC41D2DC41D2DC415CC3
...

Shot in the dark:
Suppose we have instruction A with unknown latency x, and instruction B with
known latency L1.
We also know the latency L12 of executing instruction A and instruction B
*serially*.
Given that the execution was serial, is it correct to compute the latency of
instruction A
as  x = L12 - L1  ?
I.e. the real latency of BT32rr is 11.559-11.288 = 0.271 == 1?
Quuxplusone commented 5 years ago

> I.e. the real latency of BT32rr is 11.559-11.288 = 0.271 == 1?

That is correct. At one point Guillaume (cced) was looking into forming a hierarchy of measurements to make sure that we always had the latency for the back-to-back instruction.

Quuxplusone commented 5 years ago

Indeed. I had one version of llvm-exegesis which computed the dependency graph, but since it evaluated everything upfront, it took a lot more time to execute.

We ended up not offering this option, to keep things simple, and assumed that it would be best to solve this as a post-processing step.

Quuxplusone commented 5 years ago
(In reply to Clement Courbet from comment #2)
> > I.e. the real latency of BT32rr is 11.559-11.288 = 0.271 == 1?
>
> That is correct.
Aha! So the target latency of the instruction is per_snippet_value - (sum of
actual latencies of other instructions in that snippet).
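
A minimal sketch of that subtraction, using the two measurements from comment #1. It assumes the chain really executes serially and that the helper instruction costs the same inside the chain as when measured alone (the standalone RCR8rCL run used a different register and 1000 repetitions). The function name `isolated_latency` is hypothetical and not an llvm-exegesis API.

```python
def isolated_latency(per_snippet_value, other_latencies):
    """Latency left for the target instruction after subtracting the
    separately measured latencies of the other instructions in the chain."""
    return per_snippet_value - sum(other_latencies)


# Numbers copied from the benchmark output in comment #1.
chain_per_snippet = 11.559  # BT32rr + RCR8rCL, per_snippet_value
rcr8rcl_alone = 11.288      # RCR8rCL measured on its own

bt32rr = isolated_latency(chain_per_snippet, [rcr8rcl_alone])
print(round(bt32rr, 3))  # 0.271
```

The raw difference is 0.271 cycles; how that should be rounded or interpreted is a separate question.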

> At one point Guillaume (cced) was looking into forming a
> hierarchy of measurements to make sure that we always had the latency for
> the back-to-back instruction.

Anything I should be aware of? I suspect this might be the next big issue
I have with llvm-exegesis, one that I'd like to resolve or see resolved.
Quuxplusone commented 5 years ago

https://reviews.llvm.org/D60000