Bugzilla Link | PR34733 |
Status | NEW |
Importance | P enhancement |
Reported by | Kadir Cetinkaya (kadircetinkaya.06.tr@gmail.com) |
Reported on | 2017-09-26 02:43:14 -0700 |
Last modified on | 2017-11-19 05:07:08 -0800 |
Version | 4.0 |
Hardware | PC Linux |
CC | craig.topper@gmail.com, llvm-bugs@lists.llvm.org, llvm-dev@redking.me.uk, peter@cordes.ca |
Fixed by commit(s) | |
Attachments | |
Blocks | PR10988 |
Blocked by | |
See also |
In 64-bit mode, 8a 04 25 disp32 is the optimal encoding for an absolute address
that fits in a sign-extended 32-bit constant. It's only 7 bytes instead of 9
for the mov al, moffs encoding.
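(For concreteness, the byte-level breakdown with a placeholder address; byte counts per the Intel SDM opcode tables:)
    8a 04 25 44 33 22 11            mov al, [0x11223344]        ; opcode + ModRM + SIB + disp32 = 7 bytes
    a0 44 33 22 11 00 00 00 00      movabs al, ds:0x11223344    ; opcode + moffs64 = 9 bytes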
AT&T mnemonics use movabs to explicitly specify the 64-bit moffs encoding (or
mov r64, imm64). clang requires movabs for 64-bit addresses, and always
assembles it to the moffs encoding even if ModRM+disp32 would be shorter.
IDK why the online assembler you linked is failing to make the optimization
when you used mov instead of movabs.
There is a semi-bug here, though. clang's assembler truncates addresses
that *don't* fit, with no warning. as (from GNU binutils, aka gas) uses movabs
/ moffs when needed. IDK if that was always the case, or if gas also used to
truncate instead of choosing the encoding based on the address. (There's also
the case where the address isn't known until link time, if we're considering
code models other than the default one, where all static code/data symbols are
in the low 2G.)
NASM warns when a constant won't fit into an immediate; clang should at least
warn, or follow the gas behaviour of promoting to movabs.
.intel_syntax noprefix
    mov eax, [0x123456789ABC]       ### broken with clang
    mov eax, [0x1]                  # uses mod/rm
    movabs eax, [0x1]               # explicitly request moffs encoding
    movabs eax, [0x123456789ABC]
assembles to this with clang, truncating the 64-bit address in the first
instruction:
0: 8b 04 25 bc 9a 78 56 mov eax,DWORD PTR ds:0x56789abc
7: 8b 04 25 01 00 00 00 mov eax,DWORD PTR ds:0x1
e: a1 01 00 00 00 00 00 00 00 movabs eax,ds:0x1
17: a1 bc 9a 78 56 34 12 00 00 movabs eax,ds:0x123456789abc
or gas:
0: a1 bc 9a 78 56 34 12 00 00 movabs eax,ds:0x123456789abc
9: 8b 04 25 01 00 00 00 mov eax,DWORD PTR ds:0x1
10: a1 01 00 00 00 00 00 00 00 movabs eax,ds:0x1
19: a1 bc 9a 78 56 34 12 00 00 movabs eax,ds:0x123456789abc
In 32-bit code, it's always a win to use the moffs encoding for absolute
addresses instead of ModRM + disp32, but not in 64-bit.
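(For reference, the same comparison in 32-bit mode, with a placeholder address; there the moffs form is 5 bytes vs. 6 for ModRM + disp32, so it's never worse:)
    a1 44 33 22 11                  mov eax, ds:0x11223344      ; moffs32: 5 bytes
    8b 05 44 33 22 11               mov eax, [0x11223344]       ; ModRM + disp32: 6 bytes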
(In reply to Peter Cordes from comment #1)
> In 64-bit mode, 8a 04 25 disp32 is the optimal encoding for an absolute
> address that fits in a sign-extended 32-bit constant. It's only 7 bytes
> instead of 9 for the mov al, moffs encoding.
>
I wasn't looking at it from an optimality perspective; I just wasn't expecting
this one to be generated. But you are right, they perform the same job.
Unfortunately, though, producing it with ModR/M and SIB is not optimal either:
you can add an address-size override prefix and get a 6-byte representation in
such cases. For example, mov eax, [0x11] can be encoded both ways (moffs
versions, of course):
a1 11 00 00 00 00 00 00 00 (you are right, this one is 9 bytes)
67 a1 11 00 00 00 (this one is only 6 bytes)
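(If you want to ask an assembler for that 6-byte form explicitly, NASM's a32 prefix does it; compare the listing near the end of this thread:)
    a32 mov eax, [abs 0x11]         ; 67 a1 11 00 00 00: 6-byte moffs with addr32
        mov eax, [abs 0x11]         ; 8b 04 25 11 00 00 00: 7-byte ModRM + disp32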
> AT&T mnemonics use movabs to explicitly specify the 64-bit moffs encoding
> (or mov r64, imm64). clang requires movabs for 64-bit addresses, and always
> assembles it to the moffs encoding even if ModRM+disp32 would be shorter.
>
As mentioned above, ModRM + disp32 being shorter seems to never actually be
the case, but I see your point: there is an ambiguity.
> IDK why the online assembler you linked is failing to make the optimization
> when you used mov instead of movabs.
>
>
Yeah, most likely an older version of the repo.
> There is a semi-bug here, though. clang's assembler truncates addresses
> that *don't* fit, with no warning. as (from GNU binutils, aka gas) uses
> movabs / moffs when needed. IDK if that was always the case, or if gas also
> used to truncate instead of choosing the encoding based on the address.
> (There's also the case where the address isn't known until link time, if
> we're considering code models other than the default one, where all static
> code/data symbols are in the low 2G.)
>
> NASM warns when a constant won't fit into an immediate; clang should at
> least warn, or follow the gas behaviour of promoting to movabs.
>
> .intel_syntax noprefix
> mov eax, [0x123456789ABC] ### broken with clang
> mov eax, [0x1] # uses mod/rm
> movabs eax, [0x1] # explicitly request moffs encoding
> movabs eax, [0x123456789ABC]
>
> assembles to this with clang, truncating the 64-bit address in the first
> instruction:
>
> 0: 8b 04 25 bc 9a 78 56 mov eax,DWORD PTR ds:0x56789abc
> 7: 8b 04 25 01 00 00 00 mov eax,DWORD PTR ds:0x1
> e: a1 01 00 00 00 00 00 00 00 movabs eax,ds:0x1
> 17: a1 bc 9a 78 56 34 12 00 00 movabs eax,ds:0x123456789abc
>
> or gas:
>
> 0: a1 bc 9a 78 56 34 12 00 00 movabs eax,ds:0x123456789abc
> 9: 8b 04 25 01 00 00 00 mov eax,DWORD PTR ds:0x1
> 10: a1 01 00 00 00 00 00 00 00 movabs eax,ds:0x1
> 19: a1 bc 9a 78 56 34 12 00 00 movabs eax,ds:0x123456789abc
>
>
> In 32-bit code, it's always a win to use the moffs encoding for absolute
> addresses instead of ModRM + disp32, but not in 64-bit.
Well, as I've mentioned above, and assuming the addr32 prefix does not cause
any unexpected behaviour (I couldn't find anything in the Intel SDM suggesting
it should), using movabs with the moffs encoding is never worse than
ModR/M + disp:
in the disp32 case (address smaller than 2^32), movabs with addr32 takes only
6 bytes, and
in the disp64 case (address bigger than 2^32), movabs takes 9 bytes, which is
not possible with ModR/M at all.
I've successfully run the following code on a Haswell machine without any
problems:
0000000000400477 <main>:
400477: 55 push %rbp
400478: 48 89 e5 mov %rsp,%rbp
40047b: 67 a0 7b 04 40 00 addr32 mov 0x40047b,%al
400481: b8 00 00 00 00 mov $0x0,%eax
400486: 5d pop %rbp
400487: c3 retq
400488: 0f 1f 84 00 00 00 00 nopl 0x0(%rax,%rax,1)
40048f: 00
So maybe clang should always prefer movabs when it is performing an access
with only a displacement?
(In reply to Kadir Cetinkaya from comment #2)
> Well, as I've mentioned above, and assuming the addr32 prefix does not
> cause any unexpected behaviour (I couldn't find anything in the Intel SDM
> suggesting it should).
32-bit addresses (from using a 0x67 address-size prefix) are zero-extended to
64-bit, so this does work, in the special case where the destination register
is al/ax/eax/rax (otherwise the moffs encoding isn't even an option).
Good idea but unfortunately there's a performance downside on Intel CPUs.
Unless the assembler also takes account of tune=bdver2 vs. tune=haswell, this
is a bad idea. (It should be a win for K8/K10/BD, and maybe Ryzen).
A 67 prefix changes the length of the *rest* of the instruction, so this is a
case of the dreaded Length-Changing Prefix. It does in fact cause a pre-decode
stall here on Skylake, just like add cx, 12345 (66 add modrm imm16, instead
of add modrm imm32). See the microarch PDF at http://agner.org/optimize/ and
search for "length-changing prefixes".
It seems Agner forgot about moffs, and says that the addr32 prefix isn't
length-changing in 64-bit mode. He says (for SnB): "An address size prefix
does not cause a penalty, even if it changes the length of the instruction."
He maybe only tested in 32-bit mode (where the prefix switches to 16-bit
address-size), or at least didn't test with movabs.
I tested a loop with NASM, on SKL i7-6700k. I repeated the instruction a huge
number of times to defeat the uop cache, to test decoding. (You'd see the same
bottleneck always on Core2/Nehalem, because they don't have a uop cache but
still have LCP stalls.) In *some* cases, like inside a small loop, 67 movabs
could be a win, because it will only decode once and execute many times. So if
the code-size saving gains us anything, the cost is amortized over many
iterations.
align 32
.loop:
    times 10240 mov eax, [abs buf]
    ; modrm encoding: 8b 04 25 40 19 61 00
    dec ebp
    jnz .loop
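(For completeness, a standalone NASM harness around that loop; a sketch, since the original testloop source wasn't posted. The outer-loop count and exit sequence are assumptions, chosen so the totals roughly match the counters below.)
    ; nasm -felf64 testloop.asm && ld -o testloop testloop.o
    global _start
    section .bss
    buf:    resd 1                  ; dword the loop loads from
    section .text
    _start:
        mov  ebp, 1000000           ; outer iterations (assumed): ~10^10 loads total
    align 32
    .loop:
        times 10240 mov eax, [abs buf]
        dec  ebp
        jnz  .loop
        mov  eax, 60                ; exit(0) via the Linux exit syscall
        xor  edi, edi
        syscall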
That runs at 2 loads per clock, measured with performance counters using
ocperf.py (a wrapper around Linux perf).
taskset -c 3 ocperf.py stat -etask-clock,context-switches,cpu-migrations,page-faults,cycles,instructions,branches,uops_issued.any,uops_executed.thread,ild_stall.lcp -r2 ./testloop
1313.898619 task-clock (msec) # 1.000 CPUs utilized ( +- 0.00% )
2 context-switches # 0.002 K/sec ( +- 50.00% )
0 cpu-migrations # 0.000 K/sec
4 page-faults # 0.003 K/sec
5,123,291,918 cycles # 3.899 GHz ( +- 0.02% )
10,243,978,854 instructions # 2.00 insn per cycle ( +- 0.00% )
1,349,684 branches # 1.027 M/sec ( +- 0.03% )
10,254,586,195 uops_issued_any # 7804.701 M/sec ( +- 0.00% )
10,254,726,690 uops_executed_thread # 7804.808 M/sec ( +- 0.00% )
120 ild_stall_lcp # 0.091 K/sec ( +- 1.26% )
Negligible Instruction-Length-Decoding LCP stalls (the ild_stall_lcp count
above is tiny compared to ~10^10 instructions).
--------
times 10240 a32 mov eax, [abs buf]
; 67 a1 40 f1 60 00 addr32 mov eax,ds:0x60f140
Runs at 0.3 loads per clock, bottlenecked on the front-end.
8866.655742 task-clock (msec) # 1.000 CPUs utilized ( +- 0.02% )
6 context-switches # 0.001 K/sec ( +- 9.09% )
0 cpu-migrations # 0.000 K/sec
3 page-faults # 0.000 K/sec
34,578,576,878 cycles # 3.900 GHz ( +- 0.02% )
10,254,843,120 instructions # 0.30 insn per cycle ( +- 0.00% )
3,268,110 branches # 0.369 M/sec ( +- 0.24% )
10,266,969,376 uops_issued_any # 1157.930 M/sec ( +- 0.00% )
10,268,067,888 uops_executed_thread # 1158.054 M/sec ( +- 0.00% )
30,720,269,301 ild_stall_lcp # 3464.696 M/sec ( +- 0.00% )
8.866540734 seconds time elapsed ( +- 0.02% )
Note that the ild_stall_lcp performance counter is almost equal to cycles
(and, surprisingly, higher than the uop count).