Moffs operands are not encoded #33705


Quuxplusone commented 7 years ago
Bugzilla Link PR34733
Status NEW
Importance P enhancement
Reported by Kadir Cetinkaya (kadircetinkaya.06.tr@gmail.com)
Reported on 2017-09-26 02:43:14 -0700
Last modified on 2017-11-19 05:07:08 -0800
Version 4.0
Hardware PC Linux
CC craig.topper@gmail.com, llvm-bugs@lists.llvm.org, llvm-dev@redking.me.uk, peter@cordes.ca
Fixed by commit(s)
Attachments
Blocks PR10988
Blocked by
See also
LLVM is not able to generate instructions with moffs operands. When I try to
assemble

mov al, [0x01]

with llvm-mc, it generates the encoding [0x8a,0x04,0x25,0x01,0x00,0x00,0x00],
which is an instruction that uses a full ModR/M and SIB byte and decodes to
mov al, byte ptr [0x01].

It should have generated "\xa0\x01\x00\x00\x00\x00\x00\x00\x00", as described
on page 45 of the Intel SDM, Instruction Set Reference, July 2017:

"
2.2.1.4 Direct Memory-Offset MOVs
In 64-bit mode, direct memory-offset forms of the MOV instruction are extended
to specify a 64-bit immediate absolute address. This address is called a
moffset. No prefix is needed to specify this 64-bit memory offset. For these
MOV instructions, the size of the memory offset follows the address-size
default (64 bits in 64-bit mode). See Table 2-6.
"

Somehow, this assembler
(http://shell-storm.org/online/Online-Assembler-and-Disassembler/?inst=mov+al%2C+%5B0x01%5D&arch=x86-64#assembly)
manages to do it, even though they are using Keystone and Capstone. Maybe it
was correct in previous versions and got broken in a later release?
Quuxplusone commented 7 years ago
In 64-bit mode, 8a 04 25 disp32 is the optimal encoding for an absolute address
that fits in a sign-extended 32-bit constant.  It's only 7 bytes instead of 9
for the mov al, moffs encoding.
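Concretely, putting the two encodings from this report side by side:

    8a 04 25 01 00 00 00               mov al, byte ptr [0x1]   ; ModRM + SIB: 7 bytes
    a0 01 00 00 00 00 00 00 00         mov al, ds:0x1           ; moffs64:     9 bytes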

AT&T mnemonics use movabs to explicitly specify the 64-bit moffs encoding (or
mov r64, imm64).  clang requires movabs for 64-bit addresses, and always
assembles it to the moffs encoding even if ModRM+disp32 would be shorter.

IDK why the online assembler you linked is failing to make the optimization
when you used mov instead of movabs.

There is a semi-bug here, though.  clang's assembler truncates addresses that
*don't* fit, with no warning.  as (from GNU binutils, aka gas) uses movabs
/ moffs when needed.  IDK if that was always the case, or if gas also used to
truncate instead of optimizing based on the address.  (But there's also the
case where the address isn't known until link time, if we're considering code
models other than the default, where all static code/data symbols are in the
low 2G.)

NASM warns when a constant won't fit into an immediate; clang should at least
warn, or follow the gas behaviour of promoting to movabs.

.intel_syntax noprefix
    mov    eax, [0x123456789ABC]    ### broken with clang
    mov    eax, [0x1]             # uses mod/rm
    movabs eax, [0x1]             # explicitly request moffs encoding
    movabs eax, [0x123456789ABC]

assembles to this with clang, truncating the 64-bit address in the first
instruction:

   0:   8b 04 25 bc 9a 78 56    mov    eax,DWORD PTR ds:0x56789abc
   7:   8b 04 25 01 00 00 00    mov    eax,DWORD PTR ds:0x1
   e:   a1 01 00 00 00 00 00 00 00      movabs eax,ds:0x1
  17:   a1 bc 9a 78 56 34 12 00 00      movabs eax,ds:0x123456789abc

or gas:

   0:   a1 bc 9a 78 56 34 12 00 00      movabs eax,ds:0x123456789abc
   9:   8b 04 25 01 00 00 00            mov    eax,DWORD PTR ds:0x1
  10:   a1 01 00 00 00 00 00 00 00      movabs eax,ds:0x1
  19:   a1 bc 9a 78 56 34 12 00 00      movabs eax,ds:0x123456789abc

In 32-bit code, it's always a win to use movabs / moffs for absolute addresses
instead of a modrm + disp32, but not in 64-bit.
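A quick illustration with the standard 32-bit encodings:

    a1 01 00 00 00          mov    eax, ds:0x1             ; moffs32: 5 bytes
    8b 05 01 00 00 00       mov    eax, DWORD PTR ds:0x1   ; ModRM + disp32: 6 bytes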
Quuxplusone commented 7 years ago
(In reply to Peter Cordes from comment #1)
> In 64-bit mode, 8a 04 25 disp32 is the optimal encoding for an absolute
> address that fits in a sign-extended 32-bit constant.  It's only 7 bytes
> instead of 9 for the mov al, moffs encoding.
>
I wasn't looking at it from an optimality perspective; I just wasn't expecting
this one to be generated, but you are right, they perform the same job.
Unfortunately, though, producing it with ModR/M and SIB is not always optimal:
you can add an address-size override prefix and get a 6-byte representation in
such cases. For example, mov eax, [0x11] can be encoded with both (moffs
versions, of course):
a1 11 00 00 00 00 00 00 00 (you are right, this one is 9 bytes)
67 a1 11 00 00 00 (this one is only 6 bytes)

> AT&T mnemonics use movabs to explicitly specify the 64-bit moffs encoding
> (or mov r64, imm64).  clang requires movabs for 64-bit addresses, and always
> assembles it to the moffs encoding even if ModRM+disp32 would be shorter.
>
As I mentioned, this seems to never be the case, but I see your point: there
is an ambiguity.

> IDK why the online assembler you linked is failing to make the optimization
> when you used mov instead of movabs.
>
>
Yeah, most likely an older version of the repo.

> There is a semi-bug here, though.  clang's assembler truncates addresses
> that *don't* fit, with no warning.  as (from GNU binutils, aka gas) uses
> movabs / moffs when needed.  IDK if that was always the case, or if gas also
> used to truncate instead of optimizing based on the address.  (But there's
> also the case where the address isn't known until link time, if we're
> considering code models other than the default where all static code/data
> symbols are in the low 2G).
>
> NASM warns when a constant won't fit into an immediate; clang should at
> least warn, or follow the gas behaviour of promoting to movabs.
>
> .intel_syntax noprefix
>     mov    eax, [0x123456789ABC]    ### broken with clang
>     mov    eax, [0x1]             # uses mod/rm
>     movabs eax, [0x1]             # explicitly request moffs encoding
>     movabs eax, [0x123456789ABC]
>
> assembles to this with clang, truncating the 64-bit address in the first
> instruction:
>
>    0:   8b 04 25 bc 9a 78 56    mov    eax,DWORD PTR ds:0x56789abc
>    7:   8b 04 25 01 00 00 00    mov    eax,DWORD PTR ds:0x1
>    e:   a1 01 00 00 00 00 00 00 00      movabs eax,ds:0x1
>   17:   a1 bc 9a 78 56 34 12 00 00      movabs eax,ds:0x123456789abc
>
> or gas:
>
>    0:   a1 bc 9a 78 56 34 12 00 00      movabs eax,ds:0x123456789abc
>    9:   8b 04 25 01 00 00 00            mov    eax,DWORD PTR ds:0x1
>   10:   a1 01 00 00 00 00 00 00 00      movabs eax,ds:0x1
>   19:   a1 bc 9a 78 56 34 12 00 00      movabs eax,ds:0x123456789abc
>
>
> In 32-bit code, it's always a win to use movabs / moffs for absolute
> addresses instead of a modrm + disp32, but not in 64-bit.

Well, as I've mentioned above, provided that adding addr32 does not cause any
unknown behaviours (I couldn't find anything in the Intel SDM suggesting it
should), using movabs with moffs is always at least as small as ModR/M + disp:

In the disp32 case (address smaller than 2^32), movabs takes only 6 bytes
(with the addr32 prefix), and
in the disp64 case (address bigger than 2^32), movabs takes only 9 bytes,
whereas that address is not encodable with ModR/M at all.
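Side by side, for a displacement-only load into eax in 64-bit mode:

    address fits in 32 bits:   67 a1 disp32   (6 bytes)   vs.  8b 04 25 disp32   (7 bytes)
    address needs 64 bits:     a1 disp64      (9 bytes)   vs.  no ModR/M encoding exists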

I've successfully run the following code on a Haswell machine without any
problems:
0000000000400477 <main>:
  400477:   55                      push   %rbp
  400478:   48 89 e5                mov    %rsp,%rbp
  40047b:   67 a0 7b 04 40 00       addr32 mov 0x40047b,%al
  400481:   b8 00 00 00 00          mov    $0x0,%eax
  400486:   5d                      pop    %rbp
  400487:   c3                      retq
  400488:   0f 1f 84 00 00 00 00    nopl   0x0(%rax,%rax,1)
  40048f:   00

So maybe clang should always prefer movabs when it is performing an access
with only a displacement?
Quuxplusone commented 7 years ago
(In reply to Kadir Cetinkaya from comment #2)
> Well, as I've mentioned above, provided that adding addr32 does not cause
> any unknown behaviours (I couldn't find anything in the Intel SDM suggesting
> it should).

32-bit addresses (from using a 0x67 address-size prefix) are zero-extended to
64-bit, so this does work in the special case where the destination register is
al/rax (otherwise movabs isn't even an option).

Good idea but unfortunately there's a performance downside on Intel CPUs.
Unless the assembler also takes account of tune=bdver2 vs. tune=haswell, this
is a bad idea.  (It should be a win for K8/K10/BD, and maybe Ryzen).

A 67 prefix changes the length of the *rest* of the instruction, so this is a
case of the dreaded Length-Changing Prefix.  It does in fact cause a pre-decode
stall here on Skylake, just like add cx, 12345 (66-prefixed add modrm imm16,
instead of add modrm imm32).  See the microarch PDF at
http://agner.org/optimize/ and search for "length-changing prefixes".

It seems Agner forgot about moffs, and says that the addr32 prefix isn't
length-changing in 64-bit mode.  He says (for SnB): "An address size prefix
does not cause a penalty, even if it changes the length of the instruction."
He may have only tested in 32-bit mode (where the prefix switches to 16-bit
address size), or at least didn't test with movabs.

I tested a loop with NASM, on SKL i7-6700k.  I repeated the instruction a huge
number of times to defeat the uop cache, to test decoding.  (You'd always see
the same bottleneck on Core2/Nehalem, because they don't have a uop cache but
still have LCP stalls.)  In *some* cases, like inside a small loop, 67 movabs
could be a win, because it will only decode once and execute many times, so if
the code-size saving gains us anything, the cost is amortized over many
iterations.

    align 32
    .loop:
    times 10240     mov  eax, [abs buf]
                    ; modrm encoding:  8b 04 25 40 19 61 00
    dec   ebp
    jnz   .loop

That runs at 2 loads per clock, measured with performance counters using
ocperf.py (a wrapper around Linux perf).
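The ./testloop binary referenced below might be built from a harness like this
(my reconstruction; the buffer placement and iteration count are assumptions,
not the exact file used):

    section .bss
    buf:    resb    4

    section .text
    global _start
    _start:
        mov     ebp, 100000             ; outer-loop iteration count (assumed)
        align 32
    .loop:
        times 10240     mov  eax, [abs buf]
        dec     ebp
        jnz     .loop
        mov     eax, 60                 ; SYS_exit
        xor     edi, edi
        syscall

    ; built with:  nasm -felf64 testloop.asm && ld -o testloop testloop.o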

taskset -c 3 ocperf.py stat -etask-clock,context-switches,cpu-migrations,page-faults,cycles,instructions,branches,uops_issued.any,uops_executed.thread,ild_stall.lcp -r2 ./testloop

       1313.898619      task-clock (msec)         #    1.000 CPUs utilized            ( +-  0.00% )
                 2      context-switches          #    0.002 K/sec                    ( +- 50.00% )
                 0      cpu-migrations            #    0.000 K/sec
                 4      page-faults               #    0.003 K/sec
     5,123,291,918      cycles                    #    3.899 GHz                      ( +-  0.02% )
    10,243,978,854      instructions              #    2.00  insn per cycle           ( +-  0.00% )
         1,349,684      branches                  #    1.027 M/sec                    ( +-  0.03% )
    10,254,586,195      uops_issued_any           # 7804.701 M/sec                    ( +-  0.00% )
    10,254,726,690      uops_executed_thread      # 7804.808 M/sec                    ( +-  0.00% )
               120      ild_stall_lcp             #    0.091 K/sec                    ( +-  1.26% )

Negligible Instruction-Length-Decoding (LCP) stalls.

--------

    times 10240 a32 mov  eax, [abs buf]
     ;  67 a1 40 f1 60 00       addr32 mov eax,ds:0x60f140

Runs at 0.3 loads per clock, bottlenecked on the front-end.

       8866.655742      task-clock (msec)         #    1.000 CPUs utilized            ( +-  0.02% )
                 6      context-switches          #    0.001 K/sec                    ( +-  9.09% )
                 0      cpu-migrations            #    0.000 K/sec
                 3      page-faults               #    0.000 K/sec
    34,578,576,878      cycles                    #    3.900 GHz                      ( +-  0.02% )
    10,254,843,120      instructions              #    0.30  insn per cycle           ( +-  0.00% )
         3,268,110      branches                  #    0.369 M/sec                    ( +-  0.24% )
    10,266,969,376      uops_issued_any           # 1157.930 M/sec                    ( +-  0.00% )
    10,268,067,888      uops_executed_thread      # 1158.054 M/sec                    ( +-  0.00% )
    30,720,269,301      ild_stall_lcp             # 3464.696 M/sec                    ( +-  0.00% )

       8.866540734 seconds time elapsed                                          ( +-  0.02% )

Note the ild_stall_lcp performance counter being almost equal to cycles (and,
surprisingly, higher than the uop count).