Quuxplusone / LLVMBugzillaTest


[MMX] POSTRAScheduler reorders emms and MMX instructions #34955

Open Quuxplusone opened 6 years ago

Quuxplusone commented 6 years ago
Bugzilla Link PR35982
Status NEW
Importance P enhancement
Reported by Ilia Taraban (ilia.taraban@intel.com)
Reported on 2018-01-17 01:36:37 -0800
Last modified on 2020-08-30 15:19:29 -0700
Version trunk
Hardware PC Linux
CC Andrei.l.grischenko@intel.com, craig.topper@gmail.com, efriedma@quicinc.com, gonzalo.gadeschi@gmail.com, jyknight@google.com, llvm-bugs@lists.llvm.org, llvm-dev@redking.me.uk, richard-llvm@metafoo.co.uk, tstellar@redhat.com
Fixed by commit(s)
Attachments
Blocks PR42319
Blocked by
See also PR36354
POSTRAScheduler reorders the emms and MMX instructions, so we get the wrong result:

================= main.c ==============
#include <stdio.h>
#include <x86intrin.h>
float sum(__m64);
int main()
{
    float result;
    __m64 x = (_mm_set_pi32(5, 3));
    result = sum(x);
    printf("5 + 3 = %f\n", result);
    _mm_empty();

    return 0;
}

=======================================
================= nice.c ==============
#include <x86intrin.h>

float sum(__m64 x)
{
    double t;
    int x1, x2;

    x1 = _mm_cvtsi64_si32(x);
    x2 = _mm_cvtsi64_si32(_mm_unpackhi_pi32(x, x));

    _mm_empty();

    t = (float)x1 + (float)x2;

    return t;
}
=======================================

>>> clang -v
clang version 7.0.0 (trunk 322555)
Target: x86_64-unknown-linux-gnu
Thread model: posix
...

>>> clang -m32 -O0 -o nice.exe main.c nice.c
>>> ./nice.exe
5 + 3 = 8.000000

>>> clang -m32 -O2 -o nice.exe main.c nice.c
>>> ./nice.exe
5 + 3 = -nan

>>> clang -m32 -O2 -o nice.exe main.c nice.c -mllvm -opt-bisect-limit=194 &&
./nice.exe
...
BISECT: running pass (193) Tail Duplication on function (sum)
BISECT: running pass (194) Machine Copy Propagation Pass on function (sum)
BISECT: NOT running pass (195) Post RA top-down list latency scheduler on
function (sum)
BISECT: NOT running pass (196) Branch Probability Basic Block Placement on
function (sum)
...
5 + 3 = 8

>>> clang -m32 -O2 -o nice.exe main.c nice.c -mllvm -opt-bisect-limit=195 &&
./nice.exe
...
BISECT: running pass (193) Tail Duplication on function (sum)
BISECT: running pass (194) Machine Copy Propagation Pass on function (sum)
BISECT: running pass (195) Post RA top-down list latency scheduler on function
(sum)
BISECT: NOT running pass (196) Branch Probability Basic Block Placement on
function (sum)
BISECT: NOT running pass (197) X86 Execution Dependency Fix on function (sum)
...
5 + 3 = -nan

Let's look at ASM before and after POSTRAScheduler:
=============== nice-194.s ============
    cvtsi2ssl   %eax, %xmm0
    movq    8(%esp), %mm0
    emms
    punpckhdq   %mm0, %mm0      # mm0 = mm0[1,1]
    movd    %mm0, %ecx
    cvtsi2ssl   %ecx, %xmm1
=======================================

POSTRAScheduler changed order of operations to:

=============== nice-195.s ============
    movq    8(%esp), %mm0
    punpckhdq   %mm0, %mm0      # mm0 = mm0[1,1]
    movd    %mm0, %ecx
    emms
    cvtsi2ssl   %eax, %xmm0
    cvtsi2ssl   %ecx, %xmm1
=======================================

So now emms is placed before the MMX operations, and as a result we get the wrong
answer.
Quuxplusone commented 6 years ago
Oops, I mixed up nice-194.s and nice-195.s.

So it should be:

=============== nice-194.s ============
    movq    8(%esp), %mm0
    punpckhdq   %mm0, %mm0      # mm0 = mm0[1,1]
    movd    %mm0, %ecx
    emms
    cvtsi2ssl   %eax, %xmm0
    cvtsi2ssl   %ecx, %xmm1
=======================================
=============== nice-195.s ============
    cvtsi2ssl   %eax, %xmm0
    movq    8(%esp), %mm0
    emms
    punpckhdq   %mm0, %mm0      # mm0 = mm0[1,1]
    movd    %mm0, %ecx
    cvtsi2ssl   %ecx, %xmm1
=======================================
Quuxplusone commented 6 years ago

Simon, do we just need to mark EMMS/FEMMS as defining the MMX registers? That seems to at least fix this case. Maybe mark it as using them too?

Quuxplusone commented 6 years ago
(In reply to Craig Topper from comment #2)
> Simon, do we just need to mark EMMS/FEMMS as defining the MMX registers? That
> seems to at least fix this case. Maybe mark it as using them too?

Yes, and any x87 registers as well. I was considering just making them
terminators to prevent any crossover, even though that'd affect all
instructions, not just x87/MMX ones.
Quuxplusone commented 6 years ago
(In reply to Craig Topper from comment #2)
> Simon, do we just need to mark EMMS/FEMMS as defining the MMX registers? That
> seems to at least fix this case. Maybe mark it as using them too?

I don't think either of those is sufficient, given the definitions would be
dead (so you're effectively just clobbering the registers).

The x86-64 ABI says "The CPU shall be in x87 mode upon entry to a function.",
i.e. the x87 tag word should be set to all ones.  So the right way to handle
this is to explicitly model the tag word as a register: MMX instructions clobber
it, emms defines it, and call/return instructions use it.

There are also other potential "scheduling" problems: currently, MMX intrinsics
are marked IntrNoMem, so an IR transform could sink an MMX instruction past an
EMMS.  But that's sort of orthogonal to the MachineInstr modeling.
Quuxplusone commented 5 years ago

Is this a duplicate of https://bugs.llvm.org/show_bug.cgi?id=15388 ? It does look suspiciously similar.

I believe this bug causes undefined behavior in Rust code using MMX registers. We have had a lot of recurring bugs on Linux, Windows, and macOS targets (mostly 32-bit targets) over the last two years: some floating-point tests would suddenly start failing intermittently on some systems, the failures would disappear a couple of compiler builds later, only to reappear some time after that.

We have started to manually emit emms using inline assembly and that has fixed some of these intermittent issues for good, but this is probably something worth fixing at the LLVM level.

Frontends using the floating-point LLVM intrinsics should not have to use inline assembly to avoid undefined behavior.

Quuxplusone commented 5 years ago

https://reviews.llvm.org/D57298

Quuxplusone commented 5 years ago

A partial fix to clobber MM0-7 and ST0-7 on emms/femms was committed in r352642. This prevents the post-RA scheduler from affecting the test case here.

This is not a complete fix and it can break in other ways.

Quuxplusone commented 5 years ago

The fix committed in r352642 was reverted in r352782. It was recommitted in r353016. Hopefully it will stick this time.

Quuxplusone commented 4 years ago
Richard Smith came up with this test (I've slightly modified it for clarity).
Compile for 64-bit with -O2 -- https://godbolt.org/z/drn6zs.

=====
#include <mmintrin.h>
void g(__m64);

long long f() {
  volatile long long a = {0}, b = {0};
// x87 usage
  volatile long double d = 1.0;
  long long aload = a, bload = b;
  double dload = d;
  bool s = __builtin_signbitl(dload);
// MMX usage
  __m64 c = _m_punpckhbw (_mm_cvtsi64_m64(aload), _mm_cvtsi64_m64(bload));
  if (s) g(c);
  long long result = _mm_cvtm64_si64(c);
  _mm_empty();
// MMX cleared
  return result;
}

=====

_Z1fv: # @_Z1fv
  subq $72, %rsp
  movq $0, 24(%rsp)
  movq $0, 16(%rsp)
;; X87 instructions
  fld1
  fstpt 48(%rsp)
;; MMX instructions
  movq 24(%rsp), %mm0
  punpckhbw 16(%rsp), %mm0 # mm0 = mm0[4],mem[4],mm0[5],mem[5],mm0[6],mem[6],mm0[7],mem[7]
;; BROKEN: X87 instructions again...
  fldt 48(%rsp)
  fstpt 32(%rsp)
  movswq 40(%rsp), %rax
  testq %rax, %rax
  jns .LBB0_2
;; MMX instructions again...
  movq2dq %mm0, %xmm0
  movq %mm0, 8(%rsp) # 8-byte Spill
  callq _Z1gDv1_x
  movq 8(%rsp), %mm0 # 8-byte Reload
.LBB0_2:
  movq %mm0, %rax
;; Clear MMX state, safe to use x87 again.
  emms
  addq $72, %rsp
  retq
Quuxplusone commented 4 years ago

James, this most recent issue isn't unique to LLVM, is it? I think GCC had the same issue before recent versions switched to converting all MMX operations to SSE2.