Quuxplusone / LLVMBugzillaTest

Usage of a copy of a register just after a mov instruction #18604

Open Quuxplusone opened 10 years ago

Quuxplusone commented 10 years ago
Bugzilla Link PR18605
Status NEW
Importance P normal
Reported by Jasper Neumann (jn@sirrida.de)
Reported on 2014-01-24 11:47:52 -0800
Last modified on 2014-02-12 20:05:28 -0800
Version trunk
Hardware PC Linux
CC atrick@apple.com, chandlerc@gmail.com, echristo@gmail.com, geek4civic@gmail.com, grosbach@apple.com, hfinkel@anl.gov, jn@sirrida.de, llvm-bugs@lists.llvm.org
Fixed by commit(s)
Attachments
Blocks
Blocked by
See also PR18602, PR18603, PR18604

Using a copy of a register just after the mov instruction that produced it may cost
an extra cycle on all but the very newest x86 processors (4th-generation Intel Core).
On superscalar processors, a mov instruction and a modification of its *source*
register can execute in parallel.
Typically, for every line of the C code below, the cost can be reduced from 3 cycles
to 2 (measured on e.g. Intel i7 920 and Intel Atom N450).
Even when the "correct order" is written explicitly (the last statement copies x into y
and then shifts x itself), the "wrong order" is produced.
Interestingly, GCC and ICC show the same strange behavior;
is there any reason to do it this way?

int test(int x) {
  int y;
  x ^= (x >> 2);
  x = (x >> 3) ^ x;
  x = x ^ (x >> 4);
  y = x;  x >>= 5;  x ^= y;  // almost the same but explicit
  return x;
}
=>
    movl    %edi, %eax
    sarl    $2, %eax  // => sarl    $2, %edi
    xorl    %edi, %eax
    movl    %eax, %ecx
    sarl    $3, %ecx  // => sarl    $3, %eax
    xorl    %eax, %ecx
    movl    %ecx, %edx
    sarl    $4, %edx  // => sarl    $4, %ecx
    xorl    %ecx, %edx
    movl    %edx, %eax
    sarl    $5, %eax  // => sarl    $5, %edx
    xorl    %edx, %eax
    retq

Quuxplusone commented 10 years ago

I think the "problem" here is that the 2-address pass does not know how to stretch the live range of a copy's source. A use of the copy that immediately follows always ends up using the result of the copy. I think there are cases where it would be useful to use the copy's source instead and shorten the critical path. However, in general we would avoid doing that because it potentially increases register pressure and constrains scheduling.

So it's tricky, and there is no motivation for improving it when the optimization target makes copies free.

Quuxplusone commented 10 years ago

It seems like it would be easy enough to adjust for this use-after-copy hazard, for those targets that need it, with a simple late MI pass.

Quuxplusone commented 10 years ago

I would be glad if such a peephole optimization were available.
IMHO it should be enabled by default, at least for x86.