Quuxplusone / LLVMBugzillaTest

Code generated by clang-9 runs slower than code generated by previous versions (clang-8/clang-7). #42694

Open Quuxplusone opened 5 years ago

Quuxplusone commented 5 years ago
Bugzilla Link PR43724
Status NEW
Importance P normal
Reported by Xingbo Wu (wuxb45@gmail.com)
Reported on 2019-10-19 17:18:43 -0700
Last modified on 2019-10-20 09:11:32 -0700
Version 9.0
Hardware PC Linux
CC blitzrakete@gmail.com, craig.topper@gmail.com, dblaikie@gmail.com, dgregor@apple.com, erik.pilkington@gmail.com, llvm-bugs@lists.llvm.org, richard-llvm@metafoo.co.uk
Fixed by commit(s)
Attachments
Blocks
Blocked by
See also
Performance regression observed after upgrading to clang-9 on Arch Linux
x86_64/Broadwell (all packages are up to date).

The regression has been bisected to an rwlock implementation built on C11
atomic operations. Code that reproduces the regression on a few of my
Broadwell servers can be found here: https://github.com/wuxb45/atomictest

The most apparent difference is that clang-9 generates "lock incl" for +1
operations and "lock xadd" for +n operations, whereas clang-8 and clang-7 both
generate "lock addl".

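For reference, a minimal sketch of the kind of code in question (the names `lock_word` and `WRITER_ADD` and the memory orders are placeholders, not taken from the linked repository):

```c
#include <stdatomic.h>
#include <stdint.h>

/* Placeholder lock word and writer constant; the real layout lives in the linked repo. */
static _Atomic uint32_t lock_word;
#define WRITER_ADD 0x100u

void reader_enter(void) {
    /* +1 with the result discarded: clang-9 emits "lock incl",
       clang-7/8 emit "lock addl $1". */
    atomic_fetch_add_explicit(&lock_word, 1, memory_order_acquire);
}

void writer_enter(void) {
    /* +n with the result discarded: "lock xadd" with clang-9 per the report,
       "lock addl" with clang-8. */
    atomic_fetch_add_explicit(&lock_word, WRITER_ADD, memory_order_acquire);
}
```
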
The rest of the assembly looks similar but still shows some differences in the
ordering of a few if-else blocks. It is possible that branch prediction is
unluckily affected by the changed layout. However, the code in the link above
was excerpted from a larger code base after the regression was observed; both
versions of the code show the same behavior, and I still believe it is not
_sustainable_ to try to mitigate this issue by reordering source code.

The regression has been consistently observed on a Xeon E5 2697A v4 (HT off,
Broadwell) and a Xeon Gold 5120 (HT on, Skylake). I have not tried other CPUs.

More details are provided at the GitHub link above. I'm willing to move some
of that text here or provide anything else that could be useful.

Appreciate it.
Quuxplusone commented 5 years ago

The primary change that would have affected the use of lock inc/dec over lock add/sub is this one https://reviews.llvm.org/D58412. I don't see a way to restore the old behavior without reverting that change and recompiling the compiler from source.

I'm not aware of any change that would cause us to use xadd over add in clang-9.0. The only time we should be generating xadd is when the previous value is needed. And if the previous value is used by a compare, we try to get the answer from the flags of a regular lock add/sub/inc/dec if possible. Do you have an example where the behavior changed? I took your source file from github and put it in godbolt, and saw one xadd in both clang 9.0 and clang 8.0.
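
A minimal illustration of that distinction (a sketch under the rule described above, not taken from the reporter's code):

```c
#include <stdatomic.h>

_Atomic int counter;

void add_discarded(int n) {
    /* Previous value unused: a plain "lock add" (or inc/dec for +/-1) is enough. */
    atomic_fetch_add(&counter, n);
}

int add_and_use(int n) {
    /* Previous value is returned, so "lock xadd" is needed to capture it. */
    return atomic_fetch_add(&counter, n);
}
```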

There was also a bug fix for the encoding of immediates on a lock addq, where we would use a 4-byte immediate instruction when the immediate would fit in 1 byte. That bug is specific to addq and did not affect addl, but I wanted to mention it.

Quuxplusone commented 5 years ago

The difference in the code is between 'lock addl' and 'lock incl'. That xadd is from an unrelated function.

If you change those atomic_fetch_add() and atomic_fetch_sub() calls to use +2 or +3 instead of +1, you will see xadd with clang-9 but still addl with clang-8 (if I remember correctly).
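
Concretely, the change described is just the added constant; a self-contained sketch (not the repo's actual code):

```c
#include <stdatomic.h>

_Atomic unsigned refs;

void bump_by_one(void) {
    atomic_fetch_add(&refs, 1);   /* clang-8: "lock addl $1"; clang-9: "lock incl" */
}

void bump_by_two(void) {
    atomic_fetch_add(&refs, 2);   /* clang-8: "lock addl $2"; clang-9: reportedly "lock xadd" */
}
```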

I'm not sure whether the incl is the root cause. This report is about the performance regression, not "Mommy, I want addl".

If incl can still be slower, even if only in a few rare cases, then SlowIncDec should be kept. Compared to 14-byte-long NOPs and comprehensive loop unrolling, I don't feel that saving one or two bytes deserves that level of priority at -O3.

Alternatively, a compiler option or a "pragma use_addl" could also help (but I don't feel that's the right way to fix it).

Quuxplusone commented 5 years ago

After adding (and then removing) __builtin_expect in a few places, the performance results changed somewhat, but none got close to the level that clang-8 produced. This suggests that branch prediction and other optimizations cannot be ruled out as part of the problem.
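
For reference, the kind of annotation tried looks like this (a sketch; `WRITER_BIT`, `reader_trylock`, and the chosen branches are hypothetical, not the repo's actual code):

```c
#include <stdatomic.h>

#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Placeholder writer flag; the real lock layout is in the linked repo. */
#define WRITER_BIT 0x80000000u

/* Hint that the uncontended reader path is the common case. __builtin_expect
   only influences static branch weights and block layout, not the atomics. */
int reader_trylock(_Atomic unsigned *word) {
    unsigned old = atomic_fetch_add(word, 1);
    if (unlikely(old & WRITER_BIT)) {
        atomic_fetch_sub(word, 1);   /* a writer holds the lock; back off */
        return 0;
    }
    return 1;
}
```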