Open Quuxplusone opened 5 years ago
Bugzilla Link | PR43724 |
Status | NEW |
Importance | P normal |
Reported by | Xingbo Wu (wuxb45@gmail.com) |
Reported on | 2019-10-19 17:18:43 -0700 |
Last modified on | 2019-10-20 09:11:32 -0700 |
Version | 9.0 |
Hardware | PC Linux |
CC | blitzrakete@gmail.com, craig.topper@gmail.com, dblaikie@gmail.com, dgregor@apple.com, erik.pilkington@gmail.com, llvm-bugs@lists.llvm.org, richard-llvm@metafoo.co.uk |
Fixed by commit(s) | |
Attachments | |
Blocks | |
Blocked by | |
See also |
The primary change that would have affected the use of lock inc/dec over lock add/sub is https://reviews.llvm.org/D58412. I don't see a way to restore the old behavior without reverting that change and recompiling the compiler from source.
I'm not aware of any change that would cause us to use xadd over add in clang-9.0. The only time we should be generating xadd is when the previous value is needed. And if the previous value is only used by a compare, we try to get the answer from the flags of a regular lock add/sub/inc/dec if possible. Do you have an example where the behavior changed? I took your source file from github and put it in godbolt, and saw one xadd in both clang 9.0 and clang 8.0.
There was also a bug fix for the encoding of immediates on a lock addq, where we would use a 4-byte immediate instruction even when the immediate would fit in 1 byte. That bug is specific to addq and did not affect addl, but I wanted to mention it.
The difference in the code is between 'lock addl' and 'lock incl'. That xadd is from an unrelated function.
If you change those atomic_fetch_add() and atomic_fetch_sub() calls to use +2 or +3 instead of +1, you will see xadd with clang-9 but still addl with clang-8 (if I remember correctly).
I'm not sure if the incl is the root cause. This report targets the performance regression, not "Mommy I want addl".
If incl can still be slower, even in a few rare cases, then SlowIncDec should be kept. Compared to the 14-byte-long nops and aggressive loop unrolling, I don't feel a one- or two-byte saving deserves that level of priority at -O3.
Or, some compiler option or a "pragma use_addl" could also help (but I don't feel that's the right way to fix it).
After adding (and then removing) __builtin_expect in a few places, the performance results changed somewhat, but none came close to the level clang-8 produced. Anyway, this suggests that branch prediction and other optimizations cannot be ruled out as part of the problem.