Quuxplusone / LLVMBugzillaTest

0 stars 0 forks source link

std::clamp generates suboptimal assembly for primitive types #48878

Open Quuxplusone opened 3 years ago

Quuxplusone commented 3 years ago
Bugzilla Link PR49909
Status NEW
Importance P enhancement
Reported by Vittorio Romeo (vittorio.romeo@outlook.com)
Reported on 2021-04-09 13:06:06 -0700
Last modified on 2021-04-12 12:18:47 -0700
Version unspecified
Hardware PC Windows NT
CC ldionne@apple.com, lebedev.ri@gmail.com, llvm-bugs@lists.llvm.org, mclow.lists@gmail.com, mizvekov@gmail.com, spatel+llvm@rotateright.com, zilla@kayari.org
Fixed by commit(s) rG077bff39d46364035a5dcfa32fc69910ad0975d0
Attachments
Blocks
Blocked by PR47271
See also PR47271, PR35607

std::clamp generates poor assembly compared to hand-written counterpart for primitive types like float, even with -Ofast -ffast-math:

stdclamp(float, float, float): # @stdclamp(float, float, float) movss dword ptr [rsp - 4], xmm0 movss dword ptr [rsp - 8], xmm1 movss dword ptr [rsp - 12], xmm2 ucomiss xmm2, xmm0 lea rax, [rsp - 12] lea rcx, [rsp - 4] cmovb rcx, rax ucomiss xmm0, xmm1 lea rax, [rsp - 8] cmovae rax, rcx movss xmm0, dword ptr [rax] # xmm0 = mem[0],zero,zero,zero ret

myclamp(float, float, float): # @myclamp(float, float, float) maxss xmm0, xmm1 minss xmm0, xmm2 ret

Live example: https://godbolt.org/z/5oxvocevK

More information on: https://secret.club/2021/04/09/std-clamp.html

Quuxplusone commented 3 years ago
The problem is in SROA, #47271.
There is nothing in libc++ to fix, likely
Quuxplusone commented 3 years ago

libcxx/libstdc++ developers can maybe use __builtin_assume to fix lack of compiler knowledge to optimize this optimally...

Quuxplusone commented 3 years ago
(In reply to David Bolvansky from comment #2)
> libcxx/libstdc++ developers can maybe use __builtin_assume to fix lack of
> compiler knowledge to optimize this optimally...

Could you please expand on that?
Quuxplusone commented 3 years ago

The standard says it's undefined behaviour if hi < lo but the compiler doesn't know that.

Quuxplusone commented 3 years ago
I've fixed the weird ugliness of the needless control flow in
077bff39d46364035a5dcfa32fc69910ad0975d0.
We now end up with:

        .text
        .file   "test.cpp"
        .globl  _Z8stdclampfff                  # -- Begin function _Z8stdclampfff
        .p2align        4, 0x90
        .type   _Z8stdclampfff,@function
_Z8stdclampfff:                         # @_Z8stdclampfff
        .cfi_startproc
# %bb.0:                                # %entry
        vminss  %xmm0, %xmm2, %xmm2
        vcmpltss        %xmm1, %xmm0, %xmm0
        vblendvps       %xmm0, %xmm1, %xmm2, %xmm0
        retq
.Lfunc_end0:
        .size   _Z8stdclampfff, .Lfunc_end0-_Z8stdclampfff
        .cfi_endproc
                                        # -- End function
        .globl  _Z7myclampfff                   # -- Begin function _Z7myclampfff
        .p2align        4, 0x90
        .type   _Z7myclampfff,@function
_Z7myclampfff:                          # @_Z7myclampfff
        .cfi_startproc
# %bb.0:                                # %entry
        vmaxss  %xmm0, %xmm1, %xmm0
        vminss  %xmm0, %xmm2, %xmm0
        retq
.Lfunc_end1:
        .size   _Z7myclampfff, .Lfunc_end1-_Z7myclampfff
        .cfi_endproc
                                        # -- End function
        .section        ".linker-options","e",@llvm_linker_options
        .ident  "clang version 13.0.0 (git@github.com:LebedevRI/llvm-project.git fc530b5e5056976c6d029e67e70fd22191afbf98)"
        .section        ".note.GNU-stack","",@progbits
        .addrsig

Not really sure if we can get to vmaxss from vcmpltss+vblendvps.
Quuxplusone commented 3 years ago
There are probably multiple bugs here:
https://godbolt.org/z/asdavqGEM

We don't/can't propagate 'fast' to the 'select' instructions. See bug 35607.

On the 'myclamp' example, we do manage to get 'fast' on the selects, but we
fail to canonicalize to llvm.maxnum/minnum because the operands are reversed
from what we match with "m_OrdFMax/Min".

On the std::clamp example, the max op is split by the min op, so that requires
more pattern-matching to detect.