Changing the domain of the chain of computation allows memory folding in the shuffle.

Quuxplusone commented 10 years ago


Bugzilla Link	PR21137
Status	NEW
Importance	P normal
Reported by	Quentin Colombet (quentin.colombet@gmail.com)
Reported on	2014-10-02 15:06:56 -0700
Last modified on	2014-10-03 11:46:20 -0700
Version	trunk
Hardware	PC All
CC	andrea.dibiagio@gmail.com, chandlerc@gmail.com, llvm-bugs@lists.llvm.org
Fixed by commit(s)
Attachments	`missing_folding.ll` (1132 bytes, application/octet-stream)
Blocks
Blocked by
See also

Created attachment 13121
IR to reproduce the problem

When using the new vector shuffle lowering, some folding opportunities are
missed.

To reproduce:
llc -x86-experimental-vector-shuffle-lowering=true missing_folding.ll -o new.s
llc -x86-experimental-vector-shuffle-lowering=false missing_folding.ll -o old.s

diff -U 10 old.s new.s
-   pshufd  $27, (%rdi), %xmm0      ## xmm0 = mem[3,2,1,0]
+   movaps  (%rdi), %xmm0
+   shufps  $27, %xmm0, %xmm0       ## xmm0 = xmm0[3,2,1,0]

The shuffle instruction used in the new lowering is in a different domain than
the one used by the old lowering. Thus, we likely need to teach the backend how
to fold the memory operand for that one too.
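
As a reading aid for the `[3,2,1,0]` annotations in the diff above, here is a
minimal Python sketch of how an 8-bit shuffle immediate such as $27 packs four
2-bit lane indices. The helper names are hypothetical and are not part of LLVM;
this only models the immediate encoding shared by pshufd and shufps.

```python
def decode_shuffle_imm(imm):
    """Split an 8-bit shuffle immediate into four 2-bit lane indices.

    Index i comes from bits [2*i+1 : 2*i] of the immediate.
    """
    return [(imm >> (2 * i)) & 3 for i in range(4)]


def encode_shuffle_imm(indices):
    """Pack four lane indices (each 0..3) back into an 8-bit immediate."""
    imm = 0
    for i, idx in enumerate(indices):
        imm |= (idx & 3) << (2 * i)
    return imm


# $27 is 0b00011011, i.e. the lane-reversing mask seen in the diff.
print(decode_shuffle_imm(27))            # [3, 2, 1, 0]
print(encode_shuffle_imm([3, 2, 1, 0]))  # 27
```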

Quuxplusone commented 10 years ago

Attached missing_folding.ll (1132 bytes, application/octet-stream): IR to reproduce the problem

Quuxplusone commented 10 years ago

So, I think this is as fixed as it can be.

To fold a memory operand here *requires* domain crossing with SSE. It isn't
until AVX that we get VPERMILPS, which lets us fold the memory operand
in-domain, and the new code should be getting it there. Do you agree, Quentin?

Quuxplusone commented 10 years ago

(In reply to comment #1)
> So, I think this is as fixed as it can be.
>
> To fold a memory operand here *requires* domain crossing with SSE.

I am not sure I got that.
Are you saying that doing:
    movaps  (%rdi), %xmm0
    shufps  $27, %xmm0, %xmm0

is faster than doing:
    shufps  $27, (%rdi), %xmm0

I am certainly missing something, but wouldn't both code sequences have the same
domain-crossing "pattern"?
I would have expected both to run at the same speed, the advantage of the second
being that it is more compact and uses fewer resources in the pipeline
(decoding, etc.).

Also, the old code sequence in fact does everything in the integer domain:
    pshufd  $27, (%rdi), %xmm0      ## xmm0 = mem[3,2,1,0]
    movdqa  %xmm0, 48(%rsi) <-- movdqa instead of movaps.

That said, for this case, I do not expect the lowering to "move" the whole
chain of computation to another domain. I think this is the job of the domain
fixer (or global isel, for that matter!).

Quuxplusone commented 10 years ago

(In reply to comment #2)
> (In reply to comment #1)
> > So, I think this is as fixed as it can be.
> >
> > To fold a memory operand here *requires* domain crossing with SSE.
>
> I am not sure I got that.
> Are you saying that doing:
>   movaps  (%rdi), %xmm0
>   shufps  $27, %xmm0, %xmm0
>
> is faster than doing:
>   shufps  $27, (%rdi), %xmm0

No, the second has a different result from the first. The second will blend the
elements in (%rdi) with those already in %xmm0. The first replaces the contents
of %xmm0 with those in (%rdi) and then permutes them.

AVX gives you a permute operation in the floating point domain with vpermilps.
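
To make the difference concrete, here is a small Python model of the lane
selection performed by the three instructions discussed here. It is only a
sketch: the function names and the sample values are made up for illustration,
it models the 128-bit forms with an immediate operand, and it says nothing
about execution domains or timing.

```python
def lane_indices(imm):
    """Four 2-bit lane indices packed in an 8-bit shuffle immediate."""
    return [(imm >> (2 * i)) & 3 for i in range(4)]


def pshufd(imm, src):
    """pshufd $imm, src, dst: every result lane is picked from src only."""
    return [src[s] for s in lane_indices(imm)]


def vpermilps(imm, src):
    """vpermilps $imm, src, dst (128-bit): same single-source selection."""
    return [src[s] for s in lane_indices(imm)]


def shufps(imm, src, dst):
    """shufps $imm, src, dst: low two lanes come from the OLD dst,
    high two lanes come from src."""
    s = lane_indices(imm)
    return [dst[s[0]], dst[s[1]], src[s[2]], src[s[3]]]


mem = [10.0, 11.0, 12.0, 13.0]      # stands in for the data at (%rdi)
old_xmm0 = [0.0, 1.0, 2.0, 3.0]     # whatever %xmm0 held before

# movaps (%rdi), %xmm0 ; shufps $27, %xmm0, %xmm0
loaded = list(mem)
print(shufps(27, loaded, loaded))   # [13.0, 12.0, 11.0, 10.0]

# pshufd $27, (%rdi), %xmm0  and  vpermilps $27, (%rdi), %xmm0
print(pshufd(27, mem))              # [13.0, 12.0, 11.0, 10.0]
print(vpermilps(27, mem))           # [13.0, 12.0, 11.0, 10.0]

# shufps $27, (%rdi), %xmm0 mixes in the old register contents
print(shufps(27, mem, old_xmm0))    # [3.0, 2.0, 11.0, 10.0]
```

In this model the folded shufps only matches the load-then-shuffle sequence when
%xmm0 already holds the same data as (%rdi), which is why the memory operand
cannot simply be folded into it.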

>
> I am certainly missing something, but wouldn't both code sequences have the
> same domain-crossing "pattern"?
> I would have expected both to run at the same speed, the advantage of the
> second being that it is more compact and uses fewer resources in the pipeline
> (decoding, etc.).
>
> Also, the old code sequence in fact does everything in the integer domain:
>   pshufd  $27, (%rdi), %xmm0      ## xmm0 = mem[3,2,1,0]
>   movdqa  %xmm0, 48(%rsi) <-- movdqa instead of movaps.
>
> That said, for this case, I do not expect the lowering to "move" the whole
> chain of computation to another domain. I think this is the job of the
> domain fixer (or global isel, for that matter!).

Exactly. I think operation chain domain fixing and shuffle combining should be
handled elsewhere.

Quuxplusone commented 10 years ago

(In reply to comment #3)
> (In reply to comment #2)
> > (In reply to comment #1)
> > > So, I think this is as fixed as it can be.
> > >
> > > To fold a memory operand here *requires* domain crossing with SSE.
> >
> > I am not sure I got that.
> > Are you saying that doing:
> >     movaps  (%rdi), %xmm0
> >     shufps  $27, %xmm0, %xmm0
> >
> > is faster than doing:
> >     shufps  $27, (%rdi), %xmm0
>
> No, the second has a different result from the first. The second will blend
> the elements in (%rdi) with those already in %xmm0. The first replaces the
> contents of %xmm0 with those in (%rdi) and then permutes them.

Ah... right! That is what I missed. I forgot that shufps uses both arguments as
inputs, whereas pshufd uses just one.

Sorry for the noise.

>
> AVX gives you a permute operation in the floating point domain with
> vpermilps.
>
> >
> > I am certainly missing something, but wouldn't both code sequences have the
> > same domain-crossing "pattern"?
> > I would have expected both to run at the same speed, the advantage of the
> > second being that it is more compact and uses fewer resources in the
> > pipeline (decoding, etc.).
> >
> > Also, the old code sequence in fact does everything in the integer domain:
> >     pshufd  $27, (%rdi), %xmm0      ## xmm0 = mem[3,2,1,0]
> >     movdqa  %xmm0, 48(%rsi) <-- movdqa instead of movaps.
> >
> > That said, for this case, I do not expect the lowering to "move" the whole
> > chain of computation to another domain. I think this is the job of the
> > domain fixer (or global isel, for that matter!).
>
> Exactly. I think operation chain domain fixing and shuffle combining should
> be handled elsewhere.

Changing the title of the PR.

Thanks for your help!!

Quuxplusone / LLVMBugzillaTest

Changing the domain of the chain of computation allows memory folding in the shuffle. #21136