Open Quuxplusone opened 10 years ago
Attached missing_folding.ll
(1132 bytes, application/octet-stream): IR to reproduce the problem
So, I think this is as fixed as it can be.
To fold a memory operand here *requires* domain crossing with SSE. It isn't
until AVX that we get VPERMILPS, which lets us fold the memory operand
in-domain, and the new code should be getting it there. Do you agree, Quentin?
(In reply to comment #1)
> So, I think this is as fixed as it can be.
>
> To fold a memory operand here *requires* domain crossing with SSE.
I am not sure I got that.
Are you saying that doing:
movaps (%rdi), %xmm0
shufps $27, %xmm0, %xmm0
is faster than doing:
shufps $27, (%rdi), %xmm0
I am certainly missing something, but wouldn't both code sequences have the
same domain-crossing "pattern"?
I would have expected both to run at the same speed, the advantage of the
second being that it is more compact and uses fewer pipeline resources
(decoding, etc.).
Also, the old code sequence in fact does everything in the integer domain:
pshufd $27, (%rdi), %xmm0 ## xmm0 = mem[3,2,1,0]
movdqa %xmm0, 48(%rsi) <-- movdqa instead of movaps.
That said, for this case, I do not expect the lowering to "move" the whole
chain of computation to another domain. I think this is the job of the domain
fixer (or global isel, for that matter!).
(In reply to comment #2)
> (In reply to comment #1)
> > So, I think this is as fixed as it can be.
> >
> > To fold a memory operand here *requires* domain crossing with SSE.
>
> I am not sure I got that.
> Are you saying that doing:
> movaps (%rdi), %xmm0
> shufps $27, %xmm0, %xmm0
>
> is faster than doing:
> shufps $27, (%rdi), %xmm0
No, the second has a different result from the first. The second will blend the
elements in (%rdi) with those already in %xmm0. The first replaces the contents
of %xmm0 with those in (%rdi) and then permutes them.
AVX gives you a permute operation in the floating point domain with vpermilps.
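To make the difference concrete, here is a small lane-level model of the three sequences discussed above. It is only a sketch: the helper names (decode_imm8, pshufd, shufps) and the sample vectors are made up for illustration, and it models the selection semantics only, not domains or timing.

```python
def decode_imm8(imm):
    """Split an 8-bit shuffle immediate into four 2-bit lane selectors, lane 0 first."""
    return [(imm >> (2 * i)) & 3 for i in range(4)]

def pshufd(imm, src):
    """Single-source shuffle: every result lane is selected from src."""
    sel = decode_imm8(imm)
    return [src[s] for s in sel]

def shufps(imm, src, dst):
    """Two-source shuffle (AT&T operand order: src, dst).
    The low two result lanes are selected from dst, the high two from src."""
    sel = decode_imm8(imm)
    return [dst[sel[0]], dst[sel[1]], src[sel[2]], src[sel[3]]]

# Hypothetical sample data.
mem = [10.0, 11.0, 12.0, 13.0]  # stands in for the vector at (%rdi)
old = [0.0, 1.0, 2.0, 3.0]      # stands in for whatever %xmm0 held before

# movaps (%rdi), %xmm0 ; shufps $27, %xmm0, %xmm0
load_then_shuffle = shufps(27, mem, mem)
# shufps $27, (%rdi), %xmm0
folded_shufps = shufps(27, mem, old)
# pshufd $27, (%rdi), %xmm0
folded_pshufd = pshufd(27, mem)

print(decode_imm8(27))
print(load_then_shuffle, folded_pshufd, folded_shufps)
```

In this model the load-then-shuffle and pshufd forms both yield mem[3,2,1,0], while the folded shufps takes its low two lanes from the old contents of %xmm0, which is why the memory operand cannot simply be folded into shufps.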
>
> I certainly miss something, but aren't both code sequence would have the
> same domain crossing "pattern"?
> I would have expect both to run as the same speed. The advantage of the
> second being more compact and using less resources in the pipeline
> (decoding, etc.)
>
> Also, the old code sequence in fact does everything on the integer domain:
> pshufd $27, (%rdi), %xmm0 ## xmm0 = mem[3,2,1,0]
> movdqa %xmm0, 48(%rsi) <-- movdqa instead of movaps.
>
> That said, for this case, I do not expect the lowering to "move" the whole
> chain of computation on another domain. I think this is the job of the
> domain fixer (or global isle for what matters!).
Exactly. I think operation chain domain fixing and shuffle combining should be
handled elsewhere.
(In reply to comment #3)
> (In reply to comment #2)
> > (In reply to comment #1)
> > > So, I think this is as fixed as it can be.
> > >
> > > To fold a memory operand here *requires* domain crossing with SSE.
> >
> > I am not sure I got that.
> > Are you saying that doing:
> > movaps (%rdi), %xmm0
> > shufps $27, %xmm0, %xmm0
> >
> > is faster than doing:
> > shufps $27, (%rdi), %xmm0
>
> No, the second has a different result from the first. The second will blend
> the elements in (%rdi) with those already in %xmm0. The first replaces the
> contents of %xmm0 with those in (%rdi) and then permutes them.
Ah... right! That is what I missed. I forgot that shufps uses both arguments as
input, whereas pshufd uses just one.
Sorry for the noise.
>
> AVX gives you a permute operation in the floating point domain with
> vpermilps.
>
> >
> > I certainly miss something, but aren't both code sequence would have the
> > same domain crossing "pattern"?
> > I would have expect both to run as the same speed. The advantage of the
> > second being more compact and using less resources in the pipeline
> > (decoding, etc.)
> >
> > Also, the old code sequence in fact does everything on the integer domain:
> > pshufd $27, (%rdi), %xmm0 ## xmm0 = mem[3,2,1,0]
> > movdqa %xmm0, 48(%rsi) <-- movdqa instead of movaps.
> >
> > That said, for this case, I do not expect the lowering to "move" the whole
> > chain of computation on another domain. I think this is the job of the
> > domain fixer (or global isle for what matters!).
>
> Exactly. I think operation chain domain fixing and shuffle combining should
> be handled elsewhere.
Changing the title of the PR.
Thanks for your help!!