Quuxplusone / LLVMBugzillaTest


extra memory load in loop #27739

Closed Quuxplusone closed 3 years ago

Quuxplusone commented 8 years ago
Bugzilla Link PR27740
Status RESOLVED FIXED
Importance P normal
Reported by Carrot (carrot@google.com)
Reported on 2016-05-13 16:51:44 -0700
Last modified on 2020-12-28 07:12:02 -0800
Version trunk
Hardware PC Linux
CC davidxl@google.com, dberlin@dberlin.org, echristo@gmail.com, ehsanamiri@gmail.com, florian_hahn@apple.com, george.burgess.iv@gmail.com, hfinkel@anl.gov, kit.barton@gmail.com, llvm-bugs@lists.llvm.org, mkuper@google.com, nemanja.i.ibm@gmail.com, sanjoy@playingwithpointers.com
Fixed by commit(s)
Attachments ExampleForComment20.txt (3026 bytes, text/plain)
Blocks
Blocked by
See also
Compile following code with trunk llvm

$ ~/llvm/obj2/bin/clang --target=powerpc64le-grtev4-linux-gnu -O2 -fno-unroll-loops -c t17.c -save-temps

struct s{
  int f0;
  int f1;
  float f2;
  float f3;
  float f4;
};

static void bar(struct s *p1, struct s *p2)
{
  if (p1->f0 && p1->f3 < p2->f3)
     p1->f3 = p2->f3;
  else
     p1->f3 = p2->f3;    // p1->f4 = p2->f4;
}

void foo(struct s *p1, struct s *p2, int size)
{
  for (int i=0; i<size; i++)
  {
    p1->f1 += p2->f1;
    p1->f2 -= p2->f2;
    bar(p1, p2);
    p2++;
  }
}

LLVM generates the following code for the loop body:

.LBB0_2:                                # %for.body
                                        # =>This Inner Loop Header: Depth=1
        lwzu 9, 20(4)          // A1
        lxsspx 1, 3, 7         //    B1
        lwz 10, 4(3)           // A2
        lxsspx 0, 4, 6         //    B2
        add 9, 10, 9           // A3
        stw 9, 4(3)            // A4
        xssubsp 0, 1, 0        //    B3
        stxsspx 0, 3, 7        //    B4
        bc 12, 2, .LBB0_5
# BB#3:                                 # %land.lhs.true.i
                                        #   in Loop: Header=BB0_2 Depth=1
        lxsspx 0, 4, 7
        lxsspx 1, 3, 8
        fcmpu 1, 1, 0
        bge 1, .LBB0_6
# BB#4:                                 # %if.then.i
                                        #   in Loop: Header=BB0_2 Depth=1
        stxsspx 0, 3, 8
        b .LBB0_8
        .p2align        4
.LBB0_5:                                # %entry.if.else_crit_edge.i
                                        #   in Loop: Header=BB0_2 Depth=1
        lwz 9, 8(4)
        b .LBB0_7
        .p2align        4
.LBB0_6:                                # %land.lhs.true.if.else_crit_edge.i
                                        #   in Loop: Header=BB0_2 Depth=1
        xscvdpspn 0, 0
        xxsldwi 0, 0, 0, 3
        mfvsrwz 9, 0
.LBB0_7:                                # %if.else.i
                                        #   in Loop: Header=BB0_2 Depth=1
        stw 9, 12(3)
.LBB0_8:                                # %bar.exit
                                        #   in Loop: Header=BB0_2 Depth=1
        addi 5, 5, -1
        bdnz .LBB0_2

The statement p1->f1 += p2->f1 is translated to A1, A2, A3, A4; the statement
p1->f2 -= p2->f2 is translated to B1, B2, B3, B4. Inside the loop, all stores go
through pointer p1 only, so although p1 may alias p2, it is safe to keep copies
of p1->f1 and p1->f2 in registers, and A2 and B1 can be removed from the loop.

In the source code, if I replace one of the "p1->f3 = p2->f3" assignments with
"p1->f4 = p2->f4", then LLVM generates the expected optimized code.
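As an aside (my illustration, not part of the report): the requested optimization can be written out by hand in C. A hand-scalarized version of foo() keeps p1->f1 and p1->f2 in locals, so the per-iteration stores remain but the re-loads (A2/B1) disappear. The names foo_ref and foo_scalarized are hypothetical; this is a sketch of the expected transformation, not what the compiler literally emits:

```c
#include <assert.h>
#include <string.h>

struct s { int f0; int f1; float f2; float f3; float f4; };

/* Reference loop, as in the report. */
static void bar(struct s *p1, struct s *p2)
{
  if (p1->f0 && p1->f3 < p2->f3)
    p1->f3 = p2->f3;
  else
    p1->f3 = p2->f3;
}

void foo_ref(struct s *p1, struct s *p2, int size)
{
  for (int i = 0; i < size; i++) {
    p1->f1 += p2->f1;
    p1->f2 -= p2->f2;
    bar(p1, p2);
    p2++;
  }
}

/* Hand-scalarized form: p1->f1 and p1->f2 live in locals, so each
 * iteration stores them (A4/B4 stay) but never re-loads them (A2/B1
 * disappear).  This is safe even if p1 aliases p2, because every store
 * in the loop goes through p1 and the locals mirror what was just
 * stored. */
void foo_scalarized(struct s *p1, struct s *p2, int size)
{
  if (size <= 0) return;        /* guard the hoisted loads */
  int   f1 = p1->f1;
  float f2 = p1->f2;
  for (int i = 0; i < size; i++) {
    f1 += p2->f1;  p1->f1 = f1;
    f2 -= p2->f2;  p1->f2 = f2;
    bar(p1, p2);
    p2++;
  }
}
```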
Quuxplusone commented 8 years ago

In the optimizable case, the elimination of the extra load occurs in the "Global Value Numbering" pass.

Quuxplusone commented 8 years ago
Problem:

Look at the selected part of IR below:

for.body.lr.ph:
  %.pre6.i = getelementptr inbounds %struct.s, %struct.s* %p1, i64 0, i32 3
  %f3.i = getelementptr inbounds %struct.s, %struct.s* %p1, i64 0, i32 3

for.body:
  %2 = load i32, i32* %f11, align 4, !tbaa !7

if.else.i:                                        ; preds = %land.lhs.true.if.else_crit_edge.i, %entry.if.else_crit_edge.i
  %f35.pre-phi.i = phi float* [ %.pre6.i, %entry.if.else_crit_edge.i ], [ %f3.i, %land.lhs.true.if.else_crit_edge.i ]
  %9 = bitcast float* %f35.pre-phi.i to i32*
  store i32 %8, i32* %9, align 4, !tbaa !9

The problem is that in its first iteration GVN thinks the load in for.body is
clobbered by the store in if.else.i. Later in that first iteration, we remove
the redundant phi in if.else.i. Ideally the next iteration of GVN should look
at the cleaned-up code and realize that the store in if.else.i does not clobber
the load. Unfortunately, at this point we are still using information cached
during the first iteration of GVN, so we keep that load around. There are two
different caches involved here: one is the memdep analyzer cache (which records
the relation between the load and the store), and the other is the alias
analysis cache (which thinks %9 and %f11 may alias).
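The staleness hazard described here can be shown with a toy cache (my illustration; nothing LLVM-specific, all names made up): a dependence answer computed against the un-simplified IR survives in the cache after the phi is gone, so later queries still see the old "clobbers" verdict.

```python
# Toy illustration: a memoized dependence query that is not
# invalidated when the IR it was computed from changes.
dep_cache = {}

def may_clobber(store_ptr, load_ptr, phi_simplified):
    key = (store_ptr, load_ptr)
    if key not in dep_cache:
        # Before phi removal the store's pointer is an opaque phi,
        # so the analysis must conservatively answer "may clobber".
        dep_cache[key] = not phi_simplified
    return dep_cache[key]

# First GVN iteration: phi not yet simplified -> conservative answer.
assert may_clobber("%9", "%f11", phi_simplified=False) is True
# Second iteration: the phi is gone, and recomputing would prove
# "no clobber" -- but the cached entry still says True.
assert may_clobber("%9", "%f11", phi_simplified=True) is True
```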
Quuxplusone commented 8 years ago
Possible solutions:

Invalidating the relevant parts of the caches involved seems difficult and tricky (especially given that there are two caches involved). So it might be easier to teach BasicAliasAnalysis to recognize this pattern, which looks like a fairly simple change in BasicAAResult::aliasGEP.

What we are interested in is whether %f11 and %9 are aliased or not. %f11 is a GEP, so we end up in BasicAAResult::aliasGEP. What we currently fail to do there is realize that %9 is also, in effect, a GEP. We already know that instead of %9 we have to look at %f35.pre-phi.i; all we need to do is call llvm::SimplifyInstruction on %f35.pre-phi.i to prove that it is actually a GEP. The rest of the work can be done with the current logic of BasicAAResult::aliasGEP. (A finalized patch might be written a bit more generally, calling llvm::SimplifyInstruction for both V1 and V2 before we reach BasicAAResult::aliasGEP, but that is an implementation detail.)
Quuxplusone commented 8 years ago
Correction of the typo in Comment #2:
Later in the first iteration of *GVN*, we remove the redundant phi in if.else.i

I am adding the relevant part of the IR again, because the definition of %f11 was missing:

for.body.lr.ph:
  %f11 = getelementptr inbounds %struct.s, %struct.s* %p1, i64 0, i32 1
  %.pre6.i = getelementptr inbounds %struct.s, %struct.s* %p1, i64 0, i32 3
  %f3.i = getelementptr inbounds %struct.s, %struct.s* %p1, i64 0, i32 3

for.body:
  %2 = load i32, i32* %f11, align 4, !tbaa !7

if.else.i:
  %f35.pre-phi.i = phi float* [ %.pre6.i, %entry.if.else_crit_edge.i ],
                              [ %f3.i, %land.lhs.true.if.else_crit_edge.i ]

  %9 = bitcast float* %f35.pre-phi.i to i32*
  store i32 %8, i32* %9, align 4, !tbaa !9
Quuxplusone commented 8 years ago
So to start, I'm not sure what you mean by "and the other one is alias analysis
cache (which thinks %9 and %f11 may alias)."

If you mean "aliasCache" in BasicAliasAnalysis, this is not a real cache; it's
really a visited set, and it is cleared at the end of each query. If you are
finding results in there still being used between queries, that's, IMHO, a bug,
because BasicAA is stateless.

If you are suggesting that BasicAA call SimplifyInstruction, that seems like a
really bad idea.  It is meant to be simple and stateless, and
simplifyinstruction is stateful and expensive (though you can pass it no state)
:)

I would not try to fix this in BasicAA, i would either fix the cache
invalidation, or accept that this won't get fixed till GVN is rewritten/uses
memoryssa (which will invalidate the right caches for you automatically).

I'm completely unsure why you think invalidation of the relevant parts of the
cache are tricky, it should be *already happening*.
The fact that it isn't is, IMHO, a bug.
Quuxplusone commented 8 years ago

Thanks for the information. The reason that I suggested BasicAA call SimplifyInstruction is that this relationship, in an indirect form, already exists. BasicAA calls GetUnderlyingObject which calls SimplifyInstruction.

I will get back to you about the other comments.

Quuxplusone commented 8 years ago
(In reply to comment #5)
>
> If you are suggesting that BasicAA call SimplifyInstruction, that seems like
> a really bad idea.  It is meant to be simple and stateless, and
> simplifyinstruction is stateful and expensive (though you can pass it no
> state) :)

To what state are you referring? It is not clear that it is expensive relative
to other things that BasicAA already does.

>
Quuxplusone commented 8 years ago

BasicAA is actually pretty darn expensive already, so saying it's not expensive relative to what BasicAA already does may be true, but it doesn't make it cheap.

(The state is the assumptioncache, etc)

I am also pretty against adding more patching over to BasicAA that adds more recursive backwards walking (like we are suggesting, recursively figuring out that this thing is really, secretly, a GEP by proving value equivalence), which in turn makes BasicAA even more expensive.

I pretty much draw the line at proving the value equivalence of phi arguments. That is pretty much something else's job (and not even simplifyinstruction :P).

If we really care about getting that case, we should just do it right for once. Because, walking backwards means we are going to have to limit how deep it goes, and end up with the same problem we always do where we start getting different basicaa answers depending on how many lines of code/how deep of phi nodes you have.

How we do it, i can think of tons of good ways: Have VN let an alias analysis query it for results on equivalence, etc.

But having BasicAA try to go and discover this itself .. that seems a bit much. Yes, we do it in some cases. But as i said, this is precisely what makes BasicAA so expensive - we've built half a value numbering framework into it to try to figure out what things are really the same.

Except we've made that framework stateless, and walk in the wrong direction :)

Quuxplusone commented 8 years ago
(In reply to comment #8)
> BasicAA is actually pretty darn expensive already, so saying it's not
> expensive relative to what BasicAA already does may be true, but it doesn't
> make it cheap.

No argument.

>
> (The state is the assumptioncache, etc)

Okay, that doesn't count :-)

>
> I am also pretty against adding more patching over to BasicAA that adds more
> recursive backwards walking (like we are suggesting, recursively figuring
> out that this thing is really, secretly, a GEP by proving value
> equivalence), which in turn makes BasicAA even more expensive.
>
> I pretty much draw the line at proving the value equivalence of phi
> arguments.  That is pretty much something else's job (and not even
> simplifyinstruction :P).
>
> If we really care about getting that case, we should just do it right for
> once. Because, walking backwards means we are going to have to limit how
> deep it goes, and end up with the same problem we always do where we start
> getting different basicaa answers depending on how many lines of code/how
> deep of phi nodes you have.
>
> How we do it, i can think of tons of good ways: Have VN let an alias
> analysis query it for results on equivalence, etc.
>
> But having BasicAA try to go and discover this itself .. that seems a bit
> much.  Yes, we do it in some cases. But as i said, this is precisely what
> makes BasicAA so expensive - we've built half a value numbering framework
> into it to try to figure out what things are really the same.
>
> Except we've made that framework stateless, and walk in the wrong direction
> :)

BasicAA and ValueTracking (and BasicAA uses ValueTracking too) should both be
on someone's hit list for this reason.
Quuxplusone commented 8 years ago

This is why I stopped investigating GVN. When we remove an instruction I in GVN, we update the memory dependence caches for the removed instruction. For our case, we need to update the caches for a user of a user of I. (This is the third statement in if.else.i, which is a user of %9, which is a user of the removed phi instruction.) The dependency that is cached, and that we need to get rid of, is the one between the store to %9 and the load of %2.

This, combined with the potential need to invalidate the alias analysis cache (which apparently was not really an issue), made it look like a difficult path, while changing BasicAA looked simple. I was not aware of the issues in BasicAA.

Investigating GVN and the memdep analyzer for possibly doing what I mentioned in the first paragraph seems time-consuming, so I would like to ask people familiar with GVN to comment on the correctness of this approach.

Another question: is there any good reason that InstCombine does not remove this phi?
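The "user of a user" problem above can be sketched as a transitive walk over use chains (a toy graph of my own, not the real memdep API): invalidating only the removed instruction's own cache entry misses entries keyed on its users.

```python
# Toy use-graph: value -> list of users.  Removing the phi should
# invalidate cached results for %9 (its user) and for the store
# (a user of %9), transitively.
users = {
    "%phi":  ["%9"],
    "%9":    ["store"],
    "store": [],
}
cache = {"%phi": "dep-A", "%9": "dep-B", "store": "dep-C"}

def invalidate_transitively(value):
    """Drop cached info for value and, recursively, all its users."""
    cache.pop(value, None)
    for u in users.get(value, []):
        invalidate_transitively(u)

invalidate_transitively("%phi")
assert cache == {}   # all three stale entries are gone
```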

Quuxplusone commented 8 years ago

From what I heard from Hal, new GVN is likely to get this case. The source is not in-tree yet.

Daniel: is there any ETA for checking NewGVN into the tree? If it is in tree, we can test whether it resolves this issue.

Quuxplusone commented 8 years ago
So, NewGVN, by itself, does not do PRE. PRE is implemented on top of the GVN
analysis results.

NewGVN itself is in okay shape (though it needs to be split into committable
pieces, it works, is as fast as i want it, and does everything it is supposed
to)

PRE on top of it is a complete prototype. I've built a lot of PRE
implementations for various compilers, so i can make them work pretty quickly.
The one i have hacked up on top of NewGVN handles all the cases current PRE
does, but it's just something hacked up to make sure stuff is working. It's not
committable in any way, shape, or form.

This is a long way of saying:
NewGVN, if i assigned someone tomorrow to work on getting it in tree, my guess
is it would take a month to get in tree.

PRE on top of it, because it is now a single PRE (GVN currently has two PRE
implementations, one for loads, one for scalars), would need work and tuning,
whether done in-tree or out of tree.
In fact, it is  pretty much 100% inevitable that a new PRE implementation that
is better than the current one will cause significant performance regressions
until other parts of the compiler are tuned as well.

It is likely i could get the algorithms engineered right pretty quickly for
someone else to pick up the work, but even if it was in-tree, it would be a
while until someone could turn it on by default.
Quuxplusone commented 8 years ago
(In reply to comment #10)
> This is why I stopped investigating GVN. When we remove an instruction I in
> GVN we update memory dep caches for the removed instruction. For our case,
> we need to update the caches for a user of a user of I. (This is the third
> statement in if.else.i which is a user of %9, which is a user of the removed
> phi instruction).  The dependecy that is cached and we need to get rid of is
> the one between the store to %9 and load of %2.

This seems like the cache is super-broken then.
Is this the documented behavior?
Because if you invalidate the result for I, it should be invalidating results
that depend on I, and this is one of those results :)

If for some reason, this is the documented behavior, the right solution still
doesn't seem that difficult.

GVN is calling replaceAllUsesWith to replace the removed instruction with the
new PRE PHI.
In that walk over uses, it should invalidate cached pointer info for all the
uses :)

This should invalidate %9.

>
> This, combined with the potential need to invalidate alias analysis cache
> (Which apparently was not really an issue) made it look a difficult path and
> on the other hand changing BasicAA looked simple. I was not aware of the
> issues in Basic AA.
>
> Investigating GVN and mem dep analyzer for possibly doing what I mentioned
> in the first paragraph seems to be time-consuming. So I would like to ask
> people familiar with GVN to comment on correctness of this approach.
>
> Another question: is there any good reason that InstCombine does not remove
> this phi?
Quuxplusone commented 8 years ago
(In reply to comment #13)
> GVN is calling replaceAllUsesWith to replace the removed instruction with
> the new PRE PHI.
> In that walk over uses, it should invalidate cached pointer info for all the
> uses :)
>
> This should invalidate %9.
>

Daniel,

It seems to be more complicated than that. I see two problems.

1- After removing an instruction in GVN we call Value::replaceAllUsesWith to
update the users, so currently we do not invalidate cached pointers for the
users. It is possible to change the code to do that, but potentially we could
invalidate too aggressively and so increase compile time compared to what it is
today (for example, when GVN removes a redundant bitcast of a pointer).

2- Even if we do (1), it won't fix our problem. The root cause here is that
ReverseNonLocalDeps is a map whose key is Instruction *. So the key here is the
pointer to the store instruction not the store location (%9) as we are
interested in. The key of this entry in NonLocalPointerDeps is also the address
from which we load %2 (which is %f11) and we have no information about that
when we remove phi instruction. It might be possible to augment the cache data
structures with new information to invalidate the entry that we want.

I believe we need to fix both (1) and (2) to fix this issue. This is a big
change, and since this is my first time looking at GVN and mem dep analysis I
would like someone familiar with this part of code to comment on this, before I
proceed.
Quuxplusone commented 8 years ago
(In reply to comment #14)
> (In reply to comment #13)
> > GVN is calling replaceAllUsesWith to replace the removed instruction with
> > the new PRE PHI.
> > In that walk over uses, it should invalidate cached pointer info for all the
> > uses :)
> >
> > This should invalidate %9.
> >
>
> Daniel,
>
> It seems to be more complicated than that. I see two problems.
>
> 1- After removing an instruction in GVN we call Value:ReplaceAllUseWith to
> update the users. So currently we do not invalidate cached pointers for the
> users.
Because memdep is, IMHO, supposed to take care of this on its own.

> It is possible to change the code to do that, but potentially we can
> invalidate too aggressively and so increase the compile time compared to
> what it is today (for example when GVN removes a redundant bitcast of a
> pointer)

Correctness trumps compile time.
Either memdep should be invalidating, or GVN should be invalidating, but
somebody needs to do it.

Looking at code comments, i suspect the answer is memdep should be doing this.

That is also the approach we take in memoryssa cache invalidation ATM - for
invalidation when stores get removed, we blow away the entire cache, because we
aren't tracking dependencies closely enough to give correct answers if we don't.
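The conservative scheme described here can be mimicked in a toy form (my illustration, not the MemorySSA code): on removal of any def, drop the whole cache rather than try to track which entries depend on it.

```python
# Toy cache with the "blow it all away" invalidation policy: correct
# by construction, at the price of recomputing everything after any
# store removal.
class DepCache:
    def __init__(self):
        self.entries = {}

    def lookup(self, key, compute):
        if key not in self.entries:
            self.entries[key] = compute(key)
        return self.entries[key]

    def on_def_removed(self):
        # No fine-grained dependency tracking -> flush everything.
        self.entries.clear()

c = DepCache()
c.lookup("load-a", lambda k: "clobbered-by-store")
c.on_def_removed()
assert c.entries == {}    # next lookup recomputes from scratch
```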

>
> 2- Even if we do (1), it won't fix our problem. The root cause here is that
> ReverseNonLocalDeps is a map whose key is Instruction *. So the key here is
> the pointer to the store instruction not the store location (%9) as we are
> interested in. The key of this entry in NonLocalPointerDeps is also the
> address from which we load %2 (which is %f11) and we have no information
> about that when we remove phi instruction. It might be possible to augment
> the cache data structures with new information to invalidate the entry that
> we want.

Again, i'm going to state, flat out, that if the cache does not return correct
answers in the face of invalidation, either the caller of the invalidation API
is doing the wrong thing (IE not calling it on enough stuff), or the
invalidation is not invalidating enough of the cache.

This is true *even* if the answer is "any time you call invalidate, it destroys
the entire cache". It needs to work properly first and foremost; we should not
hack around it.

>
> I believe we need to fix both (1) and (2) to fix this issue. This is a big
> change, and since this is my first time looking at GVN and mem dep analysis
> I would like someone familiar with this part of code to comment on this,
> before I proceed.

I think your analysis is pretty spot on. IMHO, it seems like memdep is *really*
broken here.

At this point, you should bring this up on the mailing list to discuss the
issue further, so others can be alerted and comment, because this is a
significantly larger issue than expected (because, as i said, it seems memdep
is super-broken here)
Quuxplusone commented 8 years ago
> Either memdep should be invalidating, or GVN should be invalidating, but
> somebody needs to do it.

Yes, I forgot to mention that memdep also does not look at the users of the
instruction being removed.

>
> This is true *even* if the answer is "any time you call invalidate, it
> destroys the entire cache". It needs to work properly, first and foremost
> we should not hack around it.
>

So one potential solution is to use mem location or something derived from it
as the reverse map key. I will check if this may work or not.

>
> At this point, you should bring this up on the mailing list to discuss the
> issue further, so others can be alerted and comment, because this is a
> significantly larger issue than expected (because, as i said, it seems
> memdep is super-broken here)

Sure.
Quuxplusone commented 8 years ago
(In reply to comment #16)

As a meta-comment, one thing i'm *seriously* concerned about if we were to just
try to say the current situation is "okay" is that it means preservation of a
pass (gvn preserves memdep) does not imply that the pass will give as good an
answer as if you ran it again :)

The problem with such a scheme is that it introduces its own wonderful set of
pass-ordering issues, where now pass ordering doesn't just depend on the order
of passes and the optimizations they perform, but on "how well they preserve a
given analysis that they claim they preserve".

As a compromise, if we can't make this sane pretty easily, my suggestion would
be that we *not* fix this bug for now, put huge warnings on memdep's current
API about it being deprecated, and then run screaming away from memdep as fast
as we can.
Quuxplusone commented 8 years ago

Daniel

Independent of the problem reported here, how practical is the idea of migrating current GVN to use MemorySSA instead of memdep?

Quuxplusone commented 8 years ago
(In reply to comment #18)
> Daniel
>
> Independent of the problem reported here, how practical is the idea of
> migrating current GVN to use MemorySSA instead of memdep?

It's theoretically possible, but completely objectively, it would
A. Be essentially half the work of writing NewGVN, again, from scratch :)
B. likely be complex and significantly slow down GVN.

A lot of the work of NewGVN was, in fact, figuring out good ways to reformulate
the memdep queries that GVN currently does to not use memdep (the other parts
were figuring out handling equivalences without actual elimination)

No matter how you slice it, it would essentially mean moving most of NewGVN
into GVN.

It's basically "NewGVN, with the data-structure changes but not the algorithm
change".
Quuxplusone commented 8 years ago

I am probably missing something, but I don't see how the memdep problem can be entirely fixed. In the example above, assume we had a pointer %10 which was a user of %9. After removing the phi, we have more accurate aliasing information for %9, which may result in more accurate aliasing for %10. We know that we want to invalidate cached information for %9 (a user of the removed phi); we probably also want to invalidate cached information for %10 (a user of a user of the removed phi). How far should we go to guarantee removal of all stale information from the cache?

Quuxplusone commented 8 years ago
(In reply to comment #20)
> I am probably missing something, but I don't see how memdep problem can be
> entirely fixed.

Sure it can.

> In the example above, assume we had a pointer %10 which was
> a user of %9. After removing phi, we have more accurate aliasing information
> for %9, which may result in more accurate aliasing for %10.

I need an explicit example.
However, you only end up having to be recursive upon removal of stores (or
rather, defs, since certain types of atomic loads may qualify)

> We know that we
> want to invalidate cached information for %9 (which is a user of removed
> phi).  We probably also want to invalidate cached information for %10 (which
> is a user of a user of removed phi).

Yes

>  How far should we go to be able to
> guarantee removal of all stale information from the cache?

As far as the dependency chain goes.
Removing stores is rare.

For memoryssa, as i said, we just blow away the *entire cache* on store removal
right now.

You can also just track dependencies.

If you don't want to pay the price of having constant incremental update cost,
it's possible to have a batch update API.

But, as I said, playing with stores is much more rare than playing with loads.

GVN is also pretty much the only thing that *iterates* it.
Quuxplusone commented 8 years ago

Attached ExampleForComment20.txt (3026 bytes, text/plain): An explicit example for the issue raised in comment 20

Quuxplusone commented 8 years ago

Also, at this point I am not sure if/how this impacts call dependencies and their cache data structures. That is something else for me to investigate.

Quuxplusone commented 8 years ago
Let me ask two questions to make sure I understand your comments.

1- I definitely agree that fixing memdep to have a fully valid cache is going
to be complicated (if that is what you mean). In that case, are you suggesting
(as you said in comment 17) that we deprecate memdep and stop using it?
2- I am not familiar with MemorySSA, but assume (just for the sake of this
question) that we started using MemorySSA in current GVN. Is the MemorySSA
cache invalidated so that it does not contain stale dependency information,
even in the presence of examples like the one I attached?

To be clear, I do not have any particular interest in getting the optimization
for the attached example. My understanding was that we are talking about fixing
the memdep cache so it is always invalidated properly and does not contain any
stale information, so I constructed this example as an edge case that will be
hard to fix.
Quuxplusone commented 8 years ago
(In reply to comment #25)
> Let me ask two questions to make sure I understand your comments.
>

Sure!

> 1- I definitely agree that fixing memdep to have a fully valid cache is
> going to be complicated. (If that is what you mean).

Yes.

>  In that case are you
> suggesting that (as you said in comment 17) we deprecate memdep and stop
> using it?
Yes, at the very least for optimizations like this, and at most for everything.

> 2- I am not familiar with MemorySSA, but assume (just for the sake of this
> question) that we started using MemorySSA in current GVN. Is MemorySSA cache
> invalidated, so that it does not contain stale dependency information even
> in the presence of examples like the one I attached ?

Answering this requires a bit of background:

MemorySSA is an SSA form for memory.
There is no caching inherent in it.  It's an SSA form that represents memory,
and memory defining/using operations.

The SSA form itself is built, it has no caching, it has def-use chains.  So,
for example, a load is a MemoryUse and a store is a MemoryDef.  Each def
defines a memory state, the uses use some version of the memory state.

It is a may-def/may-use form, so "false" aliasing may be present.  It does some
disambiguation upfront for loads, but not for stores.

To give a concrete example, given

(b and a are unaliased; b's memory is defined somewhere before the start of the
function)
store a
load a
load b

It will produce:

1 = memorydef(liveOnEntry)
store a
MemoryUse(1)
load a
MemoryUse(liveOnEntry)
load b

So you can see that load a is linked to store a, and load b is not aliased with
anything above it, and thus is said to be using the memory state that is live
on entry to the function.

There are more nuances (and the form is deliberately not as precise as
possible), but this is enough background to answer the question. MemorySSA.h
goes into more detail.  :)

On top of the form itself (which is useful for a lot of purposes), it provides
a default set of "walkers", that can walk the MemorySSA form and disambiguate
things further for you if you want.

That is, they know how to walk def-use chains and bypass false aliasing to
answer various queries.

So you can say "give me the thing that clobbers this load", and it will figure
it out and return it to you.

You can also just follow the use-def chain, but as i said, that is deliberately
not as precise. Depending on how much disambiguation is needed, it may or may
not already prove the thing you want.

(IE if you are trying to see if you can hoist a load to block X, it may already
tell you that it's defined by a store high enough in the dominator tree that
you can move the load. If so, you don't need to go trying to figure out further
whether that store really kills that load.  If you want to know how high you
can hoist the load, you do need to try to figure that out)

The default walker you get does caching internally for you, and caches various
things to make queries go fast.

You don't have to do this, you can use a walker that does *no* caching.  You
can pay the price of having slower queries for having always up to date info.

You can also easily build a walker that has additional knowledge, such as GVN
has, so that it uses knowledge about equivalence to further disambiguate
loads/stores.

etc

The walkers present a very simple interface, and their implementation details
are up to them. We just provide a default set of walkers that cover a few use
cases well.
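The numbering in the example above can be mimicked with a toy sketch (my illustration only; real MemorySSA lives in llvm/Analysis/MemorySSA.h and does not disambiguate this aggressively at build time): each store creates a new memory version, each load records the version it reads, and a walker skips defs that don't alias the loaded location to find the clobbering access.

```python
# Toy memory-SSA for a straight-line block: give every store a new
# memory version, and for each load find its clobbering def by
# walking back over defs, skipping stores to non-aliasing locations.
LIVE_ON_ENTRY = 0

def build_and_walk(insts, may_alias):
    version = LIVE_ON_ENTRY
    defs = {LIVE_ON_ENTRY: None}     # version -> stored location
    clobbers = {}
    for op, loc in insts:
        if op == "store":
            version += 1
            defs[version] = loc      # MemoryDef: new memory state
        else:  # load
            v = version
            while v != LIVE_ON_ENTRY and not may_alias(defs[v], loc):
                v -= 1               # skip non-aliasing defs
            clobbers[("load", loc)] = v   # MemoryUse(v)
    return clobbers

# The example from the comment: a and b do not alias.
insts = [("store", "a"), ("load", "a"), ("load", "b")]
c = build_and_walk(insts, lambda x, y: x == y)
assert c[("load", "a")] == 1              # MemoryUse(1): the store to a
assert c[("load", "b")] == LIVE_ON_ENTRY  # MemoryUse(liveOnEntry)
```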

>
> To be clear, I do not have any particular interest in getting the
> optimization for the attached example. My understanding was that we are
> talking about fixing memdep cache so it is always invalidated properly and
> do not contain any stale information. So I constructed this example as an
> edge case which will be hard to fix.

Understood. I greatly appreciate it.  It does tell me it's probably not worth
fixing it if these are the cases we want to get :)
Quuxplusone commented 8 years ago
(In reply to comment #26)
>
> >  In that case are you
> > suggesting that (as you said in comment 17) we deprecate memdep and stop
> > using it?
> Yes, at the very least for optimizations like this, and at most for
> everything.
>
>

Just double-checking to avoid any miscommunication: so we do not want to fix
the problem reported in this PR by "fixing" the memdep cache (say, by looking
only at the first one or two levels of uses of values impacted by GVN). Is that
correct?
Quuxplusone commented 8 years ago
I looked at BasicAA again. I want to make a change in my original proposal.
Below I explain this.

The fundamental problem in BasicAA that causes this issue is the order in which
aliasCheck calls aliasGEP and aliasPHI. I tried to reorder them (aliasPHI
first, then aliasGEP), but that results in loss of precision in some cases. No
matter in which order we call these, we will lose precision for some cases. The
current order creates problems such as this bug; if we reorder them, we lose
precision for examples like this:

%x.tr = phi %struct.a* [ %x, %entry ], [ null, %land.lhs.true ]
%mode = getelementptr inbounds %struct.a, %struct.a* %x.tr, i32 0, i32 1

I think this is a more complicated problem that also applies to the aliasGEP
and aliasSelect ordering (and maybe other function call orderings). So I go
back to simpler solutions.

We call SimplifyInstruction quite frequently, indirectly, from aliasCheck, and
there are ways to remove some of those calls (we may call GetUnderlyingObject
for the same value multiple times during recursive invocations of aliasCheck;
fixing that removes some of the calls to SimplifyInstruction). So from a
compile-time point of view we may be able to afford a new call to
SimplifyInstruction, but I don't want to do that, because the benefits for
anything other than PHIs are not clear at this point.

A slightly different way of looking at the current problem is this: all the
incoming values to our PHI node are the same pointer, so we want to do
something like what we do in stripPointerCasts, this time for PHIs instead of
the GEPs and other cases that stripPointerCasts handles. I have prepared a
patch so that in aliasCheck, right after the call to stripPointerCasts, we
check whether we have a phi where all incoming values are the same; if so, we
just look at that incoming value instead of the phi itself. I want to add the
code for removing redundant calls to GetUnderlyingObject as well and then post
it for review. Let me know if you have any comments.
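The stripPointerCasts-style extension proposed here can be sketched abstractly (a toy data model of mine, not the LLVM classes; note that in the actual IR the two incoming GEPs are distinct instructions computing the same address, which is why the proposal pairs this with simplification): if every incoming value of a phi is the same value, alias queries may look at that value instead of the phi.

```python
class Phi:
    """Toy phi node: just a list of incoming values."""
    def __init__(self, incoming):
        self.incoming = incoming

def look_through_trivial_phi(v):
    # If every incoming value of the phi is the same value, an alias
    # query can look at that value instead of the phi itself.
    while isinstance(v, Phi):
        first = v.incoming[0]
        if any(i != first for i in v.incoming[1:]):
            break                     # a real merge point: stop
        v = first                     # all inputs identical: strip it
    return v

# Mirrors the bug's IR: both incoming values compute &p1->f3.
gep_f3 = "getelementptr %p1, i32 3"
phi = Phi([gep_f3, gep_f3])
assert look_through_trivial_phi(phi) == gep_f3
```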
Quuxplusone commented 3 years ago
I think this has been resolved in current trunk. BasicAA now looks through
phis, and we also eliminate/forward loads across BBs.

Trunk generates https://godbolt.org/z/f4Eo3x

foo:                                    # @foo
        cmpwi   5, 1
        bltlr   0
        li 7, 8
        lwz 6, 4(3)
        clrldi  5, 5, 32
        addi 4, 4, -16
        lfs 0, 8(3)
        mtctr 5
        li 5, 4
.LBB0_2:                                # =>This Inner Loop Header: Depth=1
        lwzu 8, 20(4)
        lfsx 1, 4, 5
        add 6, 6, 8
        stw 6, 4(3)
        xssubsp 0, 0, 1
        stfsx 0, 3, 7
        lwz 8, 8(4)
        stw 8, 12(3)
        bdnz .LBB0_2
        blr
        .long   0
        .quad   0