Quuxplusone / LLVMBugzillaTest

0 stars 0 forks source link

Differences between `base+width*i` and `base+=width` #42805

Open Quuxplusone opened 4 years ago

Quuxplusone commented 4 years ago
Bugzilla Link PR43835
Status NEW
Importance P enhancement
Reported by Roman Lebedev (lebedev.ri@gmail.com)
Reported on 2019-10-29 05:42:17 -0700
Last modified on 2021-08-16 12:24:27 -0700
Version trunk
Hardware PC Linux
CC efriedma@quicinc.com, florian_hahn@apple.com, hfinkel@anl.gov, johannes@jdoerfert.de, listmail@philipreames.com, llvm-bugs@lists.llvm.org, max.kazantsev@azul.com
Fixed by commit(s)
Attachments
Blocks
Blocked by
See also PR43836
https://godbolt.org/z/TR-s8A

#include <array>
#include <algorithm>

void sink(int* row);

void v0(int* data, int len, int width) {
    for(int i = 0; i < len; i++) {
        int *row = data + width*i;
        sink(row);
    }
}
void v1(int* row, int len, int width) {
    for(int i = 0; i < len; i++, row+=width) {
        sink(row);
    }
}

In first case we get

9:                                                ; preds = %9, %5
  %10 = phi i64 [ 0, %5 ], [ %13, %9 ]
  %11 = mul nsw i64 %10, %6
  %12 = getelementptr inbounds i32, i32* %0, i64 %11
  tail call void @_Z4sinkPi(i32* %12)
  %13 = add nuw nsw i64 %10, 1

but in second:

  %9 = phi i32 [ 0, %5 ], [ %11, %8 ]
  %10 = phi i32* [ %0, %5 ], [ %12, %8 ]
  tail call void @_Z4sinkPi(i32* %10)
  %11 = add nuw nsw i32 %9, 1
  %12 = getelementptr inbounds i32, i32* %10, i64 %6

I'd guess the second variant is better?
The difference is interesting.
Quuxplusone commented 4 years ago
Or more globally, this was the original pattern i was looking at:
https://godbolt.org/z/Bzn5Lz

#include <array>
#include <algorithm>

void sink(int* row, int* predRow);

void bad(int* data, int len, int width) {
    for(int i = 1; i < len; i++) {
        int *row = data + width*i;
        int *predRow = data + width*(i-1);
        sink(row, predRow);
    }
}
void good(int* data, int len, int width) {
    int* predRow = data;
    for(int i = 1; i < len; i++, data+=width, predRow+=width) {
        sink(data, predRow);
    }
}

In first case there are two gep's, in second there is an extra phi
and only one gep.
Quuxplusone commented 4 years ago

Interesting. I suppose that IndVarSimplify doesn't catch this. Which form is actually better, in some general sense, is target dependent (depending on whethr a target has various pre/post increment operations), but LSR won't create new PHIs, and that's another complication in this space.

Quuxplusone commented 2 years ago
Surely sticking with the current status-quo leaves the performance on the floor
for the code written in the opposite style from the one optimal for given
$target.

I would really guess that the integer induction should be canonical?
For which targets would pointer induction be better?
Quuxplusone commented 2 years ago

From an analysis standpoint, it doesn't really matter much; the SCEV expressions are equivalent either way. And there aren't really any interesting optimizations if we canonicalize early, I think. IndVarSimplify used to canonicalize aggressively, but we turned that off a while back because it was messing up optimized loops without any obvious benefit.

If LSR can't treat the two forms as equivalent, that's obviously a problem. Not sure why that's happening here, though, at first glance.

Quuxplusone commented 2 years ago
The reason i'm revisiting this is https://reviews.llvm.org/D107935
I'm basically thinking about having a pass to rewrite all pointer math so that:
1. loop-invariant part of address computation should be done in preheader
2. canonicalize to index induction

There already exists SeparateConstOffsetFromGEP pass,
and it seems to be run by a few backends already.
Not sure if we can extend & use it.
Quuxplusone commented 2 years ago

Feel free to try, but if you seriously pursue this, I suspect you're going to end up spending a lot of time looking at LSR...

Please make sure you take some time to look at the history of disable-iv-rewrite.

Quuxplusone commented 2 years ago

Unfortunately, I agree with Eli that this is going to be delicate and very complicated to get tuned just right. The first step will be tuning LSR to ensure that the (proposed) canonical form gets turned into the optimal sequence on each target reliably. LSR is fragile and annoying, so that's a significant amount of work and risk to take on.

Unless you are strongly motivated by this case, I'd honestly advise against it.