JuliaLang / julia

The Julia Programming Language
https://julialang.org/
MIT License

don't short-circuit chained comparisons #16088

Closed StefanKarpinski closed 7 years ago

StefanKarpinski commented 8 years ago

It's harder to lower, more surprising and usually less efficient. We should just lower a < b < c to (a < b) & (b < c).
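Concretely, the two candidate lowerings can be written out by hand. This is an illustrative sketch (`lazy_chain` and `eager_chain` are made-up names, not anything in Base); in both cases the middle operand is evaluated only once:

```julia
# Current lowering (sketch): short-circuits via &&, so `c` is never
# evaluated when `a < b` is false.
lazy_chain(a, b, c) = (t = b; a < t && t < c)

# Proposed lowering: evaluate both comparisons unconditionally and
# combine them with the non-short-circuiting bitwise `&`.
eager_chain(a, b, c) = (t = b; (a < t) & (t < c))
```

For side-effect-free operands the two forms always agree; the difference is only in whether the right-hand comparison's operands get evaluated.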

stevengj commented 8 years ago

I wasn't even aware that these short-circuit, but I see that it is documented in the manual. (I guess it comes from Python?) Somehow I doubt that many programs rely on this, but I don't know how to go about doing a survey.

stevengj commented 8 years ago

The manual says "However, the order of evaluations in a chained comparison is undefined". This makes the short-circuit behavior useless anyway, so you might as well get rid of it. Any code that relies on the short-circuit behavior is probably broken anyway, thanks to the undefined order.
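To illustrate the point (a hypothetical sketch, not code from the thread): with the evaluation order left undefined, whether the right-hand operand runs at all is unspecified, so its side effects cannot be counted on.

```julia
# Sketch: relying on short-circuiting here is fragile. Under a
# left-to-right short-circuit lowering, g() is skipped because
# 10 < 5 is already false; but with the order "undefined", an
# implementation could legally evaluate g() first.
calls = Symbol[]
f() = (push!(calls, :f); 10)
g() = (push!(calls, :g); 3)
result = f() < 5 < g()
```

The result is false either way; only the side effect on `calls` is implementation-dependent, which is exactly why depending on it is broken.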

ArchRobison commented 8 years ago

I'm not sure about the efficiency trade-off. Assuming short-circuit (lazy) semantics, several cases come to mind for a() < b() < c():

For the last case, there is also a flow-analysis loss for lazy evaluation: with evaluation of c() uncertain, subsequent evaluations "downstream" cannot be proven to be redundant. But modern partial-redundancy elimination (PRE) techniques deal with that. (I'm not sure what the state of PRE is in LLVM. It seemed to be missing from LLVM 3.3 from what I recall.)

Maybe we could add a switch to control lazy versus eager evaluation of chained comparisons and see what the impact is on current benchmarks of interest?

ArchRobison commented 8 years ago

With respect to "order of evaluations in a chained comparison is undefined", I'd be in favor of nailing down the order, regardless of whether we decide on eager or lazy evaluation, because having the order differ between implementations can cause a lot of pain. The optimization advantages of an undefined order, useful on ye olde PDP-11, seem to have evaporated with modern compilers. Even the C++ committee is seriously considering nailing down the order.

StefanKarpinski commented 8 years ago

"Then the compiler can convert short-circuit to eager evaluation."

@ArchRobison: if LLVM were better at doing this, I would agree with your analysis, but it seems to either never do this or do it very badly. Perhaps that's effectively agreement, since LLVM could be improved to make this argument go through. Similarly for the second subcase.

I wholeheartedly agree about making the order of evaluation defined.

stevengj commented 8 years ago

I would think that a lot of chained comparisons in practice are safe to evaluate unnecessarily, e.g. 0 < i ā‰¤ n and similar are probably very common. I just checked, and LLVM 3.7.1 in Julia 0.5 is not able to convert short-circuit to eager evaluation in 0 < i ā‰¤ n for Int types.

In the case where one of the operands is very expensive to evaluate unnecessarily, one can always write && explicitly. Of course, in the current situation you could always write & explicitly, but eager evaluation seems like a better default to me, since "cheap" operands are probably more common if I had to guess (assuming eager is faster in that case). Of course, it would be even better if LLVM were just smarter.

stevengj commented 8 years ago

I just did a quick grep through an archive of all registered Julia packages for chained < (or <=) and chained == expressions, and in all cases I could find it was safe (and probably advantageous) to evaluate eagerly. Almost all operands were constants, variables, simple functions like length(A) or size(M,1) or zero(šœ™) or kernel.Nšœ™ + 1, or array lookups x[i]. In the few examples where there was a more expensive operand, it was something like lb <= fitness(gt) < ub where eager evaluation is safe because you have to evaluate fitness(gt) anyway.

Of course, I could have missed some; this is hard to grep for, and there were a lot of false positives to sift through. But I think I at least got a representative sampling.

ArchRobison commented 8 years ago

Thanks for looking. Array lookups can be surprisingly expensive, depending on whether they hit L1 cache or not, and branches can be surprisingly cheap, depending on whether the branch predictor bets right. (Modern CPUs have become a casino.)

I tried the 0 < i ā‰¤ n example in C both ways:

void bar();

void f(int i, int n) {
    if (0<i && i<=n) bar();
}

void g(int i, int n) {
    if ((0<i) & (i<=n)) bar();
}

For 64-bit x86, clang -O3 and icc -O3 leave the first as lazy and the second as eager. gcc 5.2 changed both to the lazy form, suggesting that the gcc developers bet that lazy is faster than eager on 64-bit x86. The story on other processors may differ.

I think timing the benchmarks of interest will give us the data we need to decide on performance grounds, and performance seems to be one of the motivations for this issue.

mbauman commented 8 years ago

I looked at this back when I was introducing the array indexing infrastructure. At the time, it was surprisingly cheaper to have the short-circuiting semantics (https://github.com/JuliaLang/julia/pull/10525#discussion_r29102527) ā€” those branches are extremely predictable.

In this specific case, it'd be interesting to try reinterpret(UInt, i - 1) < n; is there a particular form in which LLVM would be able to do that optimization (if it is indeed an optimization)?
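The trick mentioned above replaces both comparisons of `0 < i ≤ n` with a single unsigned compare. A sketch of the idea (`inbounds_unsigned` is an illustrative name; assumes `n ≥ 0`):

```julia
# For Int i and n >= 0, `0 < i <= n` is equivalent to one unsigned
# comparison: subtracting 1 maps i ∈ 1:n onto 0:n-1, while any
# i <= 0 wraps around to a huge unsigned value and fails the test.
inbounds_unsigned(i::Int, n::Int) =
    reinterpret(UInt, i - 1) < reinterpret(UInt, n)
```

This form has no branches at all, which is presumably the shape LLVM would need in order to treat the whole check as straight-line code.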

stevengj commented 8 years ago

A quick benchmark:

function foo(a)
       s = 0
       for i in 2:length(a)
           s += a[i-1] < i < a[i]
       end
end
function bar(a)
       s = 0
       for i in 2:length(a)
           s += (a[i-1] < i) & (i < a[i])
       end
end
a = rand(Int, 10^4)
@elapsed(foo(a)) / @elapsed(bar(a))

gives a 5x slowdown from short-circuiting. This seems like a pretty huge penalty, especially considering that the comparisons per se are only part of the loop body. Surprisingly, if I use @inbounds in the loop body, the ratio decreases to 1.5 (i.e. short-circuit is slower, but not by nearly as much). This is confusing to me.

ArchRobison commented 8 years ago

Nice example. One improvement: make foo and bar return the value of s. Otherwise the benchmark is just searching for out-of-bounds indices. Though even with that fix I'm seeing the crazy big time ratio (about 3.4x for a 10^7 element a). There's a lesson here that I want to figure out.
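Incorporating that fix, the benchmark might read as follows (a sketch of the corrected version; `foo2`/`bar2` are renamed only to distinguish them from the originals):

```julia
# Same benchmark, but returning s so the loop body is not dead code.
function foo2(a)
    s = 0
    for i in 2:length(a)
        s += a[i-1] < i < a[i]      # short-circuit chain
    end
    s
end

function bar2(a)
    s = 0
    for i in 2:length(a)
        s += (a[i-1] < i) & (i < a[i])  # eager form
    end
    s
end
```

Timed the same way as before, with `@elapsed(foo2(a)) / @elapsed(bar2(a))`.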

ArchRobison commented 8 years ago

The slowdown is surely a branch-misprediction issue, since using a = rand(10^4:10^5, 10^4) in the example causes the ratio to become about 0.9 for me. I tried a variation of the example with lots of delinquent loads (see below), and the ratio was about 1.2x, still in favor of eager.

function foo(a,b)
       s = 0
       for i in 2:length(a)
           s += a[b[i-1]] < i < a[b[i]]
       end
       s
end
function bar(a,b)
       s = 0
       for i in 2:length(a)
           s += (a[b[i-1]] < i) & (i < a[b[i]])
       end
       s
end
a = rand(Int, 10^8)
b = rand(1:10^8, 10^8)
@elapsed(foo(a,b)) / @elapsed(bar(a,b))

ArchRobison commented 8 years ago

A thought that passed my mind is that the quick benchmark uses the result of the chain as a value rather than to direct control flow. It might be useful to understand how often that happens compared to the control-flow case.

I personally love the chained form for assertions, but in that scenario the branches are highly predictable unless I'm having a bad day.
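For instance, an assertion in the chained style might look like this (an illustrative sketch; `window_sum` is a made-up example function):

```julia
# Chained comparisons read naturally in assertions; the branches
# here are taken on almost every call, so prediction is nearly free.
function window_sum(a::Vector{Int}, lo::Int, hi::Int)
    @assert 1 <= lo <= hi <= length(a)
    sum(@view a[lo:hi])
end
```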

StefanKarpinski commented 7 years ago

Triage: resolved that this doesn't matter much and that changing it now would be pointless churn. There are corner cases where the current behavior can be more efficient, and in the cases where it's less efficient, future optimization work can close that gap; the laziness, on the other hand, is part of the observable semantics, so it is not something an optimizer could recover if we removed it.