I wasn't even aware that these short-circuit, but I see that it is documented in the manual. (I guess it comes from Python?) Somehow I doubt that many programs rely on this, but I don't know how to go about doing a survey.
The manual says "However, the order of evaluations in a chained comparison is undefined". This makes the short-circuit behavior useless anyway, so you might as well get rid of it. Any code that relies on the short-circuit behavior is probably broken anyway, thanks to the undefined order.
I'm not sure about the efficiency trade-off. Assuming short-circuit (lazy) semantics, several cases come to mind for `a() < b() < c()`:

- `c()` is expensive, and therefore short-circuiting wins.
- `c()` is cheap, and therefore the extra branch is the overriding cost. But if `c()` is that cheap, it's probably all inline and the compiler can estimate its cost fairly well, leading to two subcases of concern:
  - `c()` is provably safe to evaluate unnecessarily. Then the compiler can convert short-circuit to eager evaluation. (I've worked on a compiler that did this, but I'm not sure about LLVM.)
  - `c()` is not provably safe to evaluate unnecessarily. Common examples are an indirect load off a possibly null pointer or a possible divide by zero. Not being able to evaluate `c()` unnecessarily can hurt vectorization on machines without predication (e.g. it won't be a problem for AVX-512, but it can be a problem for SSE/AVX).

For the last case, there is also a flow-analysis loss for lazy evaluation: with evaluation of `c()` uncertain, subsequent evaluations "downstream" cannot be proven to be redundant. But modern partial-redundancy elimination (PRE) techniques deal with that. (I'm not sure what the state of PRE is in LLVM; it seemed to be missing from LLVM 3.3, from what I recall.)
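To make the lazy-vs-eager distinction above concrete, here is a minimal sketch in Julia; `a`, `b`, and `c` are placeholder functions of my own, not anything from the discussion:

```julia
# Placeholder operands, purely for illustration (not from the thread above).
a() = 2
b() = 1
c() = (println("c() was evaluated"); 3)

# Lazy (current, short-circuiting) semantics: c() runs only if a() < b(),
# so nothing is printed here and the result is false.
t1 = a(); t2 = b()
lazy = t1 < t2 && t2 < c()

# Eager semantics: every operand and both comparisons are evaluated
# unconditionally, trading the branch for a bitwise `&` on two Bools.
t1 = a(); t2 = b(); t3 = c()
eager = (t1 < t2) & (t2 < t3)
```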
Maybe we could add a switch to control lazy versus eager evaluation of chained comparisons and see what the impact is on current benchmarks of interest?
With respect to "order of evaluations in a chained comparison is undefined", I'd be in favor of nailing down the order, regardless of whether we decide on eager or lazy evaluation, because having the order differ between implementations can cause a lot of pain. The optimization advantages of undefined order, useful on ye olde PDP-11, seem to have evaporated with modern compilers. Even the C++ committee is seriously considering nailing down the order.
> Then the compiler can convert short-circuit to eager evaluation.
@ArchRobison: if LLVM was better at doing this, I would agree with your analysis, but it seems to either never do this or be very bad at it. Perhaps that's effectively agreement since LLVM could be improved to make this argument go through. Similarly for the second subcase.
I wholeheartedly agree about making the order of evaluation defined.
I would think that a lot of chained comparisons in practice are safe to evaluate unnecessarily, e.g. `0 < i ≤ n` and similar are probably very common. I just checked, and LLVM 3.7.1 in Julia 0.5 is not able to convert short-circuit to eager evaluation in `0 < i ≤ n` for `Int` types.

In the case where one of the operands is very expensive to evaluate unnecessarily, one can always write `&&` explicitly. Of course, in the current situation you could always write `&` explicitly, but eager evaluation seems like a better default to me, since "cheap" operands are probably more common, if I had to guess (assuming eager is faster in that case). Of course, it would be even better if LLVM were just smarter.
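For concreteness, a small sketch of the three spellings involved, plus one way to peek at what LLVM emits for them (the function names are mine, purely illustrative):

```julia
# Chained form (currently lowered with a short-circuit branch):
chained(i::Int, n::Int) = 0 < i <= n

# Explicitly eager: no branch, both comparisons always run.
eager(i::Int, n::Int) = (0 < i) & (i <= n)

# Explicitly lazy: useful when the second operand is expensive.
lazy(i::Int, n::Int) = 0 < i && i <= n

# To see whether the short-circuit survives into the generated IR, compare
# the output of these two calls and look for a conditional branch:
code_llvm(chained, (Int, Int))
code_llvm(eager, (Int, Int))
```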
I just did a quick grep through an archive of all registered Julia packages for chained `<` (or `<=`) and chained `==` expressions, and in all cases I could find it was safe (and probably advantageous) to evaluate eagerly. Almost all operands were constants, variables, simple functions like `length(A)`, `size(M,1)`, `zero(š)`, or `kernel.Nš + 1`, or array lookups like `x[i]`. In the few examples where there was a more expensive operand, it was something like `lb <= fitness(gt) < ub`, where eager evaluation is safe because you have to evaluate `fitness(gt)` anyway.
Of course, I could have missed some; this is hard to grep for, and there were a lot of false positives to sift through. But I think I at least got a representative sampling.
Thanks for looking. Array lookups can be surprisingly expensive, depending on whether they hit L1 cache or not, and branches can be surprisingly cheap, depending on whether the branch predictor bets right. (Modern CPUs have become a casino.)
I tried the `0 < i ≤ n` example in C both ways:
```c
void bar();
void f(int i, int n) {
    if (0<i && i<=n) bar();
}
void g(int i, int n) {
    if ((0<i) & (i<=n)) bar();
}
```
For 64-bit x86, clang -O3 and icc -O3 leave the first as lazy and the second as eager. gcc 5.2 changed both to the lazy form, suggesting that the gcc developers bet that lazy is faster than eager on 64-bit x86. Story on other processors may differ.
I think timings of the benchmarks of interest are the data we need to make a decision on performance grounds, and performance seems to be one of the motivations for this issue.
I looked at this back when I was introducing the array indexing infrastructure. At the time, it was surprisingly cheaper to have the short-circuiting semantics (https://github.com/JuliaLang/julia/pull/10525#discussion_r29102527): those branches are extremely predictable.
In this specific case, it'd be interesting to try `reinterpret(UInt, i - 1) < n`; is there a particular form in which LLVM would be able to do that optimization (if it is indeed an optimization)?
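For reference, here is a sketch of the unsigned-comparison trick being alluded to (my own spelling of it; it assumes `n >= 0`, which holds for array lengths):

```julia
# `0 < i <= n` for Int, as two comparisons:
inbounds_chained(i::Int, n::Int) = 0 < i <= n

# The same test as a single unsigned comparison, assuming n >= 0:
# if i <= 0, then i - 1 reinterpreted as UInt wraps to a huge value and the
# comparison fails; otherwise it is just (i - 1) < n, i.e. i <= n.
inbounds_unsigned(i::Int, n::Int) =
    reinterpret(UInt, i - 1) < reinterpret(UInt, n)
```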
A quick benchmark:
```julia
function foo(a)
    s = 0
    for i in 2:length(a)
        s += a[i-1] < i < a[i]
    end
end

function bar(a)
    s = 0
    for i in 2:length(a)
        s += (a[i-1] < i) & (i < a[i])
    end
end

a = rand(Int, 10^4)
@elapsed(foo(a)) / @elapsed(bar(a))
```
gives a 5x slowdown from short-circuiting. This seems like a pretty huge penalty, especially considering that the comparisons per se are only part of the loop body. Surprisingly, if I use `@inbounds` in the loop body, the ratio decreases to 1.5 (i.e. short-circuit is slower, but not by nearly as much). This is confusing to me.
Nice example. One improvement: make `foo` and `bar` return the value of `s`. Otherwise the benchmark is just searching for out-of-bounds indices. Though even with that fix I'm seeing the crazy big time ratio (about 3.4x for a 10^7-element `a`). There's a lesson here that I want to figure out.
The slowdown is surely a branch misprediction issue, since using `a = rand(10^4:10^5, 10^4)` in the example causes the ratio to become about 0.9 for me. I tried a variation of the example with lots of delinquent loads (see below), and the ratio was about 1.2x, still in favor of eager.
```julia
function foo(a,b)
    s = 0
    for i in 2:length(a)
        s += a[b[i-1]] < i < a[b[i]]
    end
    s
end

function bar(a,b)
    s = 0
    for i in 2:length(a)
        s += (a[b[i-1]] < i) & (i < a[b[i]])
    end
    s
end

a = rand(Int, 10^8)
b = rand(1:10^8, 10^8)
@elapsed(foo(a,b)) / @elapsed(bar(a,b))
```
A thought that passed my mind: in the quick benchmark, the chain is not used to direct control flow. It might be useful to understand how often that happens compared to the control-flow case.
I personally love the chained form for assertions, but in that scenario the branches are highly predictable unless I'm having a bad day.
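To illustrate the two usage patterns being contrasted (my own toy examples, not drawn from the packages surveyed above):

```julia
# (1) Chain directing control flow, e.g. an assertion: the branch goes the
#     same way almost every time, so the branch predictor handles it.
function checked_get(x::Vector{Int}, i::Int)
    @assert 0 < i <= length(x)
    return x[i]
end

# (2) Chain used as a value inside a hot loop, as in the benchmarks above:
#     here the data-dependent branch from short-circuiting is what hurts.
count_in_range(xs, lo, hi) = count(x -> lo <= x <= hi, xs)
```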
Triage: resolved that this doesn't matter much and changing it now would be pointless churn. There are corner cases where the current behavior could be more efficient. In cases where it's less efficient, future optimization work can address that gap, whereas laziness is semantically significant and therefore cannot be avoided.
It's harder to lower, more surprising, and usually less efficient. We should just lower `a < b < c` to `(a < b) & (b < c)`.
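A sketch of what that lowering would mean in practice (hand-written; presumably the real lowering would still bind `b` to a temporary so it is evaluated only once):

```julia
# Roughly, the proposed lowering keeps the single evaluation of `b` but
# drops the branch:
#     t = b
#     (a < t) & (t < c)      # instead of:  a < t && t < c
#
# The observable difference is that `c` is evaluated even when `a < b` fails:
calls = Ref(0)
counted(x) = (calls[] += 1; x)

2 < 1 < counted(3);          calls[]   # 0: current lazy lowering skips counted
(2 < 1) & (1 < counted(3));  calls[]   # 1: the eager form always evaluates it
```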