Optimize for loop for Range, ArraySequence, and Array

gavinking commented 11 years ago

Our for loop is slow:

for (i in min..max) can be optimized to for (int i=min; i<=max; i++) (but watch out for reverse ranges!)
for (i in min:len) can be optimized to for (int i=min; i<min+len; i++)
for (x in arrayOrArraySequence) can be optimized to use a C-style for and direct access to the underlying java array by index

tombentley commented 11 years ago

For a Sequence we'd need an instanceof to know it was an ArraySequence and thus we'd need to transform the block twice, once for the optimised loop and again for the non-optimised loop.

gavinking commented 11 years ago

@tombentley yes. right. Indeed I think you would want to do that instanceof for any List or perhaps even for any Iterable.

tombentley commented 11 years ago

Java has the little known java.util.RandomAccess marker interface, though I don't know where it's actually used.

We could do something similar or perhaps something along these lines:

interface For<T> satisfies Iterable<T>&Correspondence<Integer,T>&Sized {
    shared formal Boolean preferIterator;
}

Then potentially other collections can benefit.

quintesse commented 11 years ago

@tombentley I don't think that would work, because we can only do this optimization because we know that the underlying implementation for Array and ArraySequence is a Java array. I do wonder about these optimizations though, they are okay, but if the reason is because our iterators really are much slower maybe we should find out why. And finally, optimizations are good of course, but first lets finish the language/compiler itself, because, as we can already see in these simple examples, optimizations make things a lot more complex. (Premature optimization, root-of-all-evil and such)

gavinking commented 11 years ago

@quintesse iterators are slow because they do allocation. Iterating 0..10 allocates 11 instances of Integer.

And finally, optimizations are good of course, but first lets finish the language/compiler itself, because, as we can already see in these simple examples, optimizations make things a lot more complex.

We need to do this optimization now because it's something we already know is very inefficient and anyone who benchmarks Ceylon code will notice it immediately. People are already complaining on twitter.

tombentley commented 11 years ago

@quintesse I don't see how that doesn't work: We don't have to guarantee that preferIterator returns the same thing on all platforms, or even for all instances of a given class. The Java ArraySequence would just override preferIterator to always return true and our loops would look something like this:

if ((seq instanceof For) && !((For)seq).getPreferIterator()) {
    // C-style loop
} else {
    // Iterator loop
}

tombentley commented 11 years ago

BTW, I'm not saying I like the duplicated block at all. That could lead to a lot of bloat. But I don't see how we can avoid it either, while still generating optimized code.

I thought about doing some static analysis to try to only generate one of the blocks. For example if the loop is in a non-shared method and you can prove it's only ever called with an ArraySequence. But you also have to worry about method references being taken and passed outside the containing scope of the method declaration. In other words it's quite hard to do right. That kind of optimization can definitely wait, imo.

tombentley commented 11 years ago

Once we have reified generics there are other optimisations we could do too, for example It you know you''re generating a Iterable<Integer> you can have hidden method which returns the primitive ints. Then when you know you're iterating a Iterable<Integer> you can use that method instead. This doesn't work well for our Iterator though because it returns a union (Object).

Another thought: We could 'intern' some Integer instances, say -100..255, which would prevent allocation in a lot of common cases.

tombentley commented 11 years ago

Ditto Character, I suppose.

quintesse commented 11 years ago

@gavinking But that's a problem we'll always have with the current implementation, fixing this only for those specific cases is nothing more than applying patches. People will still notice the bad performance the moment they change the code trivially:

value iter = 0..5;
for (i in iter) { ... }

for (i in (0..10).filter((Integer n) n % 2 == 0) { ... }

or when they start using their own generic interfaces.

This can only be solved by doing something much more clever (like making special optimized versions of classes for the basic types. There are already long discussions on the mailing list about that).

@tombentley Yes, but you gloss over the // C-style loop, how will you implement that without knowing how to get the item? You could use the item() from Correspondence but it has the same problem as Iterator: because it's generic it will have to box its result.

quintesse commented 11 years ago

Another thought: We could 'intern' some Integer instances, say -100..255, which would prevent allocation in a lot of common cases.

Yes definitely, that was always the idea (that's why we use instance() methods, something we still have to do for Tuple.)

gavinking commented 11 years ago

making special optimized versions of classes for the basic types

Sorry, but I think this is probably never going to be worth the pain and effort. And it will very likely never be as fast as a C-style for, even though it is way more work to implement.

I can't understand why you would not want to give users something that is as fast as an ordinary C-style for when they write the extremely common and natural:

for (i in 0:n) { ... }

I mean, I have seen that thousands of times in my career, and I don't think I've ever seen anyone try to write it as:

value iter = 0:n;
for (i in iter) { ... }

Seriously.

gavinking commented 11 years ago

We could 'intern' some Integer instances, say -100..255

This might help in business code. Would not help in scientific-computing-style code where n is very often large.

And of course it will still not be as fast as a C-style loop.

quintesse commented 11 years ago

Once we have reified generics there are other optimisations we could do too, for example It you know you''re generating a Iterable you can have hidden method which returns the primitive ints

We don't need reified generics for this, because that code would need to be generated statically (or we would have to generate a switch-case statement for all the possible basic types). In these examples knowing that we deal with a Iterable<Integer> should already be enough, we just need to know that those extra optimized methods exist.

There exist already ideas for this on the mailing list. One possible solution would be to always generate these optimized methods, but this results in a combinatorial explosion when you have more than one type parameter. So a second option would be to mark, with an annotation, the type parameters you want so optimized.

Anyway, you're right that this wouldn't work for Iterator because of its design.

tombentley commented 11 years ago

@gavinking: I agree that optimizing type-parameterised stuff would be a huge pain, but Tako's right: Without it we end up boxing an int just so we can call item().

quintesse commented 11 years ago

@gavinking

Seriously.

Yes seriously, that was just an example of how easy it would be to get non-optimized code, the moment you do anything but a simple counter you'll have bad performance again. People will go: "look, I count from 1 to a million and it takes only N seconds, but if I count only the odd numbers from 1 to a million it takes 50x longer! Even if it's only half the amount of numbers!"

Which means BTW that people will start to distrust using our beautiful Iterable and will go back to writing:

for (n in (0..1000000) {
    if (n % 2 == 0) { ... }
}

gavinking commented 11 years ago

@tombentley You would not call item(). You would directly access the underlying array. In the case of a Range, you never actually instantiate the Range at all, you use a C-style for and ++ or --.

@quintesse There's absolutely no reason why we can't also optimize for (i in (0:n).by(2)) { ... }. C'mon, that's an almost trivial enhancement to the original request.

tombentley commented 11 years ago

I'm a sucker... assigning myself for now.

quintesse commented 11 years ago

@gavinking first you've changed my example and secondly you keep ignoring the fact that with trivial changes I can still make it perform bad. Are we going to add specific code for all possible methods I could call on Iterable? Inline filter() as an if-block inside the loop? Etc etc? Of course not.

BTW: I'm not saying we should not do the optimizations suggested in this issue, they're "cheap" optimizations. But if we don't do something about the generic case people will still complain, I can assure you.

quintesse commented 11 years ago

@tombentley

assigning myself for now.

Ok, but let's stick to the 3 cases mentioned in this issue for now, nothing clever

gavinking commented 11 years ago

With trivial changes you can make any code perform worse. The question is:

Should we make the most common and natural way to write something that looks like a C/Java-style for perform as well as a C/Java-style for?
Likewise, should we make the most common and natural way to iterate the most common kinds of Lists in our language as fast as iterating a Java array?

Additional, much more difficult optimizations for more exotic cases we can worry about later. But since even Java does not optimize these cases, I doubt anyone is going to complain about them very much on twitter. Let's get the low hanging fruit, the ones where we already know we perform worse than Java.

tombentley commented 11 years ago

nothing clever

Would I?

tombentley commented 11 years ago

Another thought: We could 'intern' some Integer instances, say -100..255, which would prevent allocation in a lot of common cases.

Doing this seems to result in worse performance for me. For the record I cached a single contiguous range (-100 to 255) of Integer instances in a (Java) Integer[], and added a simple test at the beginning of Integer.instance():

    if (l >= MIN_INTERNED && l <= MAX_INTERNED) {
        return INTERNED[(int)(l-MIN_INTERNED)];
    }

I had a benchmark of invoking instance() only with the cached range 100,000,000 times, which I ran 5 times. The iteration was timed using System.currentTimeMillis().

Without the cache the mean was 18352ms, with the cache 29511ms.

Conclusion: Clearly the JVM was able to do some impressive optimizations of its own with the vanilla version.

quintesse commented 11 years ago

Well that's the kind of micro-benchmark you have to be careful with. Did you actually use the resulting Integer for example?

tombentley commented 11 years ago

Fair point and no, I wasn't. But it doesn't seem to make a difference. I changed the body of the loop to:

h ^= Integer.instance(ii).hashCode();

(printing h at the end, so the VM really had to compute it). This time the non-cached version averaged ~29s, and the cache version averaged ~65s. To it's still much worse.

quintesse commented 11 years ago

Yes, and it doesn't make much sense, does it? Have you tried seeing what happens:

when you have a fixed index, let's say INTERNED[0] to see if it's the calculation
when you make the if return the same result as the "else" (hopefully the JVM isn't that clever to see the if can be removed in that case)

FroMage commented 11 years ago

I recently learned that you should use nano time rather than System.currentTimeMillis which is much slower.

FroMage commented 11 years ago

Optimising for loops dynamically by inserting duplicate code seems waaaaay wrong guys. I'd much rather work with our VM guys to see if that pattern detection can be added to hotspot.

quintesse commented 11 years ago

Well nano time is the only one that's actually correct for measuring sub-second timing differences, because the best granularity for miliseconds is about 16-17ms (on all platforms it uses the internal clock that just doesn't do better than 1/60th of a second). It doesn't drift over time though so it's good for a clock. Nanotime is really really good for short time periods, but tends to drift over time. So in this case seeing as we're talking about tens of seconds of differences it doesn't really make much of a difference ( @tombentley unless you're getting your results by adding lots of small measurements? but I guess you're measuring the time from the beginning of the test until the end, outside of the loop, right? )

tombentley commented 11 years ago

outside of the loop

Of course

quintesse commented 11 years ago

@FroMage I agree that we shouldn't be inserting duplicate code ( like, if ("x instanceof Foo") { /* use fast code */ ) else { /* use slow code */ } ). If we can detect it with static analysis in the compiler I think we could do it though.

But I don't see how we can push the responsibility to the hotspot JIT. That would mean waiting for Java 8 (or longer) before Ceylon would become fast. Or have people install a specially patched Java version just for Ceylon. Or were you thinking about something else?

FroMage commented 11 years ago

I'm saying that the best optimisations we can hope for will be done by adding patterns to hotspot for ceylon. And that this particular one seems like it can only be fixed that way properly.

tombentley commented 11 years ago

I agree that the JVM might well be a better place to do this stuff, but this talk of "patterns" seems a bit like hand waving: How on earth do we go about adding such patterns? I mean won't every JVM language want their own patterns added to hotspot? How would we ensure users were using a JITing JVM with the right 'patterns'? And what about other JVMs, such a dalvik?

FroMage commented 11 years ago

They are not hand waving. They currently already exist and this is why the JVM can replace implementations of ArrayList with arrays and a few other collections too at runtime. Which is why the java collections have better perf than anything else since a release mid-Java 6. There's a number of such hotspot patterns that make certain Java idioms a lot faster than anything else, at runtime. If any non-Java JVM language is going to get any lube out of hotspot it's only going to be possible by working with the hotspot guys to add such patterns for non-Java languages. I bet others will try it too. I know we should. And Red Hat has guys working on the JVM so it's not crazy hand-waving.

BTW, this is how Java 8 plans to implement default methods for interfaces for backwards compatibility. Older code compiled against Java 7 will automatically inherit the new methods at class-loading time on Java 8. Something we (Ceylon) can't do ATM. Runtime is a place we can't concentrate on right now, but if we're serious about having a fast language we'll have to go there, we can't rest on compile-time to make magic happen because Java doesn't.

tombentley commented 11 years ago

OK, it's great that it's not hand waving! Do you know of any links or references to this stuff thought? Speaking personally I've no idea of the optimizations hotspot is capable of. It's a black box to me. We need to understand this stuff sufficiently to figure out when leaving stuff till runtime makes more sense.

And seriously what about non-hotspot JVMs?

gavinking commented 11 years ago

All that is nice speculation but we need something simple that makes a couple of really common things fast right now.

Sent from my iPhone

On 16/11/2012, at 2:41 PM, Stéphane Épardaud notifications@github.com wrote:

They are not hand waving. They currently already exist and this is why the JVM can replace implementations of ArrayList with arrays and a few other collections too at runtime. Which is why the java collections have better perf than anything else since a release mid-Java 6. There's a number of such hotspot patterns that make certain Java idioms a lot faster than anything else, at runtime. If any non-Java JVM language is going to get any lube out of hotspot it's only going to be possible by working with the hotspot guys to add such patterns for non-Java languages. I bet others will try it too. I know we should. And Red Hat has guys working on the JVM so it's not crazy hand-waving.

BTW, this is how Java 8 plans to implement default methods for interfaces for backwards compatibility. Older code compiled against Java 7 will automatically inherit the new methods at class-loading time on Java 8. Something we (Ceylon) can't do ATM. Runtime is a place we can't concentrate on right now, but if we're serious about having a fast language we'll have to go there, we can't rest on compile-time to make magic happen because Java doesn't.

— Reply to this email directly or view it on GitHub.

FroMage commented 11 years ago

Do you know of any links or references to this stuff thought?

No I got that with discussions with Sanne and other guys in the know.

And seriously what about non-hotspot JVMs?

Those exist? ;)

All that is nice speculation but we need something simple that makes a couple of really common things fast right now.

Yes, but I'm really not sure duplicating each for loop is a good idea, that smells like bloating our bytecode.

gavinking commented 11 years ago

Well the suggested optimisation for Range does not require duplicating the for. And I think we should also be able to avoid that for the array access case also, if we are clever enough about it.

FroMage commented 11 years ago

Oh there's no doubt I absolutely agree with the trivial optimisations for ranges and segments :) And arrays if we can determine that at compile-time. Sure, that's pretty evident and dirt trivial to do.

tombentley commented 11 years ago

Of course the range optimizations will screw up interception of:

Range instantiation,
the Range.iterator and
the Iterator.next

But only in certain circumstances (i.e. when Range's type argument is Integer). So we're going to need some weasel words in the spec to permit this sort of thing.

gavinking commented 11 years ago

those words are already there ;)

Sent from my iPhone

On 17/11/2012, at 8:16 PM, Tom Bentley notifications@github.com wrote:

Of course the range optimizations will screw up interception of:

Range instantiation, the Range.iterator and the Iterator.next But only in certain circumstances (i.e. when Range's type argument is Integer). So we're going to need some weasel words in the spec to permit this sort of thing.

— Reply to this email directly or view it on GitHub.

tombentley commented 11 years ago

If you mean the words in section 4.5.6. (Restrictions on user-defined classes) then I don't think those words make clear that the optimisations can extend other objects obtained from those types (e.g. Range's Iterator).

tombentley commented 11 years ago

We now generate more optimal code for for (i in start..end) { ... } and for (i in (start..end).by(step) { ... }. At the moment a while loop is still a bit quicker, and a Java-style for is a bit quicker still. One of the reasons for this is that in Java you effectively state the direction at compile time (depending on whether you say ++/+= or --/-=), but in Ceylon we don't know at compile time which direction we're going, so we have to use conditionals. Another reason might be the termination condition in the case where there is no by(): We should be able to use an == for this, which is slightly faster, I think, than an <= or a >=. But I think we need more benchmarking to know for sure.

It would be possible to generalise this optimization a little bit I think in the case where the sequence is not a RangeOp, but is known at compile time to be a Range<Integer>. Not sure if this is worth doing yet though.

I can look at for (i in start:length) { ... }, once we've decided how that should behave on overflow (see #99).

gavinking commented 11 years ago

great!

Sent from my iPhone

On 24/11/2012, at 10:33 AM, Tom Bentley notifications@github.com wrote:

We now generate more optimal code for for (i in start..end) { ... } and for (i in (start..end).by(step) { ... }. At the moment a while loop is still a bit quicker, and a Java-style for is a bit quicker still. One of the reasons for this is that in Java you effectively state the direction at compile time (depending on whether you say ++/+= or --/-=), but in Ceylon we don't know at compile time which direction we're going, so we have to use conditionals. Another reason might be the termination condition in the case where there is no by(): We should be able to use an == for this, which is slightly faster, I think, than an <= or a >=. But I think we need more benchmarking to know for sure.

It would be possible to generalise this optimization a little bit I think in the case where the sequence is not a RangeOp, but is known at compile time to be a Range. Not sure if this is worth doing yet though.

I can look at for (i in start:length) { ... }, once we've decided how that should behave on overflow (see #99).

— Reply to this email directly or view it on GitHub.

FroMage commented 11 years ago

Yeah I don't understand why we have to do arithmetic in the end condition since it's constant at runtime, it's not like it's going to change from one iteration to another. Also I don't understand why we don't use strict comparison with ==. And last, do you know we can define multiple variables inside a for loop in Java? ;)

Did you try this with for comprehensions too?

tombentley commented 11 years ago

Let me explain. We're transforming this:

    variable Integer sum := 0;
    for (i in 1..10) {
        sum += i;
    }

Into this:

    long sum = 0L;
    final long $ceylontmpstart0 = 1L;
    final long $ceylontmpend1 = 10L;
    final boolean $ceylontmpincreasing2 = $ceylontmpstart0 <= $ceylontmpend1;
    final long $ceylontmpincr3 = $ceylontmpincreasing2 ? 1L : -1L;
    for (long i = $ceylontmpstart0; $ceylontmpincreasing2 ? i - $ceylontmpend1 <= 0L : i - $ceylontmpend1 >= 0L; i += $ceylontmpincr3) {
        sum += i;
    }

The end condition is $ceylontmpincreasing2 ? i - $ceylontmpend1 <= 0L : i - $ceylontmpend1 >= 0L because i <= $ceylontmpend1 goes wrong if i has overflowed. For example maxInteger-10..maxInteger.by(3).

We could use ==, but doing it that way (because a Range includes its end) we have to calculate the end properly taking care of overflow. It's a bit simpler in the case when there's no by() (i.e. step is 1 or -1). When step is 1 or -1 we can alternatively put the exit condition in a break after the transformed for block which avoids needing to do arithmetic which might overflow. I originally had it like that but ran across ceylon/ceylon-spec#473 and it doesn't generalize to having a different step size as easily.

On balance I didn't think it was necessarily worth the complication of having two optimizations for this at this point, until we have some decent benchmarks to show it's worth it. There are further optimizations we could do when values are known at compile time, for example. In fact I don't suffer from a lack of ideas about how we can optimize stuff. What I lack is knowing which of them are worthwhile. I haven't yet pushed some simple infrastructure I was writing to help with such benchmarking, but I plan to.

Java does let you define >1 variable in the for, but they have to be the same type:

    for (int i =0, y= 1; ;) {
    }

Lastly this doesn't apply to comprehensions because they have their own transformation for iteration. It would be nice if we could abstract it sufficiently to use the same code for comprehensions of course. But I know from when I did that with condition lists and comprehensions that it's not quick.

tombentley commented 11 years ago

Having found this (actually I've seen it before, I'd just forgotten about it) and read in particular this and this, I'm pretty sure all my benchmarks so far have been flawed in one way or another. It seems to me that macro-benchmarking is going to be more helpful (or perhaps I mean merely "helpful"). We could take something like the n-Queens solver and see how much faster it runs with optimization than without. But we need lots more of these kinds of program. And we need a framework for running them.

FroMage commented 11 years ago

OK, I suppose we can optimise cases where the range is constant and the step is 1 further, but in the absence of good benchmarks it's hard. I suppose that's a good start. Did you check the original code that the guy posted on twitter about ceylon being half as fast as java/scala with this new optimisation?

tombentley commented 11 years ago

Could someone post a link to the twitter post please?

I've also been thinking about Gavin's 3rd case: for (x in arrayOrArraySequence) There's no reason why we can't optimize this if the type is known at compile time to be a random access list. The problem is that not all Sequences support efficient random access, which narrows the scope of such an optimisation. We could introduce a RandomAccessList interface, which embodied this. That would allow Array and String plus user classes to support Iterator-less iteration, but would require more weasel words in the spec. Or we could just limit it to Array and String. The easy was to make it also work for Sequence literals ({a, b, c}) would require the inferred type to be ArraySequence, forcing us to make that part of the public API (which we were considering anyway). The hard way is to have a way of knowing that it was declared as a sequence literal at the point we wish to iterated over it.

FroMage commented 11 years ago

https://github.com/russel/Pi_Quadrature/blob/master/pi_ceylon_sequential.ceylon

ceylon / ceylon-compiler

Optimize for loop for Range, ArraySequence, and Array #882