eclipse-archived / ceylon

The Ceylon compiler, language module, and command line tools
http://ceylon-lang.org
Apache License 2.0

Optimize for loop for Range, ArraySequence, and Array #882

Closed CeylonMigrationBot closed 8 years ago

CeylonMigrationBot commented 11 years ago

[@gavinking] Our for loop is slow:

[Migrated from ceylon/ceylon-compiler#882] [Closed at 2014-05-02 10:39:53]

CeylonMigrationBot commented 11 years ago

[@tombentley] For a Sequence we'd need an instanceof to know it was an ArraySequence and thus we'd need to transform the block twice, once for the optimised loop and again for the non-optimised loop.

CeylonMigrationBot commented 11 years ago

[@gavinking] @tombentley yes. right. Indeed I think you would want to do that instanceof for any List or perhaps even for any Iterable.

CeylonMigrationBot commented 11 years ago

[@tombentley] Java has the little known java.util.RandomAccess marker interface, though I don't know where it's actually used.

We could do something similar or perhaps something along these lines:

interface For<T> satisfies Iterable<T>&Correspondence<Integer,T>&Sized {
    shared formal Boolean preferIterator;
}

Then potentially other collections can benefit.

CeylonMigrationBot commented 11 years ago

[@quintesse] @tombentley I don't think that would work, because we can only do this optimization because we know that the underlying implementation for Array and ArraySequence is a Java array. I do wonder about these optimizations though: they are okay, but if the reason is that our iterators really are much slower, maybe we should find out why. And finally, optimizations are good of course, but first let's finish the language/compiler itself, because, as we can already see in these simple examples, optimizations make things a lot more complex. (Premature optimization, root of all evil and such.)

CeylonMigrationBot commented 11 years ago

[@gavinking] @quintesse iterators are slow because they do allocation. Iterating 0..10 allocates 11 instances of Integer.

And finally, optimizations are good of course, but first let's finish the language/compiler itself, because, as we can already see in these simple examples, optimizations make things a lot more complex.

We need to do this optimization now because it's something we already know is very inefficient and anyone who benchmarks Ceylon code will notice it immediately. People are already complaining on twitter.
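
The allocation cost Gavin describes can be sketched in plain Java. BoxedRange below is a hypothetical stand-in for ceylon.language.Range, not the real implementation; the point is only that the iterator protocol forces an Iterator object plus one boxed Long per element, while the C-style loop touches no heap at all:

```java
import java.util.Iterator;

// Hypothetical stand-in for a Range of Integers; not the real ceylon.language code.
class BoxedRange implements Iterable<Long> {
    private final long first, last;
    BoxedRange(long first, long last) { this.first = first; this.last = last; }

    public Iterator<Long> iterator() {
        return new Iterator<Long>() {       // one allocation for the Iterator itself
            private long next = first;
            public boolean hasNext() { return next <= last; }
            public Long next() { return Long.valueOf(next++); } // boxes each element (beyond the small JDK cache)
            public void remove() { throw new UnsupportedOperationException(); }
        };
    }

    // iterator-based loop: allocates the iterator plus a box per element
    static long sumBoxed(long first, long last) {
        long sum = 0;
        for (long i : new BoxedRange(first, last)) sum += i;
        return sum;
    }

    // C-style loop over the same values: no allocation at all
    static long sumUnboxed(long first, long last) {
        long sum = 0;
        for (long i = first; i <= last; i++) sum += i;
        return sum;
    }
}
```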

CeylonMigrationBot commented 11 years ago

[@tombentley] @quintesse I don't see how that doesn't work: We don't have to guarantee that preferIterator returns the same thing on all platforms, or even for all instances of a given class. The Java ArraySequence would just override preferIterator to always return true and our loops would look something like this:

if ((seq instanceof For) && !((For)seq).getPreferIterator()) {
    // C-style loop
} else {
    // Iterator loop
}

CeylonMigrationBot commented 11 years ago

[@tombentley] BTW, I'm not saying I like the duplicated block at all. That could lead to a lot of bloat. But I don't see how we can avoid it either, while still generating optimized code.

I thought about doing some static analysis to try to only generate one of the blocks. For example if the loop is in a non-shared method and you can prove it's only ever called with an ArraySequence. But you also have to worry about method references being taken and passed outside the containing scope of the method declaration. In other words it's quite hard to do right. That kind of optimization can definitely wait, imo.

CeylonMigrationBot commented 11 years ago

[@tombentley] Once we have reified generics there are other optimisations we could do too. For example, if you know you're generating an Iterable<Integer> you can have a hidden method which returns the primitive ints. Then when you know you're iterating an Iterable<Integer> you can use that method instead. This doesn't work well for our Iterator though, because it returns a union (Object).

Another thought: We could 'intern' some Integer instances, say -100..255, which would prevent allocation in a lot of common cases.
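
A minimal sketch of that interning idea, in the style of java.lang.Integer.valueOf. CeylonInteger is a hypothetical illustration using the bounds suggested above, not the actual ceylon.language.Integer code:

```java
// Hypothetical sketch of an interning cache for boxed Integers; the names and
// bounds (-100..255) come from the suggestion above, not from real code.
final class CeylonInteger {
    private static final long MIN_INTERNED = -100, MAX_INTERNED = 255;
    private static final CeylonInteger[] INTERNED =
            new CeylonInteger[(int) (MAX_INTERNED - MIN_INTERNED + 1)];
    static {
        for (int i = 0; i < INTERNED.length; i++) {
            INTERNED[i] = new CeylonInteger(MIN_INTERNED + i);
        }
    }

    final long value;
    private CeylonInteger(long value) { this.value = value; }

    static CeylonInteger instance(long l) {
        if (l >= MIN_INTERNED && l <= MAX_INTERNED) {
            return INTERNED[(int) (l - MIN_INTERNED)]; // cache hit: no allocation
        }
        return new CeylonInteger(l); // outside the cached range: allocate as before
    }
}
```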

CeylonMigrationBot commented 11 years ago

[@tombentley] Ditto Character, I suppose.

CeylonMigrationBot commented 11 years ago

[@quintesse] @gavinking But that's a problem we'll always have with the current implementation, fixing this only for those specific cases is nothing more than applying patches. People will still notice the bad performance the moment they change the code trivially:

value iter = 0..5;
for (i in iter) { ... }

for (i in (0..10).filter((Integer n) n % 2 == 0)) { ... }

or when they start using their own generic interfaces.

This can only be solved by doing something much more clever (like making special optimized versions of classes for the basic types. There are already long discussions on the mailing list about that).

@tombentley Yes, but you gloss over the // C-style loop, how will you implement that without knowing how to get the item? You could use the item() from Correspondence but it has the same problem as Iterator: because it's generic it will have to box its result.

CeylonMigrationBot commented 11 years ago

[@quintesse]

Another thought: We could 'intern' some Integer instances, say -100..255, which would prevent allocation in a lot of common cases.

Yes definitely, that was always the idea (that's why we use instance() methods, something we still have to do for Tuple.)

CeylonMigrationBot commented 11 years ago

[@gavinking]

making special optimized versions of classes for the basic types

Sorry, but I think this is probably never going to be worth the pain and effort. And it will very likely never be as fast as a C-style for, even though it is way more work to implement.

I can't understand why you would not want to give users something that is as fast as an ordinary C-style for when they write the extremely common and natural:

for (i in 0:n) { ... }

I mean, I have seen that thousands of times in my career, and I don't think I've ever seen anyone try to write it as:

value iter = 0:n;
for (i in iter) { ... }

Seriously.

CeylonMigrationBot commented 11 years ago

[@gavinking]

We could 'intern' some Integer instances, say -100..255

This might help in business code. Would not help in scientific-computing-style code where n is very often large.

And of course it will still not be as fast as a C-style loop.

CeylonMigrationBot commented 11 years ago

[@quintesse]

Once we have reified generics there are other optimisations we could do too, for example if you know you're generating an Iterable you can have a hidden method which returns the primitive ints

We don't need reified generics for this, because that code would need to be generated statically (or we would have to generate a switch-case statement for all the possible basic types). In these examples, knowing that we deal with an Iterable<Integer> should already be enough; we just need to know that those extra optimized methods exist.

There are already ideas for this on the mailing list. One possible solution would be to always generate these optimized methods, but this results in a combinatorial explosion when you have more than one type parameter. So a second option would be to mark, with an annotation, the type parameters you want optimized.

Anyway, you're right that this wouldn't work for Iterator because of its design.
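
In Java terms, those "extra optimized methods" might look roughly like this. LongList, itemLong and LongArrayList are all hypothetical names for illustration; the point is only that a primitive overload can sit next to the generic boxing accessor, and a compiler that knows it has an Iterable<Integer> can call the primitive one:

```java
// Hypothetical: a generic accessor plus a compiler-generated primitive twin.
interface LongList {
    Object item(long index);   // the normal generic accessor: must box
    long itemLong(long index); // extra optimized method: no boxing
}

final class LongArrayList implements LongList {
    private final long[] elems;
    LongArrayList(long... elems) { this.elems = elems; }
    public Object item(long index) { return Long.valueOf(elems[(int) index]); }
    public long itemLong(long index) { return elems[(int) index]; }
}
```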

CeylonMigrationBot commented 11 years ago

[@tombentley] @gavinking: I agree that optimizing type-parameterised stuff would be a huge pain, but Tako's right: Without it we end up boxing an int just so we can call item().

CeylonMigrationBot commented 11 years ago

[@quintesse] @gavinking

Seriously.

Yes seriously, that was just an example of how easy it would be to get non-optimized code, the moment you do anything but a simple counter you'll have bad performance again. People will go: "look, I count from 1 to a million and it takes only N seconds, but if I count only the odd numbers from 1 to a million it takes 50x longer! Even if it's only half the amount of numbers!"

Which means BTW that people will start to distrust using our beautiful Iterable and will go back to writing:

for (n in 0..1000000) {
    if (n % 2 == 0) { ... }
}

CeylonMigrationBot commented 11 years ago

[@gavinking] @tombentley You would not call item(). You would directly access the underlying array. In the case of a Range, you never actually instantiate the Range at all, you use a C-style for and ++ or --.

@quintesse There's absolutely no reason why we can't also optimize for (i in (0:n).by(2)) { ... }. C'mon, that's an almost trivial enhancement to the original request.
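
That enhancement could plausibly be lowered without ever instantiating the segment or the .by(2) view. A hedged sketch of what the generated Java might look like (sumEvens is just an example loop body, not anything the compiler actually emits):

```java
// Hypothetical lowering of `for (i in (0:n).by(2)) { sum += i; }`:
// the segment 0:n covers 0 <= i < n, and .by(2) becomes the loop increment,
// so no Range or segment object is ever allocated.
class SegmentLoop {
    static long sumEvens(long n) {
        long sum = 0;
        for (long i = 0; i < n; i += 2) {
            sum += i;
        }
        return sum;
    }
}
```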

CeylonMigrationBot commented 11 years ago

[@tombentley] I'm a sucker... assigning myself for now.

CeylonMigrationBot commented 11 years ago

[@quintesse] @gavinking first, you've changed my example, and secondly you keep ignoring the fact that with trivial changes I can still make it perform badly. Are we going to add specific code for all possible methods I could call on Iterable? Inline filter() as an if-block inside the loop? Etc etc? Of course not.

BTW: I'm not saying we should not do the optimizations suggested in this issue, they're "cheap" optimizations. But if we don't do something about the generic case people will still complain, I can assure you.

CeylonMigrationBot commented 11 years ago

[@quintesse] @tombentley

assigning myself for now.

Ok, but let's stick to the 3 cases mentioned in this issue for now, nothing clever.

CeylonMigrationBot commented 11 years ago

[@gavinking] With trivial changes you can make any code perform worse. The question is which optimizations are worth doing now.

Additional, much more difficult optimizations for more exotic cases we can worry about later. But since even Java does not optimize these cases, I doubt anyone is going to complain about them very much on twitter. Let's get the low hanging fruit, the ones where we already know we perform worse than Java.

CeylonMigrationBot commented 11 years ago

[@tombentley]

nothing clever

Would I?

CeylonMigrationBot commented 11 years ago

[@tombentley]

Another thought: We could 'intern' some Integer instances, say -100..255, which would prevent allocation in a lot of common cases.

Doing this seems to result in worse performance for me. For the record I cached a single contiguous range (-100 to 255) of Integer instances in a (Java) Integer[], and added a simple test at the beginning of Integer.instance():

    if (l >= MIN_INTERNED && l <= MAX_INTERNED) {
        return INTERNED[(int)(l-MIN_INTERNED)];
    }

I had a benchmark of invoking instance() only with the cached range 100,000,000 times, which I ran 5 times. The iteration was timed using System.currentTimeMillis().

Without the cache the mean was 18352ms, with the cache 29511ms.

Conclusion: Clearly the JVM was able to do some impressive optimizations of its own with the vanilla version.

CeylonMigrationBot commented 11 years ago

[@quintesse] Well that's the kind of micro-benchmark you have to be careful with. Did you actually use the resulting Integer for example?

CeylonMigrationBot commented 11 years ago

[@tombentley] Fair point and no, I wasn't. But it doesn't seem to make a difference. I changed the body of the loop to:

h ^= Integer.instance(ii).hashCode();

(printing h at the end, so the VM really had to compute it). This time the non-cached version averaged ~29s, and the cached version averaged ~65s. So it's still much worse.

CeylonMigrationBot commented 11 years ago

[@quintesse] Yes, and it doesn't make much sense, does it? Have you tried seeing what happens:

CeylonMigrationBot commented 11 years ago

[@FroMage] I recently learned that you should use System.nanoTime() rather than System.currentTimeMillis(), which is much slower.

CeylonMigrationBot commented 11 years ago

[@FroMage] Optimising for loops dynamically by inserting duplicate code seems waaaaay wrong guys. I'd much rather work with our VM guys to see if that pattern detection can be added to hotspot.

CeylonMigrationBot commented 11 years ago

[@quintesse] Well, nanoTime is the only one that's actually correct for measuring sub-second timing differences: the best granularity of currentTimeMillis is about 16-17 ms (on all platforms it uses the internal clock, which just doesn't do better than 1/60th of a second). It doesn't drift over time though, so it's good for a wall clock. nanoTime is really good for short time periods, but tends to drift over time. So in this case, seeing as we're talking about tens of seconds of difference, it doesn't really matter much. ( @tombentley unless you're getting your results by adding lots of small measurements? But I guess you're measuring the time from the beginning of the test until the end, outside of the loop, right? )
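
For reference, the measurement style being discussed (two readings bracketing the whole run, result kept observable so the JIT can't delete the loop) looks like this; the XOR-hash workload is a placeholder standing in for the instance() benchmark above, not the actual benchmark code:

```java
// Sketch of whole-run timing: one nanoTime() reading before the loop, one
// after, nothing accumulated per iteration.
class WholeRunTimer {
    static long timeWorkloadNanos(int iterations) {
        long h = 0;
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            h ^= Long.valueOf(i).hashCode(); // placeholder workload
        }
        long elapsed = System.nanoTime() - start;
        System.out.println("h = " + h); // print the result so the loop stays live
        return elapsed;
    }
}
```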

CeylonMigrationBot commented 11 years ago

[@tombentley]

outside of the loop

Of course

CeylonMigrationBot commented 11 years ago

[@quintesse] @FroMage I agree that we shouldn't be inserting duplicate code (like if (x instanceof Foo) { /* use fast code */ } else { /* use slow code */ }). If we can detect it with static analysis in the compiler I think we could do it though.

But I don't see how we can push the responsibility to the hotspot JIT. That would mean waiting for Java 8 (or longer) before Ceylon would become fast. Or have people install a specially patched Java version just for Ceylon. Or were you thinking about something else?

CeylonMigrationBot commented 11 years ago

[@FroMage] I'm saying that the best optimisations we can hope for will be done by adding patterns to hotspot for ceylon. And that this particular one seems like it can only be fixed that way properly.

CeylonMigrationBot commented 11 years ago

[@tombentley] I agree that the JVM might well be a better place to do this stuff, but this talk of "patterns" seems a bit like hand waving: how on earth do we go about adding such patterns? I mean, won't every JVM language want their own patterns added to hotspot? How would we ensure users were using a JITing JVM with the right 'patterns'? And what about other JVMs, such as Dalvik?

CeylonMigrationBot commented 11 years ago

[@FroMage] They are not hand waving. They currently already exist and this is why the JVM can replace implementations of ArrayList with arrays and a few other collections too at runtime. Which is why the java collections have better perf than anything else since a release mid-Java 6. There's a number of such hotspot patterns that make certain Java idioms a lot faster than anything else, at runtime. If any non-Java JVM language is going to get any lube out of hotspot it's only going to be possible by working with the hotspot guys to add such patterns for non-Java languages. I bet others will try it too. I know we should. And Red Hat has guys working on the JVM so it's not crazy hand-waving.

BTW, this is how Java 8 plans to implement default methods for interfaces for backwards compatibility. Older code compiled against Java 7 will automatically inherit the new methods at class-loading time on Java 8. Something we (Ceylon) can't do ATM. Runtime is a place we can't concentrate on right now, but if we're serious about having a fast language we'll have to go there, we can't rest on compile-time to make magic happen because Java doesn't.

CeylonMigrationBot commented 11 years ago

[@tombentley] OK, it's great that it's not hand waving! Do you know of any links or references to this stuff though? Speaking personally I've no idea of the optimizations hotspot is capable of. It's a black box to me. We need to understand this stuff sufficiently to figure out when leaving stuff till runtime makes more sense.

And seriously what about non-hotspot JVMs?

CeylonMigrationBot commented 11 years ago

[@gavinking] All that is nice speculation but we need something simple that makes a couple of really common things fast right now.

CeylonMigrationBot commented 11 years ago

[@FroMage]

Do you know of any links or references to this stuff thought?

No I got that with discussions with Sanne and other guys in the know.

And seriously what about non-hotspot JVMs?

Those exist? ;)

All that is nice speculation but we need something simple that makes a couple of really common things fast right now.

Yes, but I'm really not sure duplicating each for loop is a good idea, that smells like bloating our bytecode.

CeylonMigrationBot commented 11 years ago

[@gavinking] Well the suggested optimisation for Range does not require duplicating the for. And I think we should be able to avoid that for the array access case too, if we are clever enough about it.

CeylonMigrationBot commented 11 years ago

[@FroMage] Oh there's no doubt I absolutely agree with the trivial optimisations for ranges and segments :) And arrays if we can determine that at compile-time. Sure, that's pretty evident and dirt trivial to do.

CeylonMigrationBot commented 11 years ago

[@tombentley] Of course the range optimizations will screw up interception of: Range instantiation, Range.iterator and Iterator.next. But only in certain circumstances (i.e. when Range's type argument is Integer). So we're going to need some weasel words in the spec to permit this sort of thing.

CeylonMigrationBot commented 11 years ago

[@gavinking] those words are already there ;)

CeylonMigrationBot commented 11 years ago

[@tombentley] If you mean the words in section 4.5.6 (Restrictions on user-defined classes), then I don't think those words make clear that the optimisations can extend to other objects obtained from those types (e.g. Range's Iterator).

CeylonMigrationBot commented 11 years ago

[@tombentley] We now generate more optimal code for for (i in start..end) { ... } and for (i in (start..end).by(step)) { ... }. At the moment a while loop is still a bit quicker, and a Java-style for is a bit quicker still. One of the reasons for this is that in Java you effectively state the direction at compile time (depending on whether you say ++/+= or --/-=), but in Ceylon we don't know at compile time which direction we're going, so we have to use conditionals. Another reason might be the termination condition in the case where there is no by(): we should be able to use an == for this, which is slightly faster, I think, than a <= or a >=. But I think we need more benchmarking to know for sure.

It would be possible to generalise this optimization a little bit I think in the case where the sequence is not a RangeOp, but is known at compile time to be a Range<Integer>. Not sure if this is worth doing yet though.

I can look at for (i in start:length) { ... }, once we've decided how that should behave on overflow (see #99).

CeylonMigrationBot commented 11 years ago

[@gavinking] great!

CeylonMigrationBot commented 11 years ago

[@FroMage] Yeah I don't understand why we have to do arithmetic in the end condition since it's constant at runtime, it's not like it's going to change from one iteration to another. Also I don't understand why we don't use strict comparison with ==. And last, do you know we can define multiple variables inside a for loop in Java? ;)

Did you try this with for comprehensions too?

CeylonMigrationBot commented 11 years ago

[@tombentley] Let me explain. We're transforming this:

    variable Integer sum := 0;
    for (i in 1..10) {
        sum += i;
    }

Into this:

    long sum = 0L;
    final long $ceylontmpstart0 = 1L;
    final long $ceylontmpend1 = 10L;
    final boolean $ceylontmpincreasing2 = $ceylontmpstart0 <= $ceylontmpend1;
    final long $ceylontmpincr3 = $ceylontmpincreasing2 ? 1L : -1L;
    for (long i = $ceylontmpstart0; $ceylontmpincreasing2 ? i - $ceylontmpend1 <= 0L : i - $ceylontmpend1 >= 0L; i += $ceylontmpincr3) {
        sum += i;
    }

The end condition is $ceylontmpincreasing2 ? i - $ceylontmpend1 <= 0L : i - $ceylontmpend1 >= 0L because i <= $ceylontmpend1 goes wrong if i has overflowed. For example (maxInteger-10..maxInteger).by(3).
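
To see why the subtraction form matters, consider counting the elements of an increasing range whose end is maxInteger. This is a hedged demonstration (countElements is a made-up helper, and the condition shown only covers the increasing direction; the decreasing case uses the mirrored >= 0 test, as in the transform above):

```java
// With end == Long.MAX_VALUE and step 3, `i <= end` is always true, so that
// loop would never terminate once i wraps around. `i - end <= 0` also wraps
// on overflow, but wraps to a positive value, so the loop exits correctly.
class OverflowSafeLoop {
    static long countElements(long start, long end, long step) {
        long count = 0;
        for (long i = start; i - end <= 0L; i += step) {
            count++;
        }
        return count;
    }
}
```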

We could use ==, but doing it that way (because a Range includes its end) we have to calculate the end properly taking care of overflow. It's a bit simpler in the case when there's no by() (i.e. step is 1 or -1). When step is 1 or -1 we can alternatively put the exit condition in a break after the transformed for block which avoids needing to do arithmetic which might overflow. I originally had it like that but ran across #3579 and it doesn't generalize to having a different step size as easily.

On balance I didn't think it was necessarily worth the complication of having two optimizations for this at this point, until we have some decent benchmarks to show it's worth it. There are further optimizations we could do when values are known at compile time, for example. In fact I don't suffer from a lack of ideas about how we can optimize stuff. What I lack is knowing which of them are worthwhile. I haven't yet pushed some simple infrastructure I was writing to help with such benchmarking, but I plan to.

Java does let you define >1 variable in the for, but they have to be the same type:

    for (int i = 0, y = 1; ;) {
    }

Lastly this doesn't apply to comprehensions because they have their own transformation for iteration. It would be nice if we could abstract it sufficiently to use the same code for comprehensions of course. But I know from when I did that with condition lists and comprehensions that it's not quick.

CeylonMigrationBot commented 11 years ago

[@tombentley] Having found this (actually I've seen it before, I'd just forgotten about it) and read in particular this and this, I'm pretty sure all my benchmarks so far have been flawed in one way or another. It seems to me that macro-benchmarking is going to be more helpful (or perhaps I mean merely "helpful"). We could take something like the n-Queens solver and see how much faster it runs with optimization than without. But we need lots more of these kinds of program. And we need a framework for running them.

CeylonMigrationBot commented 11 years ago

[@FroMage] OK, I suppose we can further optimise cases where the range is constant and the step is 1, but in the absence of good benchmarks it's hard. I suppose that's a good start. Did you check, with this new optimisation, the original code that the guy posted on twitter about Ceylon being half as fast as Java/Scala?

CeylonMigrationBot commented 11 years ago

[@tombentley] Could someone post a link to the twitter post please?

I've also been thinking about Gavin's 3rd case: for (x in arrayOrArraySequence). There's no reason why we can't optimize this if the type is known at compile time to be a random access list. The problem is that not all Sequences support efficient random access, which narrows the scope of such an optimisation. We could introduce a RandomAccessList interface, which embodied this. That would allow Array and String plus user classes to support Iterator-less iteration, but would require more weasel words in the spec. Or we could just limit it to Array and String. The easy way to make it also work for Sequence literals ({a, b, c}) would require the inferred type to be ArraySequence, forcing us to make that part of the public API (which we were considering anyway). The hard way is to have a way of knowing that it was declared as a sequence literal at the point we wish to iterate over it.
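
A rough sketch of what such a RandomAccessList might enable: index-based traversal with no Iterator allocation. All names here (RandomAccessList, getFromFirst, ArrayBackedList) are hypothetical illustrations, not the real interface:

```java
// Hypothetical marker-plus-accessor interface for efficient random access.
interface RandomAccessList<T> {
    long getSize();
    T getFromFirst(long index);
}

// A trivial array-backed implementation, standing in for Array/ArraySequence.
final class ArrayBackedList<T> implements RandomAccessList<T> {
    private final T[] elems;
    @SafeVarargs
    ArrayBackedList(T... elems) { this.elems = elems; }
    public long getSize() { return elems.length; }
    public T getFromFirst(long index) { return elems[(int) index]; }
}

class RandomAccessLoops {
    // the Iterator-less loop the compiler could emit for such lists
    static <T> int countNonNull(RandomAccessList<T> list) {
        int n = 0;
        for (long i = 0, size = list.getSize(); i < size; i++) {
            if (list.getFromFirst(i) != null) n++;
        }
        return n;
    }
}
```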

CeylonMigrationBot commented 11 years ago

[@FroMage] https://github.com/russel/Pi_Quadrature/blob/master/pi_ceylon_sequential.ceylon