parellel iteration in for

gavinking commented 9 years ago

Currently:

To iterate the elements of two streams in parellel, you have to use for ([x,y] in zipPairs(xs,ys)).
To iterate a stream with its indices, you have to use for (i->x in xs.indexed).

The downside of this is the need to do an instantiation for each element.

I propose to let you write:

for (x in xs, y in ys) { ... }

And

for(i in xs.keys, x in xs) { ... }

Which will likely exhibit much superior performance.

gavinking commented 9 years ago

To clarify: { for (x in xs, y in ys) [x,y] } is not the same thing as { for (x in xs) for (y in ys) [x,y] }. The semantics are parellel, not nested, iteration.

jean-morissette commented 9 years ago

Do you mean you are not computing the cartesian product? This would imply the size of xs must be equal to the size of ys...

gavinking commented 9 years ago

Do you mean you are not computing the cartesian product?

Right.

This would imply the size of xs must be equal to the size of ys...

No, we stop the iteration as soon as any one of the iterators runs out of elements.

gavinking commented 9 years ago

Typechecker side of this now implemented in #1174.

gdejohn commented 9 years ago

Is this limited to two streams at once?

NiematojakTomasz commented 9 years ago

Isn't this optimization possible to implement without introducing new syntax?

I think this syntax is very confusing. I would think that it means cartesian product if I saw a piece of code without reading the spec.

Personally I would prefer to write:

for([x, y] in zipPairs(xs, ys)) { ... }

So no one would have to think twice when reviewing my code.

Also

for (i->x in xs.indexed)

is also far more expressive than:

for(i in xs.keys, x in xs)

and how would parallel iteration behave for Maps?

for(key in xmap.keys, item in xmap.items)

gdejohn commented 9 years ago

I think this syntax is very confusing.

Confusing how? It's easy to understand (you seem to, at least). It's easy to explain, judging by the simplicity of the original post. It's not doing anything complicated under the covers. A feature can still be worthwhile even if it's not immediately understood by everyone without any explanation, right? And even if you don't grant that, this particular feature strikes me as pretty intuitive.

Isn't this optimization possible to implement without introducing new syntax?

Are you suggesting that things like zipPairs() be treated as a special case by the compiler and do something totally irregular? That would be confusing. And it's not like this syntax is unfamiliar. It's just a comma-separated list, a very natural extension of how the for loop already works.

and how would parallel iteration behave for Maps?

This is about iterating separate streams in parallel while avoiding the creation of unnecessary objects, for the sake of efficiency. You wouldn't do this with a map, since in that case the keys and items are already associated with each other.

gavinking commented 9 years ago

Is this limited to two streams at once?

No.

gavinking commented 9 years ago

Isn't this optimization possible to implement without introducing new syntax?

In principle, yes. But to be honest I find the proposed syntax slightly easier to understand than the thing it replaces.

I think this syntax is very confusing. I would think that it means cartesian product if I saw a piece of code without reading the spec.

I agree that this is an alternative possible meaning, but I figure that this is a confusion that can easily be addressed by clearly explaining the semantics when we first introduce people to the for loop.

how would parallel iteration behave for Maps?
for(key in xmap.keys, item in xmap.items)

Well, for a sorted map that code would work, but not for an unsorted map. But sure, it definitely wouldn't be the idiom for iterating a Map.

NiematojakTomasz commented 9 years ago

The downside of this is the need to do an instantiation for each element.

I was curious if it isn't already optimized by jvm. It's quite common case when iterating things like maps. So I wrote a little amateurish benchmark to compare those two approaches to iterating pairs. I saw no performance improvement for java code.

Here is the benchmark: https://github.com/NiematojakTomasz/parallel-iteration-benchmark/tree/master

I know my benchmark have few downsides:

It doesn't definitely applies to ceylon code running on jvm and definitely do not apply to js backend. I can port those benchmarks to ceylon if you're interested.
It doesn't addresses example with zipPairs. I guess it would decrease performance a little, but may be optimized after warm up.
It didn't measured how much resources garbage collector used.

NiematojakTomasz commented 9 years ago

I agree that this is an alternative possible meaning, but I figure that this is a confusion that can easily be addressed by clearly explaining the semantics when we first introduce people to the for loop.

It's a step away from one of goals in ceylon language design mentioned here.

be easy to read and understand, even for beginners, even for non-Ceylon-programmers reading your Ceylon code on your blog or on GitHub,

gdejohn commented 9 years ago

It's a step away from one of goals in ceylon language design mentioned here.

Considering for (x in xs, y in ys) vs. for ([x,y] in zipPairs(xs,ys)), I don't think it would be significantly harder for beginners to understand the former. (I think it could very well be easier.)

NiematojakTomasz commented 9 years ago

@gdejohn

Considering for (x in xs, y in ys) vs. for ([x,y] in zipPairs(xs,ys)), I don't think it would be significantly harder for beginners to understand the former. (I think it could very well be easier.) It depends how often you will use this construct. As for non-Ceylon-programmers it may be very confusing. I think for ([x,y] in zipPairs(xs,ys)) is more expressive as we are explicit/verbose in what we want to do. Especially if you would write something like { for (x in xs, y in ys) [x,y] }which is very similar to mathematical notation where it would mean cartesian product.

It is easier to find in docs what zipPairs means than for (x in xs, y in ys) in language specification.

And there would come some other caveats that would come with this feature:

This would imply the size of xs must be equal to the size of ys...

No, we stop the iteration as soon as any one of the iterators runs out of elements.

I also think that having many features in language may make it harder to read and to learn.

gdejohn commented 9 years ago

It is easier to find in docs what zipPairs means thanfor (x in xs, y in ys) in language specification.

I was referring to the destructuring going on with [x,y], which also necessitates a trip to the language specification for anyone who hasn't seen it before. Why would beginners find that easier to understand than parallel iteration?

And there would come some other caveats that would come with this feature:

This would imply the size of xs must be equal to the size of ys...

No, we stop the iteration as soon as any one of the iterators runs out of elements.

That also applies to zipPairs() and friends.

I also think that having many features in language may make it harder to read and to learn.

Sure, but this isn't the straw that'll break the camel's back.

gavinking commented 9 years ago

Sure, but this isn't the straw that'll break the camel's back.

I don't think so either. And from that point of view, yes, indeed, the new destructuring syntax coming in 1.2 invites way more skepticism.

jean-morissette commented 9 years ago

IMHO, I don't like this language feature.

It does not add expressive power since we can achieve the same already otherwise (plus with an already simple synthax).

I would prefer a language with as few primitives as possible, which you can compose to achieve higher level functionalities than a language with one primitive per functionality.

Maybe you could put it on ice for now and come back to it later once we get more feedback and had more time to think about it? We can always add this feature later.

gdejohn commented 9 years ago

It does not add expressive power since we can achieve the same already otherwise (plus with an already simple synthax).

I'm confused by how you and @NiematojakTomasz are using the word "expressive." This feature does strike me as more expressive than the alternative because it does the job in a more direct manner (both conceptually and in terms of character count). And in any case, this was proposed for the sake of efficiency.

RossTate commented 9 years ago

Two observations:

The benchmark is not an accurate benchmark, assuming I read it correctly. zipPairs would create a new collection of new pairs every time the loop is reached. It's possible the JVM has improved since I last did benchmarking on it, but it used to be that it wasn't good at avoiding/optimizing even these sorts of allocations.
Unlike zipPairs, this feature cleanly generalizes to triples.

On Saturday, December 27, 2014, Griffin DeJohn notifications@github.com wrote:

It does not add expressive power since we can achieve the same already otherwise (plus with an already simple synthax).

I'm confused by how you and @NiematojakTomasz https://github.com/NiematojakTomasz are using the word "expressive." This feature does strike me as more expressive than the alternative because it does the job in a more direct manner (both conceptually and in terms of character count). And in any case, this was proposed for the sake of efficiency.

— Reply to this email directly or view it on GitHub https://github.com/ceylon/ceylon-spec/issues/1174#issuecomment-68171126.

NiematojakTomasz commented 9 years ago

zipPairs would create a new collection of new pairs every time the loop is reached.

zipPairs doesn't create new collection. It simply generates iterator which holds two iterators inside. (see mapPairs)

The benchmark is not an accurate benchmark, assuming I read it correctly

Yes, but I have listed it's holes. Just to make sure - you haven't read only Test1? Test1 stores array of pairs. Test2 stores two separate arrays for keys and integers. Test3 stores array of values. Test4 doesn't store anything.

Tests 2-4 compose Pair on each call of Iterator.next.

It's possible the JVM has improved since I last did benchmarking on it, but it used to be that it wasn't good at avoiding/optimizing even these sorts of allocations.

Is your benchmark available somewhere?

Unlike zipPairs, this feature cleanly generalizes to triples.

Yes. Currently we would have to write something like:

for ([x,y,z] in zip(zipPairs(xs,ys),zs))

And it's not the only place this inconvenience appears. If I would like to aggregate n functions into function which take tuple of arguments to those functions and return tuple it wouldn't be easier.

I think something like

for([x,y,z] in altZip(zs).join(ys).join(xs))

would be possible using current language features. Unfortunately in reverse order. Probably impossible due to right recursion in Tuple type parameters (like in Tuple.withTrailing).

gavinking commented 9 years ago

zipPairs doesn't create new collection. It simply generates iterator which holds two iterators inside.

But each step of the iteration it needs to instantiate a tuple object.

NiematojakTomasz commented 9 years ago

@gavinking Yes, but in my benchmark on each step of iteration were instantiated Pair object. It was java object so it was far more simple. Didn't even has runtime type parameters values. Still I don't like the syntax and thats why I'm digging into alternative solution. I'm quite convinced there are far more generic optimizations possible, but quite complex, that wouldn't involve introducing unintuitive syntax sugar. The benchmark I made was primary to test how HotSpot handles short living return values. I'm currently working on ceylon one. Results in few minutes.

NiematojakTomasz commented 9 years ago

@RossTate I made another benchmark in ceylon. I am quite surprised by how more efficient is parallel iteration is in ceylon. But I am curious if it isn't optimizable in some other more generic way (alternative methods compilation for short living return values).

Many languages introduces various features that were later labeled - avoid this. And I have a bad feeling about this feature. Performance - We can tell how to transform code to be more efficient. We can implement it. The issue now, I guess, are limited resources.

gavinking commented 9 years ago

Some additional thoughts.

First, I agree it doesn't really do much to increase the total expressive power of the language. You can already write this today, like so:

value xiter = xs.iterator();
value yiter = ys.iterator();
while (!is Finished x = xiter.next(), 
       !is Finished y = yiter.next()) {
    ...
}

But we can make the exact same argument for the for loop as it already exists today. It's basically just doing what you can already do with while. So we can always view for as just syntax sugar. (Now, that's not strictly true, since the typechecker does do some special reasoning about for loops an nonemptiness when it checks definite assignment and reachability, but let's set that aside.)

The truth is that nobody wants to write the above with while. So in practice they will use zip() and zipPairs(), which comes at a performance cost. Yes, in certain cases we would in principle be able to optimize away that cost by clever compiler tricks, but the problem with that kind of thing is that you wind up with this:

for ([x,y] in zipPairs(xs, ys)) { ... }

having very different performance characteristics to this:

value pairs = zipPairs(xs, ys);
for ([x,y] in pairs) { ... }

Second, I agree that there is obvious potential for misunderstanding: the first time a developer new to Ceylon encounters this construct, they clearly are going to wonder whether it is doing nested or parallel iteration. I'm convinced that this problem can be solved by documenting the semantics, and it's highly unlikely to ever cause confusion more than once per developer. The proposed semantics for this feature aren't difficult to understand. And, indeed, when you reflect on it, it's the most reasonable thing for that syntax to mean.

Third, I would point out that in some other control structures: if, while, and try, we already allow comma-separated lists. And if we do eventually decide to add pattern matching to the language we will also allow comma-separated lists in switch and case. So in terms of the overall syntactic look of the language, this is perfectly regular.

Fourth, to me this syntax is more readable than the alternative with zipPairs(), and significantly more readable than the alternative with while.

Finally, yes, I'm in total agreement that every extra thing we throw into the language comes at a cost, making the language more difficult to learn, and the compiler more difficult to maintain. Every language feature has a "weight". I definitely don't want to keep throwing new constructs into the language every few months. That's just not the goal we have at all. (Which is why you see me so ambivalent about proposals like pattern matching.) But this isn't actually a new feature. We're making a pre-existing feature, for loops, more powerful, thus, to my mind, increasing the power/weight ratio of a construct we already have. The only new syntax here is a comma ;-)

NiematojakTomasz commented 9 years ago

I forgot to link the benchmark.

gavinking commented 9 years ago

I am quite surprised by how more efficient is parallel iteration is in ceylon.

It's not surprising. Allocating an object is very expensive on the JVM. And here's it's happening at every step of the loop.

But I am curious if it isn't optimizable in some other more generic way (alternative methods compilation for short living return values).

Not really, at least not until the JVM itself supports value types. There are some special cases we can hack in, but it would be very difficult for the programmer to predict which code is efficient and which is not. In general it's a big problem that the JVM does not yet support stack allocation.

NiematojakTomasz commented 9 years ago

It's not surprising. Allocating an object is very expensive on the JVM. And here's it's happening at every step of the loop.

Well it seemed to bit quite efficient in my java benchmark. But I'll try to improve it to be more similar to my ceylon ones.

Not really, at least not until the JVM itself supports value types

Yes, it's more than a year of waiting. But once you introduce feature it will stay forever. And I was actually thinking about allocating single mutable object on stack and passing it to alternatively compiled version of iterator so it would set it's fields (I know - too complicated). But beside that(complicated stuff) - I'm curious what makes so much difference in performance between Ceylon and Java and dig a little bit more.

gavinking commented 9 years ago

By the way, on a tangent, this is an interesting microbenchmark to run:

void bench<T>(T f()) {
    Integer start = system.nanoseconds;
    for (i in 0:5k) {
        f();
    }
    print ("``(system.nanoseconds - start)/1M``");
}

class Pair(shared Integer x, shared Integer y) {}

shared void perf() {
    for (i in 1..5) {
        bench(() => { for (i in 0..1k) Pair(i,i*2) }.find((pair) => pair.x>pair.y));
        bench(() => { for (i in 0..1k) i->i*2 }.find((pair) => pair.key>pair.item));
        bench(() => { for (i in 0..1k) [i,i*2] }.find((pair) => pair[0]>pair[1]));
    }
}

It shows Entry to be approx 50% slower than the non-generic class Pair, and Tuple to be about 100% slower than Entry. We should investigate if it's possible to make Tuple instantiation cheaper, especially for the common case of pairs. I bet it is possible.

gavinking commented 9 years ago

I'm curious what makes so much difference in performance between Ceylon and Java and dig a little bit more.

Java's iterators are a bit faster in micro-benchmarks, for no apparently good reason. I bet there are some weird optimizations deep in the VM. I have been able to prove a difference between iterating with hasNext() and next() compared to Ceylon's next() that simply makes no intuitive sense at all.

NiematojakTomasz commented 9 years ago

I run a sampler on on benchmark:

JAVA_OPTS="-agentlib:hprof=cpu=samples,interval=1,file=ceylon-parallel-iteration-benchmark.hprof" ceylon run com.github.niematojaktomasz.ceylonparalleliterationbenchmark.test3

(heap=sites and cpu=times crashes so I am unable to gather any more detailed statistics)

It seems most of the time takes instantiation of ceylon objects. Mostly BaseSequence. (I know it's some java backend behind curtains things)

Part of the ceylon-parallel-iteration-benchmark.hprof file:

CPU SAMPLES BEGIN (total = 9078) Sat Dec 27 15:03:19 2014
rank   self  accum   count trace method
   1 34,45% 34,45%    3127 301234 ceylon.language.impl.BaseSequence.<init>
   2 14,33% 48,78%    1301 301233 com.github.niematojaktomasz.ceylonparalleliterationbenchmark.test3.run_.run
   3  5,90% 54,68%     536 301231 ceylon.language.impl.BaseSequence.<init>
   4  5,78% 60,46%     525 301239 ceylon.language.impl.BaseSequence.<init>
   5  5,77% 66,24%     524 301237 ceylon.language.impl.BaseSequence.<init>
   6  5,75% 71,99%     522 301236 ceylon.language.impl.BaseSequence.<init>
   7  5,23% 77,22%     475 301232 ceylon.language.impl.BaseSequence.<init>
   8  4,79% 82,01%     435 301240 ceylon.language.impl.BaseSequence.<init>
   9  4,47% 86,48%     406 301244 com.github.niematojaktomasz.ceylonparalleliterationbenchmark.test3.run_.run
  10  0,89% 87,38%      81 300946 java.lang.ClassLoader.defineClass1
  11  0,43% 87,81%      39 300102 java.lang.ClassLoader.defineClass1
  12  0,33% 88,14%      30 301224 ceylon.language.Integer.instance

I don't blame ceylon for that as it have superior language design ; )

Just wanted to complement my benchmarks with some relevant data.

NiematojakTomasz commented 9 years ago

By the way, on a tangent, this is an interesting microbenchmark to run:

bench(() => { for (i in 0..1k) Pair(i,i*2) }.find((pair) => pair.x>pair.y));

CPU SAMPLES BEGIN (total = 1465) Sat Dec 27 17:23:54 2014
rank   self  accum   count trace method
   1 12,29% 12,29%     180 301208 ceylon.language.Integer.<init>
   2 11,33% 23,62%     166 301199 ceylon.language.Iterable$impl.find
   3  6,76% 30,38%      99 301200 com.github.niematojaktomasz.ceylonparalleliterationbenchmark.testgavin1.bench_.bench
   4  4,91% 35,29%      72 300935 java.lang.ClassLoader.defineClass1
   5  4,51% 39,80%      66 300144 java.util.zip.ZipFile.read
   6  2,66% 42,46%      39 300108 java.lang.ClassLoader.defineClass1
   7  2,25% 44,71%      33 301009 java.io.FileInputStream.readBytes
   8  2,05% 46,76%      30 300087 java.util.zip.ZipFile.open
   9  1,57% 48,33%      23 301205 com.github.niematojaktomasz.ceylonparalleliterationbenchmark.testgavin1.bench_.bench
  10  1,43% 49,76%      21 300948 java.util.zip.Inflater.inflateBytes
  11  1,37% 51,13%      20 300468 java.util.zip.ZipFile.getNextEntry
  12  1,16% 52,29%      17 300148 java.util.zip.Inflater.inflateBytes

bench(() => { for (i in 0..1k) i->i*2 }.find((pair) => pair.key>pair.item));

CPU SAMPLES BEGIN (total = 1670) Sat Dec 27 17:09:26 2014
rank   self  accum   count trace method
   1 19,64% 19,64%     328 301263 ceylon.language.Iterable$impl.find
   2 14,85% 34,49%     248 301276 ceylon.language.Integer.<init>
   3  4,73% 39,22%      79 300952 java.lang.ClassLoader.defineClass1
   4  4,07% 43,29%      68 301273 com.github.niematojaktomasz.ceylonparalleliterationbenchmark.testgavin1.bench_.bench
   5  2,63% 45,93%      44 300106 java.lang.ClassLoader.defineClass1
   6  1,14% 47,07%      19 301267 com.github.niematojaktomasz.ceylonparalleliterationbenchmark.testgavin1.bench_.bench
   7  1,08% 48,14%      18 300108 java.lang.ClassLoader.findBootstrapClass
   8  0,96% 49,10%      16 301047 java.io.FileInputStream.readBytes
   9  0,96% 50,06%      16 301266 com.github.niematojaktomasz.ceylonparalleliterationbenchmark.testgavin1.perf_$1$2$1$1.next
  10  0,78% 50,84%      13 300451 java.util.zip.ZipFile.getNextEntry

bench(() => { for (i in 0..1k) [i,i*2] }.find((pair) => pair[0]>pair[1]));

CPU SAMPLES BEGIN (total = 2730) Sat Dec 27 17:08:25 2014
rank   self  accum   count trace method
   1 25,86% 25,86%     706 301184 ceylon.language.impl.BaseSequence.<init>
   2 10,40% 36,26%     284 301188 ceylon.language.Iterable$impl.find
   3  5,09% 41,36%     139 301181 ceylon.language.impl.BaseSequence.<init>
   4  4,14% 45,49%     113 301191 ceylon.language.impl.BaseSequence.<init>
   5  4,10% 49,60%     112 301195 ceylon.language.impl.BaseSequence.<init>
   6  4,07% 53,66%     111 301192 ceylon.language.impl.BaseSequence.<init>
   7  4,07% 57,73%     111 301193 ceylon.language.impl.BaseSequence.<init>
   8  4,03% 61,76%     110 301186 ceylon.language.impl.BaseSequence.<init>
   9  3,81% 65,57%     104 301194 com.github.niematojaktomasz.ceylonparalleliterationbenchmark.testgavin1.bench_.bench
  10  2,97% 68,53%      81 300905 java.lang.ClassLoader.defineClass1
  11  1,83% 70,37%      50 300099 java.lang.ClassLoader.defineClass1
  12  0,77% 71,14%      21 301190 com.github.niematojaktomasz.ceylonparalleliterationbenchmark.testgavin1.bench_.bench
  13  0,73% 71,87%      20 300439 java.util.zip.ZipFile.getNextEntry

Hprof samples for your benchmark for comparison. For each test I commented out other lines.

It looks like refined generics costs a lot. My guess is that constructing generic parameter types costs many times more than everything else.

quintesse commented 9 years ago

My guess is that constructing generic parameter types costs many times more than everything else.

@NiematojakTomasz Where do you base that on looking at those numbers?

gavinking commented 9 years ago

It seems most of the time takes instantiation of ceylon objects. Mostly BaseSequence

Ahyes, of course, because BaseSequence mixes in a bunch of interfaces with concrete members. So creating a BaseSequence means instantiating like 7 Xxxx$impl classes.

My guess is that constructing generic parameter types costs many times more than everything else.

No, I very much doubt it's anything to do with generics. It's to do with our currently-inefficient implementation of mixin inheritance. I had forgotten that this impacts Tuple. So I guess this performance bug will go away once we redo mixin inheritance after migrating to javac 8.

NiematojakTomasz commented 9 years ago

@NiematojakTomasz Where do you base that on looking at those numbers?

I tried to make sense of:

BaseSequence constructor consuming that much time.
Find method consuming more self time for generic Entry class than for non-generic Pair.

But I don't know much how backend is implemented. Gavin is probably right.

FroMage commented 9 years ago

The problem with this is that most people will actually want parallel (as in asynchronous) execution of generators, like they use with Future composition using for comprehensions in Scala. I'm not sure it's possible to reconcile both synchronous parallel iteration and the composition behaviour of Scala's comprehensions in our case…

gavinking commented 9 years ago

I don't follow. What I'm proposing has nothing to do with Scala for comprehension. I didn't even notice until just now that the syntax looks similar.

FroMage commented 9 years ago

I know it has nothing to do with it. What I'm saying is that I think people will expect it to have to do with it ;)

NiematojakTomasz commented 9 years ago

I didn't even notice until just now that the syntax looks similar.

And it means cartesian product. I thought so, some other language have such feature ; )

For other people who haven't googled it yet - http://docs.scala-lang.org/tutorials/tour/sequence-comprehensions.html

gdejohn commented 9 years ago

And it means cartesian product.

@NiematojakTomasz, I strongly agree with @gavinking that while there is slight potential for confusion with the Cartesian product upon seeing this syntax for the very first time, such a misunderstanding is only gonna happen at most once per person. This feature is very regular with the rest of the language, and it's simple and intuitive enough that a bit of documentation will suitably address the concerns here. This is the right way to do parallel iteration, and nested comprehensions are the right way to do Cartesian products.

I think people will expect it to have to do with [Scala for comprehensions].

@FroMage, I suppose Scala people would, but would anyone else? And how many Scala people are there? And anyway, I think this would be just one more departure from Scala that reflects better on Ceylon.

FroMage commented 9 years ago

To be clear, I like the current proposal personally, but I've heard feedback from Ceylon talks where people asked me explicitly if our comprehensions allowed for composition like Scala's and so I'm merely relaying user feedback. Perhaps there's a way to reconcile both? I doubt it because I think Scala's comprehensions are based on Gonads something, while ours are not.

gavinking commented 9 years ago

@FroMage we already have a syntax that works like Scala's comprehensions. That is, if I understand correctly, this:

for (x <- xs; y <- ys) yield new Point(x,y)

Is written like this in Ceylon:

for (x in xs) for (y in ys) Point(x,y)

Except that, as you mention, Scala's comprehensions are defined in terms of flatMap() whereas ours are defined in terms of iterators.

gavinking commented 9 years ago

Even after this discussion, and after reflection, I'm think I'm still in favor of doing this. But it's not for 1.2.

luolong commented 8 years ago

I'd cast my vote in favor of this proposal as well.

ceylon / ceylon-spec

parellel iteration in for #1174