aol / cyclops

An advanced, but easy to use, platform for writing functional applications in Java 8.
Apache License 2.0
1.31k stars 136 forks source link

Make shuffle lazy / intermediate operation #99

Closed johnmcclean closed 8 years ago

johnmcclean commented 8 years ago

Current shuffle implementation consumes the stream before shuffling. A reasonable compromise would be to group by some largeish number before before shuffling that collection, followed by flattening to a single traversable again.

lukaseder commented 8 years ago

I had thought about this, too, when implementing Seq.shuffle(), but is that sufficiently random for most cases? E.g. how often do you get a corner case where a "largish" number of A values precede a "largish" number of B values?

johnmcclean commented 8 years ago

Good point - there probably isn't a perfect solution to this (at least I can't think of one yet). A new operator (or two equivalents for shuffle() and shuffle(random)) might help - e.g.

public Seq<T> shuffle(long shuffleGroupSize);

Which would leave the question as to whether shuffle() should be intermediate with a default batch size, or consume the Stream?

There are limitations with infinite Streams if we consume the whole stream before shuffling - but also other challenges remain even with grouping e.g.

(Reactive)Seq.generate(this::nextElement()).shuffle().limit(100);

Ideally the limit should be applied to the group size within shuffle also (so we don't generate more than 100 elements), but I suspect the cost / benefit of implementing that kind of logic is not worth it.

It's mentioned in a blog post here by the way -> https://habrahabr.ru/post/262139/

lukaseder commented 8 years ago

There are limitations with infinite Streams if we consume the whole stream before shuffling

Indeed, shuffle doesn't make sense (or at least, it's dangerous) with infinite streams. But then again, the same is true for vanilla JDK's Stream.distinct() or Stream.sorted(). Both are not possible to implement accurately without buffering the entire stream, see for instance java.util.stream.SortedOps.RefSortingSink, which collects the entire "upstream" into an ArrayList. The question is really how much we value accuracy for shuffle().

It's mentioned in a blog post here by the way -> https://habrahabr.ru/post/262139/

Yes, I've seen that. @amaembo keeps trying desperately to get more stars than jOOλ for his https://github.com/amaembo/streamex library ;-)

amaembo commented 8 years ago

Hello guys. As I was highlighted here, let me comment as well :-)

The problem with jOOλ shuffle (as well as with some other operations) is that they are absolutely eager. Every intermediate operation in normal Stream API (even including sorted()) does not do anything and does not consume elements until terminal operation performed and until the element is actually requested. However jOOλ shuffle only looks like intermediate operation, it's actually terminal+stream source. To illustrate this, consider, for example, such code:

List<Integer> list = IntStream.range(0, 1000000).boxed().collect(Collectors.toList());
System.out.println(Seq.seq(list).shuffle().prepend(1).findFirst().get());

Here we don't need to perform shuffle at all, but it's actually performed. If we replace shuffle() with sorted() then it will be correctly skipped.

The Stream API philosophy assumes that even full-barrier intermediate operation (like sorted()) is lazy. I respect this philosophy in my library, I have no eager intermediate operations at all. Probably it's not important for jOOλ and simple-react.

In StreamEx, I don't have shuffle at all, because to my opinion it's rarely necessary. If it's necessary, one can easily implement it though using kick-off feature headTail() (see the Stream.sorted() section in the linked blogpost, just replace sort(null) with shuffle()). Such implementation is actually lazy and it can be comfortably used with chain() method ( @lukaseder named such thingy as transform()).

lukaseder commented 8 years ago

The problem with jOOλ shuffle (as well as with some other operations) is that they are absolutely eager.

Huh, you're right! This should be easy to implement lazily, though. Will fix : https://github.com/jOOQ/jOOL/issues/195. Thanks for the feedback!

Probably it's not important for jOOλ and simple-react.

It is important, and it's an oversight.

@lukaseder named such thingy as transform()

Big Brother is watching me! :)

danieldietrich commented 8 years ago

@lukaseder @amaembo you are not alone 👽 too many throwing function and stream extensions here at github

johnmcclean commented 8 years ago

It's simple enough to shuffle within a flatMap operator after grouping, so I think this is a reasonable approach - https://github.com/jOOQ/jOOL/issues/195

I'll close this issue and keep an eye on the jOOλ one, I can have a crack at implementing it if you like?

Thanks for your input guys! @lukaseder @amaembo @danieldietrich

lukaseder commented 8 years ago

I can have a crack at implementing it if you like?

Of course! :) Very appreciated!