atnos-org / origami

Monadic folds
MIT License
70 stars 8 forks source link

Add Aggregator typelclass #5

Closed ghost closed 5 years ago

ghost commented 7 years ago

Ref https://github.com/typelevel/cats/issues/1360

I think having an Aggregator somewhere would be handy, and origami looks a good home.

Time permitting, I would be happy to help out here.

ghost commented 7 years ago

Even with a provisional :+1:, the cats issue would be able to be closed, hence a /CC to @johnynek as a cats maintainer

etorreborre commented 7 years ago

So there's no agreement right now that the Aggregator typeclass should be in cats, right? What could be done for now is an origami-twitter module with a method to transform some of origami Folds into Twitter Aggregator instances.

johnynek commented 7 years ago

Aggregator is parallel. Fold is sequential. Algebird has both.

Aggregator is basically: T => A, Semigroup[A], A => B and can be an applicative. MonoidAggregator is the same with a Monoid.

If I could do it again, I would do:

trait Aggregator[-I, +O] {
  type A
  def prepare(i: I): A
  def semigroup: Semigroup[A]
  def present(a: A): O
}

(hide the middle type in a type member).

Since a semigroup can be applied in parallel (since it is associative), the Aggregator can.

Also, CommutativeAggregator is interesting, since in map-reduce settings (like spark, hadoop), you can do pre-aggregation before you group by the key to finish the combining.

Not the same as Fold. All Aggregators can be a Fold, but not vice-versa.

etorreborre commented 7 years ago

Thanks for the explanation, now I understand. I'm wondering if we generalize this to Applicative:

trait Aggregator[F[_], -I, +O] {
  type A
  val applicative: Alternative[F]
  def prepare(i: I): F[A]
  def present(a: A): F[O]
}

I think that Const[A, B] where there A : Monoid gives rise to an Alternative[Const[A, ?]] no? Then we can even have some effects in our Aggregators.

johnynek commented 7 years ago

I haven't thought about it. There may be some useful examples, but algebird has MANY Aggregator examples. So many things can be expressed that way (virtually everything anyone uses from SQL for instance). We got by at Stripe with almost all our feature generation for a long time being expressed as Aggregators.

So, I would strongly encourage you to have this type since so many computations fall into the pattern. For Alternative, you have to be prepare to make a monoid for all A, no? That seems to dramatically weaken the number of possible implementations since you can't specify the monoid based on A, which is quite useful.

etorreborre commented 7 years ago

For Alternative, you have to be prepare to make a monoid for all A, no?

That was the purpose of my comment about using Const for that case. I need to try it out and come back to you.

ghost commented 7 years ago

I also agree this type is wonderfully powerful yet deceptively simple. My commercial use case was for a distributed, big data database for multi-dimensional time series. The exact dimension was a runtime parameter, so the monoid's zero also only known at runtime. So I had to use a Semigroup, in this case a CommutativeSemigroup was a better fit.

So this is why I'm such a fan of the Aggregator but also, adding to the discussion above, why I think it be best not to have it in cats; all the other examples make sense too so why not have all of them?

I'm sure a quick flick through some literature/haskell libs would yield a whole bunch of other additions for folds, but would bloat out cats. So why not have the fold as-is in cuddly-cats, and have origami as the goto for all the rest?

ghost commented 7 years ago

wonderfully powerful yet deceptively simple

Hmm, perhaps that should have been "deceptively powerful yet wonderfully simple". I'm sure you get my gist :wink:

More importantly is to /CC @benhutchison who originally raised the issue and to add that a Reducer typeclass would be another possible addition, perhaps as an alternative to cats Reducible.

I'm certainly not suggesting that origami be a general bucket for fold-like-things that don't make it into cats. Far from that. I think that there are many justified use cases that are perhaps a bit too specialized/edge-casy to fit into a more general FP library but fully deserve to exist. I concede that this might be extending the original intent from "Monadic folds" to perhaps "Monadic folds and more."

And finally, here's another potential client for an Aggregator, from Spark - Consider using Algebird's Aggregator instead of org.apache.spark.sql.expressions.Aggregator

etorreborre commented 7 years ago

Just wondering. An Aggregator could be defined in origami with

  def aggregator[A, O : Monoid, B](f: A => O, finalize: O => B) = new FoldId[A, B] {
    type S = O

    def start = Monoid[O].empty
    def fold = (s: S, a: A) => Monoid[O].combine(s, f(a))
    def end(s: S) = finalize(s)
  }

The main difference with the Aggregator trait in Algebird (if we except the question of Semigroup vs Monoid for now) is that the "traversal" methods like reduce are explicitly not provided by the Fold datatype. It is the responsibility of the various types of streams to decide how they want to run those folds, possibly sequentially on the whole stream or in parallel. This should work because the intermediate state is always returned by the fold method.

On the question of Semigroup vs Monoid it seems to me that even for the case where an empty element cannot be produced (which is problematic I guess if you run a fold on an empty stream anyway), you can always create a Monoid[Option[A]] from Semigroup[A] (if we leave boxing/unboxing questions on the side for now).

johnynek commented 7 years ago

Algebird's fold is here: https://github.com/twitter/algebird/blob/develop/algebird-core/src/main/scala/com/twitter/algebird/Fold.scala

indeed you can create a Fold from an Aggregator: https://github.com/twitter/algebird/blob/develop/algebird-core/src/main/scala/com/twitter/algebird/Aggregator.scala#L430

but again, once you have a fold, you can't work in parallel.

By contrast, with an Aggregator, after you do the initial prepare, you can build a tree of aggregation with the semigroup (as you can do any semigroup in parallel).

Additionally, if you have a commutative semigroup, you can do "map-side" aggregation, which is a huge performance improvement in hadoop using scalding or spark.

Lastly, immutable datastructures sadly often give you a big performance hit in these applications. Using a semigroup, we added sumOption which became combineAllOption in cats. This allows you to use a locally mutable algorithm to combine a big batch of values, or use parallelism, which make it easy to dramatically improve performance. Of course, in fact Folds can close over mutable state as well if things are hidden properly (indeed algebird's Fold keeps the internals private and only allows you to apply the Fold to a traversableOnce, empty, or single item). For instance, we have an immutable count-min-sketch, but we can use the mutable builder internally for batch aggregating, which can give something like a 1000x speed improvement in map-reduce:

https://github.com/twitter/algebird/blob/develop/algebird-core/src/main/scala/com/twitter/algebird/CountMinSketch.scala#L138

I realize, I am repeating myself somewhat but what your comment seems to be missing is that in a map reduce setting (e.g. spark), you can use Aggregator for parallel aggregation before doing the groupings by key.

For instance: https://github.com/twitter/algebird/blob/develop/algebird-spark/src/main/scala/com/twitter/algebird/spark/AlgebirdRDD.scala#L72

Indeed you can always lift a Semigroup into a Monoid. If you don't mind boxing. Since scala boxes so much anyway, I'm not sure this is a huge deal. If you keep Semigroup, you can push the Option to applying it, eg. def combineAllOption(ts: TraversableOnce[T]): Option[T] which means you only have 1 boxing for a batch, not every value.

ghost commented 7 years ago

commutative semigroup, you can do "map-side" aggregation,

with mutable combines is just what we had.

Also used twitterFutureSemigroup for final aggregation

etorreborre commented 7 years ago

@johnynek thanks for your RDD example. I don't remember the hadoop's / spark specifics but performance-wise is it the same to:

source.map(v => (v, f(v))).reduceByKey(g)

rather than

source.map(v => (v, v)).reduceByKey(v => g(f(v)))
etorreborre commented 7 years ago

Also, as discussed with @BennyHill I think that the main difference between folds in origami and algebird is that there are no traversal methods at all on origami folds. This is left to the collections using the folds to implement the traversals, a bit like you do with RDDs.

benhutchison commented 7 years ago

WRT the motives for my original proposal, its funny how limited they were and how divergent from the present discussion.

I was trying to do an efficient append operation over a parametrically typed container (underneath a Vector), and wondered "What type-class can I bring in defines append?". Googling around the only one I found was Ed Kmetts Reducer [ https://hackage.haskell.org/package/reducers-3.12.1/docs/Data-Semigroup-Reducer.html] which defined cons and snoc (ie append).

I kinda assumed at the time there was some foundational maths justifying why snoc was on the Reducer interface. But since then Ive come to the opinion that its inclusion in the Haskell typeclass was pretty arbitrary and perhaps a mistake; eg not all reducible containers need necessarily have 2 "ends" to append at.

I postulate a Snoc/Append typeclass might be separately useful, but IMO it should be separated from reduction folds.

On Thu, Aug 24, 2017 at 1:57 AM, Eric Torreborre notifications@github.com wrote:

Also, as discussed with @BennyHill https://github.com/bennyhill I think that the main difference between folds in origami and algebird is that there are no traversal methods at all on origami folds. This is left to the collections using the folds to implement the traversals, a bit like you do with RDDs.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/atnos-org/origami/issues/5#issuecomment-324381037, or mute the thread https://github.com/notifications/unsubscribe-auth/AAF05B28HpgWeRziadhPrqegoB2TJhikks5sbEvSgaJpZM4O-or5 .

etorreborre commented 7 years ago

For the record, I rediscovered the idea of folds when working on specs2 and my first attempt was to use,... Reducers (after discussing with Ed Kmett :-)). I think I got the idea of packaging the functionalities I needed as monadic folds on my way back from LambdaJam 2014. Then I realised, through a post from Gabriel Gonzales, that this idea had already been described in 1993 (I think). Unfortunately I was looking for that original blog post recently but couldn't find it.

edmundnoble commented 7 years ago

@benhutchison I think that cons and snoc form right monoid actions. I don't know of a term for their relationship specifically in the literature, but if something is a monoid, it will always have two ends to append at, because you can "concatenate" something onto either end.

benhutchison commented 7 years ago

What are the two ends of a Set?

On Sat, Aug 26, 2017 at 3:18 AM, Edmund Noble notifications@github.com wrote:

@benhutchison https://github.com/benhutchison I think that cons and snoc form right monoid actions. I don't know of a term for their relationship specifically in the literature, but if something is a monoid, it will always have two ends to append at, because you can "concatenate" something onto either end.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/atnos-org/origami/issues/5#issuecomment-324983548, or mute the thread https://github.com/notifications/unsubscribe-auth/AAF05DK_pu4_4-iAfEUP2W8tKmRo1H-Yks5sbwHvgaJpZM4O-or5 .

edmundnoble commented 7 years ago

Sets are commutative monoids, so both ends are equivalent.

What I mean by "always have two ends to append at" is if you have a function f: A => M and M is a monoid, cons is always implementable as (a, m) => f(a) |+| m and snoc is always implementable as (m, a) => m |+| f(a). Because there's already the Monoid constraint, you always have at least a default implementation of this, even if both of these are equivalent and the Monoid is commutative.

etorreborre commented 5 years ago

I'm closing this issue now due to the lack of activity