Closed ghost closed 5 years ago
Even with a provisional :+1:, the cats issue would be able to be closed, hence a /CC to @johnynek as a cats maintainer
So there's no agreement right now that the Aggregator
typeclass should be in cats, right? What could be done for now is an origami-twitter
module with a method to transform some of origami Folds
into Twitter Aggregator
instances.
Aggregator is parallel. Fold is sequential. Algebird has both.
Aggregator is basically: T => A, Semigroup[A], A => B
and can be an applicative.
MonoidAggregator is the same with a Monoid.
If I could do it again, I would do:
trait Aggregator[-I, +O] {
type A
def prepare(i: I): A
def semigroup: Semigroup[A]
def present(a: A): O
}
(hide the middle type in a type member).
Since a semigroup can be applied in parallel (since it is associative), the Aggregator can.
Also, CommutativeAggregator is interesting, since in map-reduce settings (like spark, hadoop), you can do pre-aggregation before you group by the key to finish the combining.
Not the same as Fold. All Aggregators can be a Fold, but not vice-versa.
Thanks for the explanation, now I understand. I'm wondering if we generalize this to Applicative
:
trait Aggregator[F[_], -I, +O] {
type A
val applicative: Alternative[F]
def prepare(i: I): F[A]
def present(a: A): F[O]
}
I think that Const[A, B]
where there A : Monoid
gives rise to an Alternative[Const[A, ?]]
no? Then we can even have some effects in our Aggregator
s.
I haven't thought about it. There may be some useful examples, but algebird has MANY Aggregator examples. So many things can be expressed that way (virtually everything anyone uses from SQL for instance). We got by at Stripe with almost all our feature generation for a long time being expressed as Aggregators.
So, I would strongly encourage you to have this type since so many computations fall into the pattern. For Alternative, you have to be prepare to make a monoid for all A
, no? That seems to dramatically weaken the number of possible implementations since you can't specify the monoid based on A
, which is quite useful.
For Alternative, you have to be prepare to make a monoid for all A, no?
That was the purpose of my comment about using Const
for that case. I need to try it out and come back to you.
I also agree this type is wonderfully powerful yet deceptively simple. My commercial use case was for a distributed, big data database for multi-dimensional time series. The exact dimension was a runtime parameter, so the monoid's zero
also only known at runtime. So I had to use a Semigroup
, in this case a CommutativeSemigroup
was a better fit.
So this is why I'm such a fan of the Aggregator
but also, adding to the discussion above, why I think it be best not to have it in cats; all the other examples make sense too so why not have all of them?
I'm sure a quick flick through some literature/haskell libs would yield a whole bunch of other additions for folds, but would bloat out cats. So why not have the fold
as-is in cuddly-cats, and have origami as the goto for all the rest?
wonderfully powerful yet deceptively simple
Hmm, perhaps that should have been "deceptively powerful yet wonderfully simple". I'm sure you get my gist :wink:
More importantly is to /CC @benhutchison who originally raised the issue and to add that a Reducer typeclass would be another possible addition, perhaps as an alternative to cats Reducible.
I'm certainly not suggesting that origami be a general bucket for fold-like-things that don't make it into cats. Far from that. I think that there are many justified use cases that are perhaps a bit too specialized/edge-casy to fit into a more general FP library but fully deserve to exist. I concede that this might be extending the original intent from "Monadic folds" to perhaps "Monadic folds and more."
And finally, here's another potential client for an Aggregator
, from Spark - Consider using Algebird's Aggregator instead of org.apache.spark.sql.expressions.Aggregator
Just wondering. An Aggregator
could be defined in origami with
def aggregator[A, O : Monoid, B](f: A => O, finalize: O => B) = new FoldId[A, B] {
type S = O
def start = Monoid[O].empty
def fold = (s: S, a: A) => Monoid[O].combine(s, f(a))
def end(s: S) = finalize(s)
}
The main difference with the Aggregator
trait in Algebird (if we except the question of Semigroup
vs Monoid
for now) is that the "traversal" methods like reduce
are explicitly not provided by the Fold
datatype. It is the responsibility of the various types of streams to decide how they want to run those folds, possibly sequentially on the whole stream or in parallel. This should work because the intermediate state is always returned by the fold
method.
On the question of Semigroup
vs Monoid
it seems to me that even for the case where an empty
element cannot be produced (which is problematic I guess if you run a fold on an empty stream anyway), you can always create a Monoid[Option[A]]
from Semigroup[A]
(if we leave boxing/unboxing questions on the side for now).
Algebird's fold is here: https://github.com/twitter/algebird/blob/develop/algebird-core/src/main/scala/com/twitter/algebird/Fold.scala
indeed you can create a Fold from an Aggregator: https://github.com/twitter/algebird/blob/develop/algebird-core/src/main/scala/com/twitter/algebird/Aggregator.scala#L430
but again, once you have a fold, you can't work in parallel.
By contrast, with an Aggregator, after you do the initial prepare, you can build a tree of aggregation with the semigroup (as you can do any semigroup in parallel).
Additionally, if you have a commutative semigroup, you can do "map-side" aggregation, which is a huge performance improvement in hadoop using scalding or spark.
Lastly, immutable datastructures sadly often give you a big performance hit in these applications. Using a semigroup, we added sumOption
which became combineAllOption
in cats. This allows you to use a locally mutable algorithm to combine a big batch of values, or use parallelism, which make it easy to dramatically improve performance. Of course, in fact Folds can close over mutable state as well if things are hidden properly (indeed algebird's Fold keeps the internals private and only allows you to apply the Fold to a traversableOnce, empty, or single item). For instance, we have an immutable count-min-sketch, but we can use the mutable builder internally for batch aggregating, which can give something like a 1000x speed improvement in map-reduce:
I realize, I am repeating myself somewhat but what your comment seems to be missing is that in a map reduce setting (e.g. spark), you can use Aggregator for parallel aggregation before doing the groupings by key.
Indeed you can always lift a Semigroup into a Monoid. If you don't mind boxing. Since scala boxes so much anyway, I'm not sure this is a huge deal. If you keep Semigroup, you can push the Option to applying it, eg. def combineAllOption(ts: TraversableOnce[T]): Option[T]
which means you only have 1 boxing for a batch, not every value.
commutative semigroup, you can do "map-side" aggregation,
with mutable combines is just what we had.
Also used twitterFutureSemigroup for final aggregation
@johnynek thanks for your RDD example. I don't remember the hadoop's / spark specifics but performance-wise is it the same to:
source.map(v => (v, f(v))).reduceByKey(g)
rather than
source.map(v => (v, v)).reduceByKey(v => g(f(v)))
Also, as discussed with @BennyHill I think that the main difference between folds in origami and algebird is that there are no traversal methods at all on origami folds. This is left to the collections using the folds to implement the traversals, a bit like you do with RDDs.
WRT the motives for my original proposal, its funny how limited they were and how divergent from the present discussion.
I was trying to do an efficient append operation over a parametrically typed container (underneath a Vector), and wondered "What type-class can I bring in defines append?". Googling around the only one I found was Ed Kmetts Reducer [ https://hackage.haskell.org/package/reducers-3.12.1/docs/Data-Semigroup-Reducer.html] which defined cons and snoc (ie append).
I kinda assumed at the time there was some foundational maths justifying why snoc was on the Reducer interface. But since then Ive come to the opinion that its inclusion in the Haskell typeclass was pretty arbitrary and perhaps a mistake; eg not all reducible containers need necessarily have 2 "ends" to append at.
I postulate a Snoc/Append typeclass might be separately useful, but IMO it should be separated from reduction folds.
On Thu, Aug 24, 2017 at 1:57 AM, Eric Torreborre notifications@github.com wrote:
Also, as discussed with @BennyHill https://github.com/bennyhill I think that the main difference between folds in origami and algebird is that there are no traversal methods at all on origami folds. This is left to the collections using the folds to implement the traversals, a bit like you do with RDDs.
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/atnos-org/origami/issues/5#issuecomment-324381037, or mute the thread https://github.com/notifications/unsubscribe-auth/AAF05B28HpgWeRziadhPrqegoB2TJhikks5sbEvSgaJpZM4O-or5 .
For the record, I rediscovered the idea of folds when working on specs2 and my first attempt was to use,... Reducers (after discussing with Ed Kmett :-)). I think I got the idea of packaging the functionalities I needed as monadic folds on my way back from LambdaJam 2014. Then I realised, through a post from Gabriel Gonzales, that this idea had already been described in 1993 (I think). Unfortunately I was looking for that original blog post recently but couldn't find it.
@benhutchison I think that cons and snoc form right monoid actions. I don't know of a term for their relationship specifically in the literature, but if something is a monoid, it will always have two ends to append at, because you can "concatenate" something onto either end.
What are the two ends of a Set?
On Sat, Aug 26, 2017 at 3:18 AM, Edmund Noble notifications@github.com wrote:
@benhutchison https://github.com/benhutchison I think that cons and snoc form right monoid actions. I don't know of a term for their relationship specifically in the literature, but if something is a monoid, it will always have two ends to append at, because you can "concatenate" something onto either end.
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/atnos-org/origami/issues/5#issuecomment-324983548, or mute the thread https://github.com/notifications/unsubscribe-auth/AAF05DK_pu4_4-iAfEUP2W8tKmRo1H-Yks5sbwHvgaJpZM4O-or5 .
Sets are commutative monoids, so both ends are equivalent.
What I mean by "always have two ends to append at" is if you have a function f: A => M
and M
is a monoid, cons
is always implementable as (a, m) => f(a) |+| m
and snoc
is always implementable as (m, a) => m |+| f(a)
. Because there's already the Monoid
constraint, you always have at least a default implementation of this, even if both of these are equivalent and the Monoid
is commutative.
I'm closing this issue now due to the lack of activity
Ref https://github.com/typelevel/cats/issues/1360
I think having an
Aggregator
somewhere would be handy, andorigami
looks a good home.Time permitting, I would be happy to help out here.