apache / datasketches-java

A software library of stochastic streaming algorithms, a.k.a. sketches.
https://datasketches.apache.org
Apache License 2.0
878 stars 206 forks source link

Merge and Update a Theta Sketch #148

Closed szunami closed 7 years ago

szunami commented 7 years ago

I'm trying to update a theta sketch using a spark typed imperative aggregate. In order to achieve this, I want a representation of the sketch which supports both update and union. In my brief experimentation, this doesn't seem possible (or I could be missing something).

I hope I'm just missing something simple; thanks in advance!

jmalkin commented 7 years ago

Why would it be awkward to update and merge the Union multiple times? You're basically describing the reason Union supports update(datum) call in the first place.

szunami commented 7 years ago

Is there a good way to combine two unions into a union? I guess I could call Compact result on one of them, and then update the other using the new compact sketch? Either way this was at least a bit unintuitive.

On May 22, 2017 7:49 PM, "Jon Malkin" notifications@github.com wrote:

Why would it be awkward to update and merge the Union multiple times? You're basically describing the reason Union supports an update() call in the first place.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/DataSketches/sketches-core/issues/148#issuecomment-303250419, or mute the thread https://github.com/notifications/unsubscribe-auth/AAZ_a_0qxfda1pGJwBpTaKbWTVEiQq9eks5r8h73gaJpZM4NjA5f .

AlexanderSaydakov commented 7 years ago

Have you seen our Spark example here? https://datasketches.github.io/docs/Theta/ThetaSparkExample.html I am not sure what exactly is special about typed imperative aggregates you want. I am hoping that our example can be adapted to your use case.

leerho commented 7 years ago

@sszuflita

I think you have some fundamental misunderstandings of the architecture of the Theta Sketches, which, I admit, could use some improved documentation on the website :)

In the Theta family, the Sketch class is an abstract base class that has two abstract sub-classes: UpdateSketch and CompactSketch. Compact sketches are not updatable and are essentially read-only. The update sketches are, of course updatable. The base Sketch class cannot know what subclass is actually being represented, so it cannot support update operations, but exists to support common operations, such as getEstimate().

The Union class supports high performance, iterative unioning of any type of Sketch including compact (ordered and un-ordered), direct compact (ordered and un-ordered), update and direct update sketches) as well as updating the union with any of the primitives. So the Union can be somewhat of a Swiss army knife, in that you can use a only a union class in many situations where you might otherwise need both a Sketch and a Union side by side, as in many aggregators.

However, the Union cannot be updated directly with another Union object. Because, in order to achieve high iterative union performance (e.g., unioning millions of Sketch objects together), the internal state of the Union is not completely "resolved" until getResult() is called, which forces essentially a rebuild of the internal state of the sketch that produces the CompactSketch result. There is no getEstimate() on a union as it would require a rebuild to resolve the internal state and would as expensive as getResult().getEstimate(). Because the Union does not support getEstimate() directly, we don't consider the Union to be a true "sketch", but rather a "Set Operation" or operator. (This is why Union extends SetOperation, not Sketch.)

The CompactSketch can be used as input for both the Union or the simpler PairwiseSetOperation, which is much more restricted to just ordered, compact sketches, but is faster when you only need to union a few sketches that qualify.

com.yahoo.sketches.theta.Union seems the most feasible; it can be updated and converted to a compact result, but ultimately it would be fairly awkward to update and merge several times.

Your 3rd statement is a bit fuzzy. The Union operator is specifically designed for frequent, flexible and iterative updating using either update(Sketch) or update(<primitive>). Attempting to update a Union with another Union object is not supported because it is a performance killer and can be easily achieved using union1.update(union2.getResult()). It is not recommended that you do this frequently, but making this conversion explicit makes shooting yourself in the foot very visible :)

szunami commented 7 years ago

Thanks a lot for the detailed explanation. I'd be happy to contribute some documentation with the info you just provided.

I'll experiment with the union.update(otherUnion.getResult()) for now and explore other options if this proves to be a blocker.

Fwiw, I find the name union to be somewhat misleading for something with quick updates and slow merges.

On May 22, 2017 9:07 PM, "Lee Rhodes" notifications@github.com wrote:

@sszuflita https://github.com/sszuflita

I think you have some fundamental misunderstandings of the architecture of the Theta Sketches, which, I admit, could use some improved documentation on the website :)

In the Theta family, the Sketch class is an abstract base class that has two abstract sub-classes: UpdateSketch and CompactSketch. Compact sketches are not updatable and are essentially read-only. The update sketches are, of course updatable. The base Sketch class cannot know what subclass is actually being represented, so it cannot support update operations, but exists to support common operations, such as getEstimate().

The Union class supports high performance, iterative unioning of any type of Sketch including compact (ordered and un-ordered), direct compact (ordered and un-ordered), update and direct update sketches) as well as updating the union with any of the primitives. So the Union can be somewhat of a Swiss army knife, in that you can use a only a union class in many situations where you might otherwise need both a Sketch and a Union side by side, as in many aggregators.

However, the Union cannot be updated directly with another Union object. Because, in order to achieve high iterative union performance (e.g., unioning millions of Sketch objects together), the internal state of the Union is not completely "resolved" until getResult() is called, which forces essentially a rebuild of the internal state of the sketch that produces the CompactSketch result. There is no getEstimate() on a union as it would require a rebuild to resolve the internal state and would as expensive as getResult().getEstimate(). Because the Union does not support getEstimate() directly, we don't consider the Union to be a true "sketch", but rather a "Set Operation" or operator. (This is why Union extends SetOperation, not Sketch.)

The CompactSketch can be used as input for both the Union or the simpler PairwiseSetOperation, which is much more restricted to just ordered, compact sketches, but is faster when you only need to union a few sketches that qualify.

com.yahoo.sketches.theta.Union seems the most feasible; it can be updated and converted to a compact result, but ultimately it would be fairly awkward to update and merge several times.

Your 3rd statement is a bit fuzzy. The Union operator is specifically designed for frequent, flexible and iterative updating using either update(Sketch) or update(). Attempting to update a Union with another Union object is not supported because it is a performance killer and can be easily achieved using union1.update(union2.getResult()). It is not recommended that you do this frequently, but making this conversion explicit makes shooting yourself in the foot very visible :)

— You are receiving this because you were mentioned.

Reply to this email directly, view it on GitHub https://github.com/DataSketches/sketches-core/issues/148#issuecomment-303261348, or mute the thread https://github.com/notifications/unsubscribe-auth/AAZ_a6NiTjpkmPIzfbp6kDYfc69nCw9zks5r8jE3gaJpZM4NjA5f .

jmalkin commented 7 years ago

A union has an efficient update() and update().

IIRC, @AlexanderSaydakov had to jump through some hoops to get Spark to process things in a reasonably efficient manner. But that seemed to be due to assumptions and Inflexibility in Spark's processing model.

szunami commented 7 years ago

I was able to do a bit of testing using the Union approach, and the typed imperative aggregate was on par with the other aggregations I am computing, so the merge isn't proving to be a bottleneck (on smaller datasets). Thanks again for the input here; think we can close this out.

AlexanderSaydakov commented 7 years ago

But that seemed to be due to assumptions and Inflexibility in Spark's processing model.

I am not sure I would go so far as to say that Spark has inflexible processing model. It seems to be very flexible to me. My difficulty was to learn some of this variety of ways to organize processing in Spark to achieve good performance with Theta sketches. I hope I described the ways I found on that example page I linked.

I still would love to understand what exactly is special about this "typed imperative" aggregation. Does it present some other difficulty, so that my examples are not relevant? @sszuflita, could you elaborate on this please?

szunami commented 7 years ago

The main difference between the provided spark example and the TypedImperativeAggregate approach is that the provided spark example uses a Map Reduce approach; it assumes we do all of our updates, and then all of our merges. Depending on the TIA aggregate mode, merges may be called on intermediate results [1].

The spark example also relies on a decent amount of casting and null checking, and it is not at all clear to me that the code as written would correctly handle interleaving updates and merges.

[1] Javadoc on the spark TIA: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/interfaces.scala#L398