Closed szunami closed 7 years ago
Why would it be awkward to update and merge the Union multiple times? You're basically describing the reason Union supports update(datum) call in the first place.
Is there a good way to combine two unions into a union? I guess I could call Compact result on one of them, and then update the other using the new compact sketch? Either way this was at least a bit unintuitive.
On May 22, 2017 7:49 PM, "Jon Malkin" notifications@github.com wrote:
Why would it be awkward to update and merge the Union multiple times? You're basically describing the reason Union supports an update() call in the first place.
— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/DataSketches/sketches-core/issues/148#issuecomment-303250419, or mute the thread https://github.com/notifications/unsubscribe-auth/AAZ_a_0qxfda1pGJwBpTaKbWTVEiQq9eks5r8h73gaJpZM4NjA5f .
Have you seen our Spark example here? https://datasketches.github.io/docs/Theta/ThetaSparkExample.html I am not sure what exactly is special about typed imperative aggregates you want. I am hoping that our example can be adapted to your use case.
@sszuflita
I think you have some fundamental misunderstandings of the architecture of the Theta Sketches, which, I admit, could use some improved documentation on the website :)
In the Theta family, the Sketch
class is an abstract base class that has two abstract sub-classes: UpdateSketch
and CompactSketch
. Compact sketches are not updatable and are essentially read-only. The update sketches are, of course updatable. The base Sketch
class cannot know what subclass is actually being represented, so it cannot support update operations, but exists to support common operations, such as getEstimate()
.
The Union
class supports high performance, iterative unioning of any type of Sketch
including compact (ordered and un-ordered), direct compact (ordered and un-ordered), update and direct update sketches) as well as updating the union with any of the primitives. So the Union can be somewhat of a Swiss army knife, in that you can use a only a union class in many situations where you might otherwise need both a Sketch and a Union side by side, as in many aggregators.
However, the Union
cannot be updated directly with another Union
object. Because, in order to achieve high iterative union performance (e.g., unioning millions of Sketch
objects together), the internal state of the Union is not completely "resolved" until getResult()
is called, which forces essentially a rebuild of the internal state of the sketch that produces the CompactSketch
result. There is no getEstimate()
on a union as it would require a rebuild to resolve the internal state and would as expensive as getResult().getEstimate()
. Because the Union does not support getEstimate() directly, we don't consider the Union to be a true "sketch", but rather a "Set Operation" or operator. (This is why Union
extends SetOperation
, not Sketch
.)
The CompactSketch
can be used as input for both the Union
or the simpler PairwiseSetOperation
, which is much more restricted to just ordered, compact sketches, but is faster when you only need to union a few sketches that qualify.
com.yahoo.sketches.theta.Union seems the most feasible; it can be updated and converted to a compact result, but ultimately it would be fairly awkward to update and merge several times.
Your 3rd statement is a bit fuzzy. The Union operator is specifically designed for frequent, flexible and iterative updating using either update(Sketch)
or update(<primitive>)
. Attempting to update a Union with another Union object is not supported because it is a performance killer and can be easily achieved using union1.update(union2.getResult())
. It is not recommended that you do this frequently, but making this conversion explicit makes shooting yourself in the foot very visible :)
Thanks a lot for the detailed explanation. I'd be happy to contribute some documentation with the info you just provided.
I'll experiment with the union.update(otherUnion.getResult()) for now and explore other options if this proves to be a blocker.
Fwiw, I find the name union to be somewhat misleading for something with quick updates and slow merges.
On May 22, 2017 9:07 PM, "Lee Rhodes" notifications@github.com wrote:
@sszuflita https://github.com/sszuflita
I think you have some fundamental misunderstandings of the architecture of the Theta Sketches, which, I admit, could use some improved documentation on the website :)
In the Theta family, the Sketch class is an abstract base class that has two abstract sub-classes: UpdateSketch and CompactSketch. Compact sketches are not updatable and are essentially read-only. The update sketches are, of course updatable. The base Sketch class cannot know what subclass is actually being represented, so it cannot support update operations, but exists to support common operations, such as getEstimate().
The Union class supports high performance, iterative unioning of any type of Sketch including compact (ordered and un-ordered), direct compact (ordered and un-ordered), update and direct update sketches) as well as updating the union with any of the primitives. So the Union can be somewhat of a Swiss army knife, in that you can use a only a union class in many situations where you might otherwise need both a Sketch and a Union side by side, as in many aggregators.
However, the Union cannot be updated directly with another Union object. Because, in order to achieve high iterative union performance (e.g., unioning millions of Sketch objects together), the internal state of the Union is not completely "resolved" until getResult() is called, which forces essentially a rebuild of the internal state of the sketch that produces the CompactSketch result. There is no getEstimate() on a union as it would require a rebuild to resolve the internal state and would as expensive as getResult().getEstimate(). Because the Union does not support getEstimate() directly, we don't consider the Union to be a true "sketch", but rather a "Set Operation" or operator. (This is why Union extends SetOperation, not Sketch.)
The CompactSketch can be used as input for both the Union or the simpler PairwiseSetOperation, which is much more restricted to just ordered, compact sketches, but is faster when you only need to union a few sketches that qualify.
com.yahoo.sketches.theta.Union seems the most feasible; it can be updated and converted to a compact result, but ultimately it would be fairly awkward to update and merge several times.
Your 3rd statement is a bit fuzzy. The Union operator is specifically
designed for frequent, flexible and iterative updating using either
update(Sketch) or update(
— You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub https://github.com/DataSketches/sketches-core/issues/148#issuecomment-303261348, or mute the thread https://github.com/notifications/unsubscribe-auth/AAZ_a6NiTjpkmPIzfbp6kDYfc69nCw9zks5r8jE3gaJpZM4NjA5f .
A union has an efficient update(
IIRC, @AlexanderSaydakov had to jump through some hoops to get Spark to process things in a reasonably efficient manner. But that seemed to be due to assumptions and Inflexibility in Spark's processing model.
I was able to do a bit of testing using the Union approach, and the typed imperative aggregate was on par with the other aggregations I am computing, so the merge isn't proving to be a bottleneck (on smaller datasets). Thanks again for the input here; think we can close this out.
But that seemed to be due to assumptions and Inflexibility in Spark's processing model.
I am not sure I would go so far as to say that Spark has inflexible processing model. It seems to be very flexible to me. My difficulty was to learn some of this variety of ways to organize processing in Spark to achieve good performance with Theta sketches. I hope I described the ways I found on that example page I linked.
I still would love to understand what exactly is special about this "typed imperative" aggregation. Does it present some other difficulty, so that my examples are not relevant? @sszuflita, could you elaborate on this please?
The main difference between the provided spark example and the TypedImperativeAggregate approach is that the provided spark example uses a Map Reduce approach; it assumes we do all of our updates, and then all of our merges. Depending on the TIA aggregate mode, merges may be called on intermediate results [1].
The spark example also relies on a decent amount of casting and null checking, and it is not at all clear to me that the code as written would correctly handle interleaving updates and merges.
[1] Javadoc on the spark TIA: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/interfaces.scala#L398
I'm trying to update a theta sketch using a spark typed imperative aggregate. In order to achieve this, I want a representation of the sketch which supports both update and union. In my brief experimentation, this doesn't seem possible (or I could be missing something).
I hope I'm just missing something simple; thanks in advance!