apache / datasketches-pig

Sketch adaptors for Pig.
https://datasketches.apache.org
Apache License 2.0
9 stars 9 forks source link

EvalFunc Return Type #26

Closed kainoa21 closed 4 years ago

kainoa21 commented 8 years ago

I am curious about the decision to have the return type of all of the *ToSketch UDFs be Tuple as opposed to DataByteArray.

When attempting to use these UDFs, it seems more natural to just have them return the DataByteArray as opposed to wrapping it in a Tuple? I couldn't find any instances where these functions really needed to return more than the single binary representation of the sketch.

Wrapping the result in a Tuple requires the user to call FLATTEN before being able to store the intermediate result or to pass the result into one of the UDFs to extract the estimate.

AlexanderSaydakov commented 8 years ago

There is a reason for this. I don't recall exactly what it was. I thought about this and tried it at some point during development. I think it has to do with the way Pig packages the intermediate data (data between the stages).

kainoa21 commented 8 years ago

Agreed that with the Algebraic interface you have to do a type check, but that has to happen no matter what (https://github.com/DataSketches/sketches-pig/blob/master/src/main/java/com/yahoo/sketches/pig/theta/DataToSketch.java#L517)

Or are you referring to something else?

In other similar libraries that compute binary data structures, they generally just return the DataByteArray type directly. This allows a cleaner syntax, especially when used together with the estimate udf like getEstimate(dataToSketch(bag.field))

Would you be open to breaking PR that changes the return types of the sketch functions?

AlexanderSaydakov commented 8 years ago

The initial stage in the algebraic mode is pass-through by design. So it has to return the original tuple. This is done for performance. Otherwise each input record would be converted into a sketch with one entry.

kainoa21 commented 8 years ago

That makes perfect sense and the return types of the Initial and Intermediate exec functions have to be Tuple. But the return type of the Final exec just needs to match the generic type defined by the UDF.

I think this is a great library but this tripped me up when I tried to use it the first time because I was expecting the sketch udf to return a DataByteArray as that is more the norm for these types of functions.

AlexanderSaydakov commented 8 years ago

I want to clarify this issue.

Our design decision was to have a pass-through initial stage of the Algebraic interface for performance reasons. The alternative would be to produce a sketch with one entry for each input row, and in the next stages only deal with merging sketches. This would work, but would be very inefficient.

A pass-through initial stage has to return the original tuple, and this dictates the return types of all other stages and the UDF overall because Pig expects the types to agree.

Yes, I agree, it does look a bit unnatural and unnecessary on the first glance, but having to flatten the result is a small price to pay for the efficiency.

kainoa21 commented 8 years ago

I believe this statement is incorrect: "A pass-through initial stage has to return the original tuple, and this dictates the return types of all other stages and the UDF overall because Pig expects the types to agree."

That was the point I was trying to make in my previous comment, that the return type for the Initial and Intermediate stage is expected to be Tuple, but that the return type from the Final stage is what matters and that is the only one that has to match the generic type of the EvalFunc class.

AlexanderSaydakov commented 8 years ago

Oh, now I see your point. You might be right. I would like to investigate this. These UDFs were implemented several years ago by somebody else, and I took this requirement of the same output type for granted. Maybe it was required before, I am not sure. If this works as you suggest, we may change the UDFs at some point, but it is an incompatible change, so we will need to find the right opportunity. I would suggest keeping this issue open for now.

kainoa21 commented 8 years ago

Thanks @AlexanderSaydakov. As a POC, see this fork which makes this change for the Theta sketch.

I verified that this works and gives consistent results with the test data and script found here. In fact, it works with the script as is: flatten(dataToSketch(a.id)) as (sketch) AND also works calling it like so: dataToSketch(a.id) as sketch. So in some sense it may be pseudo backward compatible.

I do think it makes the resulting pig syntax a little cleaner, unsure if that is really worth it.

leerho commented 4 years ago

I am labeling this as a possible enhancement, but low priority and closing this issue after lingering for over 3 years. We can always reopen this issue and revisit this in the future.