apache / datasketches-postgresql

PostgreSQL extension providing approximate algorithms based on apache/datasketches-cpp
https://datasketches.apache.org
Apache License 2.0
86 stars 11 forks source link

Should we treat `null` as empty sketch? #27

Closed phstudy closed 4 years ago

phstudy commented 4 years ago

Union and intersection operations treat null value as empty sketch in datasketches-java. I think it's less meaning to return null when calculating theta_sketch_get_estimate and theta_sketch_intersection.

Should we follow the rule in PostgreSQL?

AlexanderSaydakov commented 4 years ago

Could you give a specific example of a behavior you don't like?

phstudy commented 4 years ago

I think estimation function should return 0 rather than null. It's much convenient, otherwise I have to write case with to handle the null value. The estimation function in BigQuery HLL_COUNT also returns 0 when input value is null.

Estimation function Sketch Function Return value
Theta Sketch theta_sketch_get_estimate(null) null
HLL_COUNT HLL_COUNT.EXTRACT(null) 0
leerho commented 4 years ago

Although it is true that currently Java equates null with empty for union, intersection and difference. However, we have been getting feedback that the impact of null=empty for intersection and difference operations seriously affects the result and by silently substituting empty the user would get weird results without any warning or indication why. It would also be difficult to track down and debug. (Note: null=empty in union is rather harmless.) So when we move to -java version 2.0.0, we will be changing that behavior so that a null with intersection and difference operations will throw an exception.

I realize that nulls occur quite frequently in all kinds of raw data. But the significance of null is clearly dependent on the application. Nonetheless, nulls propagate and what you show above is an example of what could be very undesirable propagation of a null. And when doing numerical analysis of data, one needs to be very clear about the handling of nulls, otherwise, one could end up with garbage results and without any warning. How nulls are handled in one application could be quite different in another application.

For a foundation library, such as DataSketches, I think it is very risky for us to assume that null=empty everywhere. And just because BigQuery assumes that, does not make it good policy.

phstudy commented 4 years ago

I got it. Thanks!