Closed phstudy closed 4 years ago
Could you give a specific example of a behavior you don't like?
I think estimation function should return 0
rather than null
. It's much convenient, otherwise I have to write case with
to handle the null
value. The estimation function in BigQuery HLL_COUNT
also returns 0
when input value is null
.
Estimation function | Sketch | Function | Return value |
---|---|---|---|
Theta Sketch | theta_sketch_get_estimate(null) | null | |
HLL_COUNT | HLL_COUNT.EXTRACT(null) | 0 |
Although it is true that currently Java equates null with empty for union, intersection and difference. However, we have been getting feedback that the impact of null=empty for intersection and difference operations seriously affects the result and by silently substituting empty the user would get weird results without any warning or indication why. It would also be difficult to track down and debug. (Note: null=empty in union is rather harmless.) So when we move to -java version 2.0.0, we will be changing that behavior so that a null with intersection and difference operations will throw an exception.
I realize that nulls occur quite frequently in all kinds of raw data. But the significance of null is clearly dependent on the application. Nonetheless, nulls propagate and what you show above is an example of what could be very undesirable propagation of a null. And when doing numerical analysis of data, one needs to be very clear about the handling of nulls, otherwise, one could end up with garbage results and without any warning. How nulls are handled in one application could be quite different in another application.
For a foundation library, such as DataSketches, I think it is very risky for us to assume that null=empty everywhere. And just because BigQuery assumes that, does not make it good policy.
I got it. Thanks!
Union and intersection operations treat null value as empty sketch in datasketches-java. I think it's less meaning to return
null
when calculatingtheta_sketch_get_estimate
andtheta_sketch_intersection
.Should we follow the rule in PostgreSQL?