apache / druid

Apache Druid: a high performance real-time analytics database.
https://druid.apache.org/
Apache License 2.0

Adding support for Exact Numeric Types #10190

Open rahafrouz opened 3 years ago

rahafrouz commented 3 years ago

Description

Druid supports the Double data type, which is an approximate numeric type. Is it on the roadmap anywhere to add support for exact numeric types?

The highest numerical precision we can get with Druid is the double (64-bit) data type. Due to the nature of the binary floating-point format, we get 53 significand bits, which guarantees only 15 meaningful decimal digits (53 · log10(2) ≈ 15.955).
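
For illustration, a short Java snippet showing this limit in practice (the specific values are just examples):

```java
import java.math.BigDecimal;

// Demonstration of the precision ceiling described above: a 64-bit double has a
// 53-bit significand, so values needing more than ~15-16 significant decimal
// digits cannot be represented exactly.
public class DoublePrecisionDemo
{
  public static void main(String[] args)
  {
    double d = 0.1 + 0.2;
    System.out.println(d);                   // 0.30000000000000004
    System.out.println(new BigDecimal(0.1)); // the exact binary value actually stored for 0.1

    // 17 significant digits already exceed what a double can hold exactly.
    double big = 12345678901234567.0;
    System.out.printf("%.1f%n", big);        // 12345678901234568.0 (rounded to the nearest double)
  }
}
```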

As an example of an exact numeric type, BigQuery supports the NUMERIC data type, which gives 38 digits of precision and 9 decimal digits of scale. This is ideal for financial reporting or any other high-precision calculations.

Motivation

Support for high-precision data that ranges from very small to very large values, without data loss.

a2l007 commented 3 years ago

We have an internal custom aggregator based on BigDecimal which is being used in several production use cases. We'll look into open-sourcing it for the benefit of the community.

jihoonson commented 3 years ago

@a2l007 that sounds great! I have been looking at this area recently too, and wondering if there are any performance issues with using BigDecimal. Do you know how slow it is compared to aggregators based on primitive types?

a2l007 commented 3 years ago

We use a variant of BigDecimal called CompressedBigDecimal. The difference is that the latter is mutable and, instead of using the object representation for complex types, is represented as block-compressible columns.

We haven't had a chance to do performance testing against the primitive types, but the CompressedBigDecimal-based aggregator performs significantly better than BigDecimal itself.
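
To illustrate just the mutability point (a toy sketch, not the actual CompressedBigDecimal code or API): an accumulator backed by fixed-width 32-bit words can be updated in place, so aggregation doesn't allocate a new object per row.

```java
import java.math.BigInteger;

// Toy sketch of a *mutable* unscaled accumulator backed by 32-bit words.
// This is an assumption-level illustration of the general idea, not the
// CompressedBigDecimal implementation.
public final class MutableUnscaledSketch
{
  private final int[] words; // little-endian base-2^32 magnitude, fixed width

  public MutableUnscaledSketch(int numWords)
  {
    this.words = new int[numWords];
  }

  // Add a non-negative long into the accumulator, propagating the carry in place.
  // Overflow past the fixed width is silently dropped in this toy sketch.
  public void addInPlace(long value)
  {
    long carry = value;
    for (int i = 0; i < words.length && carry != 0; i++) {
      long sum = (words[i] & 0xFFFFFFFFL) + (carry & 0xFFFFFFFFL);
      words[i] = (int) sum;
      carry = (carry >>> 32) + (sum >>> 32);
    }
  }

  // Convert the accumulated magnitude back to a BigInteger for presentation.
  public BigInteger toBigInteger()
  {
    BigInteger result = BigInteger.ZERO;
    for (int i = words.length - 1; i >= 0; i--) {
      result = result.shiftLeft(32).or(BigInteger.valueOf(words[i] & 0xFFFFFFFFL));
    }
    return result;
  }

  public static void main(String[] args)
  {
    MutableUnscaledSketch acc = new MutableUnscaledSketch(4);
    acc.addInPlace(Long.MAX_VALUE); // large values accumulate without per-row allocations
    acc.addInPlace(1L);
    System.out.println(acc.toBigInteger()); // 9223372036854775808
  }
}
```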

jihoonson commented 3 years ago

Oh I see. Looking forward to seeing it!

himanshug commented 3 years ago

In the extension that @a2l007 mentioned (based on what I recall and work done when I was at Yahoo), the "big number" is stored as a primitive long where the last 2 digits are assumed to be the 2 digits to the right of the decimal point. That gives you a "big number" with 2 digits of fractional precision, which was sufficient for the use case we had.
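
A minimal sketch of that fixed-point idea, assuming a scale of 2 fractional digits (the class and method names below are illustrative, not the extension's actual API):

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

// Minimal sketch of the fixed-point approach described above: the value is kept
// in a primitive long with an implied scale of 2 fractional digits.
public final class FixedPointLong
{
  private static final int SCALE = 2; // two digits to the right of the decimal point

  // Encode a decimal string such as "123.45" into a scaled long (12345).
  public static long encode(String decimal)
  {
    return new BigDecimal(decimal)
        .setScale(SCALE, RoundingMode.HALF_UP)
        .unscaledValue()
        .longValueExact();
  }

  // Aggregation is plain long addition on the scaled representation.
  public static long add(long a, long b)
  {
    return a + b;
  }

  // Decode a scaled long back into a BigDecimal for presentation.
  public static BigDecimal decode(long scaled)
  {
    return BigDecimal.valueOf(scaled, SCALE);
  }

  public static void main(String[] args)
  {
    long total = add(encode("19.99"), encode("0.01"));
    System.out.println(decode(total)); // prints 20.00 exactly
  }
}
```

Since aggregation reduces to plain long addition, this stays close to the performance of the primitive long aggregators.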

FWIW, I also did a prototype using Java's BigDecimal data type that was too slow for the use case. I also tried creating a custom BigDecimal-like data type using two longs (one for the integer part and one for the fractional part), which also couldn't match the performance of the simpler approach above.

jihoonson commented 3 years ago

> FWIW, I also did a prototype using Java's BigDecimal data type that was too slow for the use case. I also tried creating a custom BigDecimal-like data type using two longs (one for the integer part and one for the fractional part), which also couldn't match the performance of the simpler approach above.

@himanshug thanks for sharing. I was thinking the same. Did the performance fall short only for your particular requirements, or was it slow in general?

himanshug commented 3 years ago

@jihoonson Java's BigDecimal is immutable, so the "aggregate" operation creates new objects unnecessarily. You also can't aggregate with BigDecimal in its serialized off-heap representation, so a naive off-heap BufferAggregator implementation would have to constantly deserialize the state stored off-heap into an on-heap BigDecimal object for aggregation and then serialize it back into the off-heap buffer. So using Java's BigDecimal is slow in general.
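
To make that cost concrete, here is a simplified sketch of the per-row round trip (plain ByteBuffer code with an assumed slot layout, not Druid's actual BufferAggregator interface):

```java
import java.math.BigDecimal;
import java.math.BigInteger;
import java.nio.ByteBuffer;

// Illustration of the round trip a naive off-heap BigDecimal aggregator pays on
// every row: read bytes -> rebuild an on-heap BigDecimal -> add -> write bytes back.
public final class NaiveOffHeapDecimalAdd
{
  // Assumed slot layout for the sketch: 4-byte scale, 4-byte length, then the unscaled bytes.
  static void write(ByteBuffer buf, int position, BigDecimal value)
  {
    byte[] unscaled = value.unscaledValue().toByteArray();
    buf.putInt(position, value.scale());
    buf.putInt(position + 4, unscaled.length);
    for (int i = 0; i < unscaled.length; i++) {
      buf.put(position + 8 + i, unscaled[i]);
    }
  }

  static BigDecimal read(ByteBuffer buf, int position)
  {
    int scale = buf.getInt(position);
    int len = buf.getInt(position + 4);
    byte[] unscaled = new byte[len];
    for (int i = 0; i < len; i++) {
      unscaled[i] = buf.get(position + 8 + i);
    }
    return new BigDecimal(new BigInteger(unscaled), scale);
  }

  // One "aggregate" call: the deserialize/add/serialize churn happens per row.
  static void aggregate(ByteBuffer buf, int position, BigDecimal row)
  {
    BigDecimal current = read(buf, position); // off-heap bytes -> on-heap object
    BigDecimal updated = current.add(row);    // immutable add allocates a new object
    write(buf, position, updated);            // on-heap object -> off-heap bytes again
  }

  public static void main(String[] args)
  {
    ByteBuffer buf = ByteBuffer.allocateDirect(64);
    write(buf, 0, BigDecimal.ZERO);
    aggregate(buf, 0, new BigDecimal("19.99"));
    aggregate(buf, 0, new BigDecimal("0.01"));
    System.out.println(read(buf, 0)); // 20.00
  }
}
```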

One line of thought was to create a custom BigDecimal implementation with code/algorithm pretty much the same as Java's BigDecimal, but mutable and with an equivalent off-heap representation on which operations could be performed directly (e.g. the way we do things for theta sketches). That complexity wasn't worth the effort/time, given that our use case was satisfied by the simpler approach described in the earlier comment.

jihoonson commented 3 years ago

@himanshug oh sorry, in your last comment I missed that the performance wasn't as good as the simpler approach :sweat_smile: I'm considering adding a new Decimal data type that would be stored in 2 integers, along with all the type-specific aggregators, but I haven't written any code yet. So I've been wondering what could impact query performance in general with this approach. Thanks for sharing your experience.

a2l007 commented 3 years ago

FYI, the proposal for the CompressedBigDecimal extension is up: #10668

github-actions[bot] commented 10 months ago

This issue has been marked as stale due to 280 days of inactivity. It will be closed in 4 weeks if no further activity occurs. If this issue is still relevant, please simply write any comment. Even if closed, you can still revive the issue at any time or discuss it on the dev@druid.apache.org list. Thank you for your contributions.

usberic commented 9 months ago

Hi, I'm working on an application that aggregates a few currency-based fields, and I was experimenting with this data type extension. However, part of our application requires Druid SQL queries, which do not seem to support CompressedBigDecimal. I understand native queries can perform the aggregations, but is there any plan to support Druid SQL queries? Specifically, I'm trying to sum certain columns, but Druid SQL's SUM operator does not seem to support complex data types.

github-actions[bot] commented 3 days ago

This issue has been marked as stale due to 280 days of inactivity. It will be closed in 4 weeks if no further activity occurs. If this issue is still relevant, please simply write any comment. Even if closed, you can still revive the issue at any time or discuss it on the dev@druid.apache.org list. Thank you for your contributions.