elastic / elasticsearch

Free and Open, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Add BigDecimal data type #46934

Open tzachshabtay opened 4 years ago

tzachshabtay commented 4 years ago

BigDecimal data type:

For a financial application, we need maximum accuracy and arbitrary precision, even at the expense of performance. scaled_float doesn't cut it for us (we need numbers bigger than 100 billion and with more than 8 digits after the decimal point, so there is no scaling factor that can work), and double obviously isn't accurate enough for financial calculations since it's a floating-point representation.

The values must be passed as strings in the JSON and then parsed and stored directly as BigDecimal in Elasticsearch, since strings are the only way to pass financial numbers with arbitrary precision accurately in a JSON document.
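
For illustration, a minimal plain-Java sketch (not Elasticsearch code) of why the string round-trip matters: a value like the one below has too many significant digits for a double, but survives unchanged when parsed from the string into a BigDecimal.

```java
import java.math.BigDecimal;

public class PrecisionDemo {
    public static void main(String[] args) {
        // The value arrives in the JSON document as a string.
        String raw = "100000000000.00000001";

        // Routing it through a double (what a numeric JSON field effectively does) drops the fraction:
        double asDouble = Double.parseDouble(raw);
        System.out.println(new BigDecimal(asDouble)); // 100000000000  (the .00000001 is gone)

        // Parsing the string directly preserves every digit:
        BigDecimal exact = new BigDecimal(raw);
        System.out.println(exact); // 100000000000.00000001
    }
}
```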

jbaiera commented 4 years ago

Is there any reason these values would need to be stored as BigDecimal in Elasticsearch? Could regular keywords work well enough? There has been interest from the community in the past around adding Big(Numeric) type support to mappings, but oftentimes using string data directly is effective enough for most applications.

For reference, this issue (https://github.com/elastic/elasticsearch/issues/17006) contains some discussions around BigInteger and BigDecimal support.

elasticmachine commented 4 years ago

Pinging @elastic/es-search

tzachshabtay commented 4 years ago

Could regular keywords work well enough?

I don't think so, but maybe I don't understand the suggestion. Here's a simple use-case to illustrate: imagine an index for trades, where each trade has an account id (string) and money (BigDecimal, hopefully). For that we have an aggregation in Grafana/Kibana that sums the money for each account id. Would having money as a keyword help make this accurate? How? Is it converted from string to BigDecimal for the calculations?
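
To make the concern concrete, a plain-Java sketch (not an actual Elasticsearch aggregation) of what happens when the per-account sum is accumulated in doubles instead of in BigDecimals parsed from the document strings:

```java
import java.math.BigDecimal;

public class SumDemo {
    public static void main(String[] args) {
        // Ten trades of 0.1 for the same account id.
        double doubleTotal = 0.0;
        BigDecimal exactTotal = BigDecimal.ZERO;
        for (int i = 0; i < 10; i++) {
            doubleTotal += 0.1;                                  // binary floating point drifts
            exactTotal = exactTotal.add(new BigDecimal("0.1"));  // parsed from the string in the document
        }
        System.out.println(doubleTotal); // 0.9999999999999999
        System.out.println(exactTotal);  // 1.0
    }
}
```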

giladgal commented 4 years ago

Thank you for making this suggestion. We are not interested in developing support for BigDecimal. Further information about our approach towards support for high-accuracy data types can be found in other issues such as Support for BigInteger and BigDecimal #17006 (mentioned above), Add support for 64-bit unsigned integers #32434, and Support for Ethereum data type (uint256) #38242.

We are interested in fully understanding your use case, since in the past we have occasionally found that what starts as a general request for BigDecimal can be answered either with high-precision data types of more limited precision or with alternative solutions. In addition, insight into the use case can help us with future roadmap decisions.

tzachshabtay commented 4 years ago

Hi @giladgal, "high accuracy" data types are not enough; we need "complete accuracy", i.e. no precision loss. Any functionality that requires any sort of accounting (keeping track of ledgers, etc.) cannot afford any data loss. I have read through the linked issues, but I did not see any solution. The only thing that's relevant for my use-case in those issues is probably this comment, which echoes my thoughts exactly (minus the remark about the JSON spec, I'm referring to the sentiment): https://github.com/elastic/elasticsearch/issues/17006#issuecomment-249375015

I have described a simple use-case above, but I can expand on it a bit. We're a clearing firm that keeps track of trades from different accounts. We might do a lot of different calculations on those trades (like calculating profits, fees, risk, exposure, margin, and different regulatory requirements) and they need to be precise. You don't want your bank to charge you a fee of $0.20000001 instead of $0.20, right? So we need arbitrary precision to be absolutely sure these things won't happen, and we need no floating-point math involved at all. Having genuinely tiny numbers like 0.00000001 (i.e. not due to floating-point errors) is a real possibility.

We have processes that compare different workflows and aggregate numbers to check that we get the same results. If one workflow gives us 1000.0005 and the other gives us 1000.000501, that issues a real-time alert to our ops team; if we use Elastic they will get a lot of false alerts, and we can't compare with a "tolerance" because a small difference might be a real issue. These comparisons aggregate numbers across a lot of trades, so the totals can be huge (like I said in the first post, over 100 billion).

Combining big numbers and high decimal precision rules out scaled_float as a possible solution (and that's without mentioning the suspicious fact that even scaled_float seems to use floating-point math, for example here: https://github.com/elastic/elasticsearch/blob/e2b9c1b764d6fa495b4e78f414f810995019e856/modules/mapper-extras/src/main/java/org/elasticsearch/index/mapper/ScaledFloatFieldMapper.java#L300, even though in other places it does look like it's using BigDecimal. Anyway, just an observation, it's beside the point really).
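
As a sketch of the arithmetic (not the actual ScaledFloatFieldMapper code), assuming the usual scaled_float idea of storing value * scaling_factor in a signed 64-bit long: keeping 8 decimal places for a value above 100 billion already overflows the long.

```java
import java.math.BigDecimal;
import java.math.BigInteger;

public class ScalingFactorDemo {
    public static void main(String[] args) {
        // To keep 8 decimal places, a scaled_float-style encoding multiplies by 10^8
        // and stores the result as a signed 64-bit long.
        BigDecimal value = new BigDecimal("100000000000.00000001"); // > 100 billion, 8 decimals
        BigInteger scaled = value.movePointRight(8).toBigIntegerExact();

        System.out.println(scaled);                            // 10000000000000000001
        System.out.println(BigInteger.valueOf(Long.MAX_VALUE)); // 9223372036854775807
        System.out.println(scaled.bitLength() > 63);           // true: does not fit in a long
    }
}
```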

In summary, why do you need a decimal type in Elasticsearch? For the same reason you need it in any programming language or database: when you can't afford precision loss (i.e. in a lot of financial, accounting, science, and math applications).

tanishq-dubey commented 4 years ago

@jbaiera

Could regular keywords work well enough?

You cannot aggregate across keywords.

but often times using string data directly is effective enough for most applications

Sure, when you are not doing any sort of numerical operation on the data.


@giladgal Issue #38242 is a perfect example of why your argument doesn't hold. Using a high-precision number that is above the bounds of longValue * (double)scalingFactor breaks right away.
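
As a plain-Java illustration (not the actual scaled_float decode path): once a stored value has more than 53 significant bits, any round trip through a double silently changes it.

```java
public class DoubleCastDemo {
    public static void main(String[] args) {
        // Any long with more than 53 significant bits cannot survive a round trip through double.
        long stored = 9_007_199_254_740_993L;  // 2^53 + 1
        double decoded = (double) stored;      // nearest representable double is 2^53
        System.out.println(stored);            // 9007199254740993
        System.out.println((long) decoded);    // 9007199254740992
    }
}
```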

From the other issues you linked, I see that one of the primary concerns is performance. To me, this seems like it should be a user decision: if I as the user choose to use an arbitrary precision type, at the cost of performance, that should be my decision, while the compute back end does the work.

jimczi commented 4 years ago

I understand your frustration, but performance is not the only reason why we don't expose BigDecimal in metric aggregations. We chose doubles over any other representation to ensure that we have a single type of source to handle: any field used in a metric aggregation is first cast/transformed into a double so that we can have a single implementation for all of them. We are an analytics search engine, and a lot of aggregations are based on the fact that an approximation is fine as long as we know the error bound. Changing the representation to allow BigDecimal would be a huge amount of work (basically duplicating every aggregation that uses doubles) and would make the cost of maintenance higher.
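
For readers following along, a deliberately over-simplified, hypothetical sketch (nothing like Elasticsearch's real aggregation framework) of what "a single type of source" buys: one sum implementation works for every numeric field because everything is reduced to doubleValue().

```java
// Hypothetical, heavily simplified sketch; not Elasticsearch's actual aggregation code.
final class DoubleSumAggregator {
    private double total;

    void collect(Number anyNumericFieldValue) {
        // The single common representation every field type is converted to.
        total += anyNumericFieldValue.doubleValue();
    }

    double result() {
        return total;
    }

    public static void main(String[] args) {
        DoubleSumAggregator agg = new DoubleSumAggregator();
        agg.collect(42L);        // long field
        agg.collect(0.5f);       // float field
        agg.collect(0.25);       // double field
        System.out.println(agg.result());
    }
}
```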

Not to mention performance again: any new metric aggregation would need to be written in multiple flavors, one for the common representation and one for BigDecimal.

If the goal is to allow a financial application to perform accurate arbitrary-precision computation, I'd say that Elasticsearch is probably not the right tool for it. However, we could consider adding BigDecimal support for range queries and range aggregations. These features are isolated per type, so we can introduce new formats more easily, and we also have the data structures to allow some efficiency in retrieval.

I know that's not what you expect, but I wanted to mention it instead of closing this issue, because we have no plan to work on support in metric aggregations. So the question I have for the followers of this issue is: would it be useful to have support for range queries and range aggregations on BigDecimal if we don't support it in metric aggregations?

sachnk commented 4 years ago

@jimczi I wanted to add some more context to the use case mentioned by @tzachshabtay as I think it could be useful in reflecting on the decision to unconditionally use doubles for all aggregations.

Our primary storage for all numbers is Postgres. We use its NUMERIC type for arbitrary precision. We then feed this Postgres data into Elasticsearch to perform business analytics akin to what's advertised in Elasticsearch's financial analytics marketing material. When this transfer happens, we see a loss of fidelity when we aggregate, because Elasticsearch converts everything to a double, even if we use scaled_float.

I don't think I fully understand the reasoning behind the assumption that all aggregations are approximations. That assumption really breaks down as financial firms adopt both Elasticsearch and crypto, which operates with extremely high levels of precision. Wouldn't it be better if the aggregation engine used generics, so that it could operate in either full-fidelity or approximate-fidelity mode as required by the user?

You mentioned that you would need to duplicate all aggregations to support BigDecimal -- wouldn't using generics and/or well-placed abstractions help here?
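
To illustrate the suggestion (a hypothetical sketch only, not an existing or proposed Elasticsearch API), an aggregation written against an abstract accumulator could be instantiated with either a double backend or a BigDecimal backend:

```java
import java.math.BigDecimal;
import java.util.function.BinaryOperator;

// Hypothetical sketch of the "generics" idea: the sum logic is written once,
// and the caller picks an approximate (double) or exact (BigDecimal) backend.
final class SumAccumulator<T> {
    private T total;
    private final BinaryOperator<T> add;

    SumAccumulator(T zero, BinaryOperator<T> add) {
        this.total = zero;
        this.add = add;
    }

    void collect(T value) {
        total = add.apply(total, value);
    }

    T result() {
        return total;
    }

    static SumAccumulator<Double> approximate() {
        return new SumAccumulator<>(0.0, Double::sum);
    }

    static SumAccumulator<BigDecimal> exact() {
        return new SumAccumulator<>(BigDecimal.ZERO, BigDecimal::add);
    }

    public static void main(String[] args) {
        SumAccumulator<BigDecimal> exact = exact();
        for (int i = 0; i < 10; i++) {
            exact.collect(new BigDecimal("0.1"));
        }
        System.out.println(exact.result()); // 1.0
    }
}
```

Of course this only sketches the arithmetic layer; as noted above, the doc-value storage and every existing aggregation would still need a non-double code path, which is where the maintenance cost comes from.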

jpountz commented 4 years ago

Some more questions to better understand the requirement:

- Would it be fine to enforce the same scale for all values of a field? The issue description suggests no but I'd like to confirm.
- Do you actually need numbers greater than 100 billion with 8 decimal places? The reason I'm asking is that a signed fixed-point number using 64 bits and providing 8 decimal places could store numbers up to about 92 billion. And being able to use 64-bit longs under the hood would be convenient/fast.
- What aggs would you need? The issue description suggests sum, I suspect min/max would be useful too, any other agg?

sachnk commented 4 years ago

Apologies for the delay in getting back to you. Answers below:

Would it be fine to enforce the same scale for all values of a field? The issue description suggests no but I'd like to confirm.

For a given column, I think having the same scale for all values would be okay.

Do you actually need numbers greater than 100 billion with 8 decimal places? The reason I'm asking is that a signed fixed-point number using 64 bits and providing 8 decimal places could store numbers up to about 92 billion. And being able to use 64-bit longs under the hood would be convenient/fast.

I've personally used the 64-bit integer technique for storing fixed-point numbers in the past. When I tried that in crypto, it fell on its face because the range assumptions don't hold. I think you would need a true arbitrary-precision class that isn't limited by 64-bit storage. If implemented with well-placed abstractions, this arbitrary-precision class could be backed by a 64-bit integer when the range allows. That way, you don't lose performance if you don't need true arbitrary precision.
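
A quick check of the arithmetic (plain Java, made-up comparison value): with 8 implied decimal places, a signed 64-bit long tops out just above 92 billion, which is exactly the kind of limit the values in this thread exceed.

```java
import java.math.BigDecimal;

public class FixedPointRangeDemo {
    public static void main(String[] args) {
        // With 8 implied decimal places, the largest value a signed 64-bit long can carry is:
        BigDecimal max = BigDecimal.valueOf(Long.MAX_VALUE).movePointLeft(8);
        System.out.println(max); // 92233720368.54775807  (about 92 billion)

        // The value mentioned earlier in this thread already exceeds that range:
        System.out.println(new BigDecimal("100000000000.00000001").compareTo(max) > 0); // true
    }
}
```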

What aggs would you need? The issue description suggests sum, I suspect min/max would be useful too, any other agg?

To start off with, we'd be happy with sum, min, and max.

jmouvill commented 4 years ago

Hi,

We also have roughly the same requirements as @sachnk and @tzachshabtay. We are dealing with crypto financial data. For "tick"-level data, we don't mind storing the information as strings so there is no loss of precision. Nevertheless, we would like to do different types of aggregations with it, and precision loss is not really an option for us since we are dealing with very high-precision decimal numbers.

We would mostly need the average and the weighted average. Min and max would also be nice, but those are a lower priority for us.
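
For what it's worth, a plain-Java sketch with made-up tick values of the kind of weighted average we mean: the inputs stay exact, and rounding only enters at the final division.

```java
import java.math.BigDecimal;
import java.math.MathContext;

public class WeightedAverageDemo {
    public static void main(String[] args) {
        // Hypothetical tick data: (price, size) pairs parsed from strings.
        String[][] ticks = {
            {"0.00000001234", "1500000"},
            {"0.00000001236", "2750000"},
            {"0.00000001231",  "900000"},
        };

        BigDecimal weightedSum = BigDecimal.ZERO;
        BigDecimal totalWeight = BigDecimal.ZERO;
        for (String[] tick : ticks) {
            BigDecimal price = new BigDecimal(tick[0]);
            BigDecimal size = new BigDecimal(tick[1]);
            weightedSum = weightedSum.add(price.multiply(size)); // exact
            totalWeight = totalWeight.add(size);                 // exact
        }

        // Division may be non-terminating, so a rounding context is needed only at this final step.
        BigDecimal vwap = weightedSum.divide(totalWeight, MathContext.DECIMAL128);
        System.out.println(vwap);
    }
}
```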

So a big +1 for having the BigDecimal type in ES, even though I fully understand that it requires a lot of refactoring on your side.

elasticmachine commented 4 years ago

Pinging @elastic/es-analytics-geo (:Analytics/Aggregations)

not-napoleon commented 3 years ago

Relates to #60149

elasticsearchmachine commented 1 year ago

Pinging @elastic/es-analytics-geo (Team:Analytics)

elasticsearchmachine commented 1 year ago

Pinging @elastic/es-search (Team:Search)

kangwen commented 1 year ago

Example: blockchain. ETH vs. its minimum unit, Wei: 1 ETH = 10^18 Wei, a very big number.
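
A small plain-Java illustration (hypothetical balance): amounts expressed in Wei overflow a signed 64-bit long almost immediately, so arbitrary-precision integers/decimals are needed end to end.

```java
import java.math.BigDecimal;
import java.math.BigInteger;

public class WeiDemo {
    public static void main(String[] args) {
        // 1 ETH = 10^18 Wei, and on-chain balances are 256-bit unsigned integers.
        BigInteger weiPerEth = BigInteger.TEN.pow(18);

        // Even a modest balance of 100 ETH, expressed in Wei, no longer fits in a signed 64-bit long:
        BigInteger balanceWei = weiPerEth.multiply(BigInteger.valueOf(100)); // 10^20
        System.out.println(balanceWei.bitLength() > 63); // true

        // Converting back to ETH without precision loss needs arbitrary-precision decimals:
        System.out.println(new BigDecimal(balanceWei).movePointLeft(18)); // 100.000000000000000000
    }
}
```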

Laurianti commented 4 months ago

Any news here?

benwtrent commented 4 months ago

@Laurianti no news. This isn't as simple as adding a new numerical type. We would need to adjust how these values are stored and decoded, as Lucene itself doesn't really support accurate decimal number storage at these extreme values.

So, there are three problems:

No movement has been made on this recently.

elasticsearchmachine commented 1 week ago

Pinging @elastic/es-analytical-engine (Team:Analytics)