One thing to note here is that our points support is for fixed-width types.
In other words, the BigIntegerPoint in Lucene is a little misleading; it does not in fact support "Immutable arbitrary-precision integers".
Instead it's a signed 128-bit integer type, more like a long long. If you try to give it a too-big BigInteger you get an exception! But otherwise BigInteger is a natural API for the user to provide a 128-bit integer.
On the other hand, if someone wanted to add support for a 128-bit floating point type, it's of course possible, but I have my doubts whether BigDecimal is even the right Java API for that (BigDecimal is a very different thing than a quad-precision floating point type).
I already see some confusion (e.g. "lossless storage") in references to this issue, so I think it's important to disambiguate a little.
Maybe names like BigInteger/BigDecimal should be avoided with these, but that's part of why the thing is in the sandbox; we can change that (e.g. to LongLongPoint).
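To make the fixed-width limit concrete, here is a minimal sketch of how the sandbox BigIntegerPoint could be used, assuming the LUCENE-7043 API (field constructor plus newRangeQuery); exact class locations and behavior may differ by Lucene version:

```java
import java.math.BigInteger;

import org.apache.lucene.document.BigIntegerPoint;
import org.apache.lucene.document.Document;
import org.apache.lucene.search.Query;

public class BigIntegerPointSketch {
    public static void main(String[] args) {
        Document doc = new Document();

        // 2^127 - 1 still fits in the signed 128-bit, fixed-width 16-byte encoding.
        BigInteger max128 = BigInteger.ONE.shiftLeft(127).subtract(BigInteger.ONE);
        doc.add(new BigIntegerPoint("id", max128));

        // Range query over the fixed-width encoding.
        Query q = BigIntegerPoint.newRangeQuery("id", BigInteger.ZERO, max128);

        // Anything wider than 128 bits is rejected despite the BigInteger API:
        // new BigIntegerPoint("id", BigInteger.ONE.shiftLeft(128)); // IllegalArgumentException
    }
}
```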
Thanks for the heads up @rmuir - I was indeed unaware of that.
I'd like to collect more information about use-cases before we start implementing this type. For instance I think the natural decision would be to use SORTED_SET doc values, but if the main use-case is to run stats aggregations, this won't work, and so the fact that we have a long long type will probably be confusing since users won't be able to run the operations that they expect to work.
I agree: we did some digging the other day.
One cause of confusion is that many databases have a bigint type which is really a 64-bit long! So I'm concerned about people using a too-big type when it's not needed due to naming confusion.
Also we have the challenge of how such numbers would behave in e.g. scripting and other places. Personally, I've only used BigInteger for cryptography-like things. You can see from its API that it's really geared at that. So maybe it's not something we should expose?
@jpountz:
I'd like to collect more information about use-cases before we start implementing this type. For instance I think the natural decision would be to use SORTED_SET doc values, but if the main use-case is to run stats aggregations, this won't work (…)
Sorry for my newb questions, but why wouldn't this work? Aren't stats aggregations done with floats possibly inaccurate due to floating-point arithmetic?
They can be inaccurate indeed.
The point I was making above is that Lucene provides two ways to encode doc values. On the one hand, we have SORTED_SET, which assigns an ordinal to every value per segment. This way you can efficiently sort and run terms, cardinality or range aggregations, since these operations can work directly on the ordinals. However, the cost of resolving a value given an ordinal is high enough that it would make anything that needs access to the actual values, such as a stats aggregation, slow. On the other hand, there is BINARY, which just encodes the raw binary values in a column-stride fashion. This would be slower for sorting and terms/cardinality/range aggregations, but reading the original values would be faster than with SORTED_SET, so we could theoretically run e.g. stats aggregations or use the values in scripts.
So knowing about the use-cases will help figure out which format to use. But then if we want to leverage all 128 bits of the values, we will have to duplicate implementations for everything that needs to add or multiply values such as stats/sum/avg aggregations. This would be an important burden in terms of maintenance so we would certainly not want to go that route without making sure that there are valid/common use-cases for it first.
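As an illustration of the trade-off (my own sketch, not how Elasticsearch is implemented), here is how the two Lucene doc-values encodings could carry a hypothetical 128-bit value stored as a fixed 16-byte array:

```java
import java.math.BigInteger;

import org.apache.lucene.document.BinaryDocValuesField;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.SortedSetDocValuesField;
import org.apache.lucene.util.BytesRef;

public class DocValuesChoiceSketch {
    /** Left-pad a non-negative BigInteger into a fixed 16-byte big-endian encoding. */
    static byte[] encode128(BigInteger value) {
        byte[] raw = value.toByteArray();   // minimal two's-complement encoding
        if (raw.length > 16) {
            throw new IllegalArgumentException("value needs more than 128 bits");
        }
        byte[] fixed = new byte[16];
        System.arraycopy(raw, 0, fixed, 16 - raw.length, raw.length);
        return fixed;                       // fixed width: byte order matches numeric order for non-negative values
    }

    public static void main(String[] args) {
        BytesRef bytes = new BytesRef(encode128(new BigInteger("18446744073709551616"))); // 2^64
        Document doc = new Document();

        // Option 1: ordinal-based; fast sort/terms/cardinality/range, slow to materialize values.
        doc.add(new SortedSetDocValuesField("big_id", bytes));

        // Option 2: raw column-stride bytes; faster to read back (stats, scripts), slower to sort.
        doc.add(new BinaryDocValuesField("big_id_raw", bytes));
    }
}
```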
This feature would be useful for the Digital Forensics and Incident Response (DFIR) community. There are lots of data structures we look at that have uint64 types. When we index these, if the field is treated as a long and the value is out of range, information can be lost.
I see a 64-bit unsigned integer type (versus the 64-bit signed type we have) as a separate feature actually. This can be implemented more efficiently with Lucene (and made easier with Java 8).
Yeah, figuring out how to make a 64-bit unsigned type work efficiently in say, the scripting API might be a challenge as it stands today. Perhaps it truly must be a Number backed by BigInteger to work the best today, which would be slower.
But in general, typical things such as ranges and aggregations would be as fast as the 64-bit signed type we have today, and perhaps a newer scripting api (with more type information) could make scripting faster too down the road, so it is much more compelling than larger integers (e.g. 128-bit), which will always be slower.
Use cases where BigInteger is truly needed are less clear to me. I would like for us to consider the two cases (64-bit unsigned vs. larger integers) as separate.
@rmuir it's surprising to me that you have to ask for cases where BigDecimal (i.e. a decimal representation with arbitrary precision) would be needed, as much data science/analytics work requires exact representations of the source data without loss of precision. If putting my data into ES means that I am necessarily going to lose precision, that's a non-starter for many uses. Nothing in the JSON spec suggests this. In fact, it expressly mentions that numerics are arbitrary precision and it is up to the various libraries to represent that properly.
Nothing in the JSON spec suggests this. In fact, it expressly mentions that numerics are arbitrary precision and it is up to the various libraries to represent that properly.
This is not correct; the spec says:
This specification allows implementations to set limits on the range and precision of numbers accepted.
You are correct that numerics in the JSON spec are arbitrary precision, but nothing in the spec requires implementations to support this, and in fact they do not have to.
The spec further says:
Since software that implements IEEE 754-2008 binary64 (double precision) numbers [IEEE754] is generally available and widely used, good interoperability can be achieved by implementations that expect no more precision or range than these provide, in the sense that implementations will approximate JSON numbers within the expected precision.
@jasontedor I was referring to ECMA-404, but regardless, my point is that the Elasticsearch documentation specifically says that _source, for example, contains the original JSON message verbatim and is used for search results. I think you'd have to heavily amend statements like that in the documentation to explicitly describe how JSON numbers are handled internally in ES.
You also cut your quoting of the spec short, as the entire paragraph is:
This specification allows implementations to set limits on the range and precision of numbers accepted. Since software that implements IEEE 754-2008 binary64 (double precision) numbers [IEEE754] is generally available and widely used, good interoperability can be achieved by implementations that expect no more precision or range than these provide, in the sense that implementations will approximate JSON numbers within the expected precision. A JSON number such as 1E400 or 3.141592653589793238462643383279 may indicate potential interoperability problems, since it suggests that the software that created it expects receiving software to have greater capabilities for numeric magnitude and precision than is widely available.
This is exactly what I'm referring to, as "the software that created it" (i.e. a client) has no reason to suspect, based on the documentation, that either of these values would lose precision.
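For illustration only (not part of the exchange above), the binary64 limit the spec alludes to is easy to hit in Java: any integer above 2^53 can silently change when round-tripped through a double.

```java
public class DoublePrecisionDemo {
    public static void main(String[] args) {
        long exact = (1L << 53) + 1;          // 9007199254740993
        double asDouble = exact;              // nearest representable binary64 value
        System.out.println(exact);            // 9007199254740993
        System.out.println((long) asDouble);  // 9007199254740992, silently off by one
    }
}
```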
@jeffknupp
it's surprising to me that you have to ask for cases where BigDecimal would be needed
We are asking for use-cases because depending on the expectations, the feature could be implemented in very different ways.
For instance a MySQL BIGINT is just a 64-bit integer, which we already support with the long type. We do not support unsigned numbers, but if that is a common need, then this could be something we could fix and support efficiently.
If the use-case requires more than 64 bits (e.g. 128), then things are more complicated. We could probably support efficient sorting, but aggregations would be tricky.
If arbitrary precision is needed, then there is not much we can do efficiently, at least at the moment.
I was referring to ECMA-404
The JSON spec only spells out the representation in JSON which is used for interchange, it is completely agnostic to how such information is represented by software consuming such JSON.
I think you'd have to heavily amend statements like that in the documentation to explicitly describe how JSON numbers are handled internally in ES.
The documentation spells out the numeric datatypes that are supported.
Here is a good example.
Windows uses the USN Journal to record changes made to the file system. These records are extremely important "logs" for people in the DFIR community.
Version 2 records use a 64-bit unsigned integer to store reference numbers.
Version 3 records use a 128-bit ordinal number for reference numbers.
For instance a MySQL BIGINT is just a 64-bit integer, which we already support with the long type. We do not support unsigned numbers, but if that is a common need, then this could be something we could fix and support efficiently.
I would say that this is important for the DFIR community.
If the use-case requires more than 64 bits (e.g. 128), then things are more complicated. We could probably support efficient sorting, but aggregations would be tricky.
I would say this is equally as important.
There are many other logs that record these references; by maintaining their native types we can correlate logs to determine certain types of activity.
Use cases where BigInteger is truly needed, to me that situation is less clear. I would like for us to consider the two cases (64-bit unsigned vs larger integers) as separate.
Should we go ahead and create a new issue for 64-bit unsigned type as a feature?
For instance a MySQL BIGINT is just a 64-bit integer, which we already support with the long type. We do not support unsigned numbers, but if that is a common need, then this could be something we could fix and support efficiently.
I'm also in the digital forensics world and see merit in providing a 64-bit unsigned type. If it were 128-bit with a speed impact, it wouldn't affect the way in which I process data. My use is less real-time and more one-time bulk processing. The biggest factor to me would be what makes the most sense from the developer side with respect to Java and OS integration.
Spring Data JPA supports BigInteger and BigDecimal, so any code where you also try to use Elasticsearch with them will fail:
```java
/** Spring Data Elasticsearch repository for the Task entity. */
public interface TaskSearchRepository extends ElasticsearchRepository<Task, BigInteger> {
    // THIS COMPILES BUT FAILS ON INIT
}

/** Spring Data JPA repository for the Task entity. */
@SuppressWarnings("unused")
public interface TaskRepository extends JpaRepository<Task, BigInteger> {
    // THIS IS OK
}
```
I think a hack (that may end up being almost as efficient) is to convert my BigInteger to a string for use with Elasticsearch:
```java
/** Spring Data Elasticsearch repository for the Task entity. */
public interface TaskSearchRepository extends ElasticsearchRepository<Task, String> {
    // HACK: convert BigInteger to a string when saving to Elasticsearch...
}
```
So these data types should be added in my opinion.
We also need something like this; we are unable to store C#'s Decimal.MaxValue currently.
On use cases, I see DFIR and USN mentioned. Would either of these use cases use aggregations, or just search and sorting? If you see aggregations necessary, can you state which ones and what the use case for that is?
Apologies if I am oversimplifying, but it seems like:
If search and sort is enough, and no aggregations are needed, I wonder if there is even a need for a 128-bit numeric type -- could strings be enough for these use cases, even if they may have speed differences from a (theoretical) 128-bit type?
(Oops, clicked the wrong button. Reopened.)
If search and sort is enough, and no aggregations are needed, I wonder if there is even a need for a 128-bit numeric type -- could strings be enough for these use cases, even if they may have speed differences from a (theoretical) 128-bit type?
If the workload consists of exact search and sort (no aggs), then strings are the way to go indeed. The reason why I am interested in the workload is that Lucene provides better ways to index the data if range queries are important, but this only works with fixed-size data up to 16 bytes / 128 bits.
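As a hedged sketch of what those "better ways to index" could look like, Lucene's BinaryPoint indexes fixed-width byte arrays (within the 16-byte limit mentioned) and supports range queries over them; the 128-bit encoding and field name below are assumptions for illustration:

```java
import java.util.Arrays;

import org.apache.lucene.document.BinaryPoint;
import org.apache.lucene.document.Document;
import org.apache.lucene.search.Query;

public class FixedWidthRangeSketch {
    public static void main(String[] args) {
        // A hypothetical 128-bit file identifier, big-endian, zero-padded to 16 bytes
        // so unsigned byte order matches numeric order.
        byte[] id = new byte[16];
        id[15] = 42;

        Document doc = new Document();
        doc.add(new BinaryPoint("file_ref", id));   // fixed-width point, range-searchable

        byte[] lower = new byte[16];                // 0
        byte[] upper = new byte[16];
        Arrays.fill(upper, (byte) 0xFF);            // 2^128 - 1
        Query q = BinaryPoint.newRangeQuery("file_ref", lower, upper);
    }
}
```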
I second the DFIR-related cases and add that we are generally beholden to the data types present in our evidence - as more operating systems and applications move to use and store larger values, we must adapt. Truncating or throwing away information because of data type issues is dangerous.
A corollary use case has been identified in a number of forensic tools that cannot parse emoji or other Unicode characters. Another is the lack of IP address functionality for IPv6 addresses. If a tool parsing our source data cannot support those types, we lose critical context and content.
Given the incredibly valuable use case Elastic and friends have in the DFIR world, staying ahead of our source data is very important. I hope to see these data types included in the near future.
@philhagen Thanks for the information. I have some questions (similar to what I commented on above)
What data are you storing? What operations are you doing on this data?
If the operations are search-and-sort, you can achieve this today with no changes to Elasticsearch by telling Elasticsearch to map your large integers as strings. You'll get search capability and sort capability from this.
For full context, it would be helpful beyond "DFIR may use large numbers" to get more specific. What are the numbers you need to store, what are the properties and semantics of these numbers, and what operations are you needing to do across a large set of these numbers?
What data are you storing? Here are a couple of examples of data the DFIR community frequently stores in Elasticsearch.
Microsoft Windows stores FILETIME timestamps as 100-nanosecond intervals since January 1, 1601 (UTC) in a ULARGE_INTEGER (64-bit unsigned integer value).
The Windows update sequence number (USN) change journal struct contains a FILE_ID_128 struct that holds a 128-bit file identifier.
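As a rough sketch (my own, not from the thread) of why a signed long falls short for the first example, here is how a FILETIME could be decoded with Java 8's unsigned-long helpers; the class and method names are hypothetical:

```java
import java.time.Instant;

/** Hypothetical helper: decode a Windows FILETIME (100-ns ticks since 1601-01-01 UTC). */
public final class Filetime {
    // Seconds between 1601-01-01T00:00:00Z and the Unix epoch.
    private static final long EPOCH_DIFF_SECONDS = 11_644_473_600L;

    /** Treats all 64 bits as unsigned, which a plain Elasticsearch long mapping cannot. */
    public static Instant toInstant(long filetimeTicks) {
        long seconds = Long.divideUnsigned(filetimeTicks, 10_000_000L) - EPOCH_DIFF_SECONDS;
        long nanos = Long.remainderUnsigned(filetimeTicks, 10_000_000L) * 100L;
        return Instant.ofEpochSecond(seconds, nanos);
    }
}
```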
@jordansissel sure thing - always happy to help!
The store-as-string solution would be viable, but we often do range-based searches against these numbers (data transferred between x and y bytes, the 100ns-interval timestamps @danzek describes searched for before or after a given time, and so on).
I fully acknowledge that a good deal of this use case is hand-wavy and that this is not as helpful as a hard-core use case. All I can think of is that when source data is typed a certain way, tools that become the best and the most used/loved are those that accommodate the new data types.
I'm gonna close my eyes and hand-wave a USN suggestion as a workaround: that today's USN date has a set number of digits that won't add a new digit for a while (I haven't done the math, but it should be on the order of years), so relative range queries on USN-as-string should be OK given they all have the same [hand waving intensifies] number of digits?
(This comment is partly to suggest a specific workaround and partly for humor)
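A related workaround, offered as my own variation rather than something proposed above, is to zero-pad values to a fixed width before indexing them as strings, so lexicographic order matches numeric order even when digit counts differ:

```java
public class PaddedNumericString {
    /** Zero-pad an unsigned 64-bit value to 20 digits (the width of 2^64 - 1). */
    static String pad(long unsignedValue) {
        String digits = Long.toUnsignedString(unsignedValue);
        StringBuilder sb = new StringBuilder(20);
        for (int i = digits.length(); i < 20; i++) {
            sb.append('0');
        }
        return sb.append(digits).toString();
    }

    public static void main(String[] args) {
        // "00000000000000000009" < "00000000000000000010": string range queries now behave numerically.
        System.out.println(pad(9L));
        System.out.println(pad(10L));
    }
}
```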
I appreciate both workarounds and humor! :)
You're correct - for that particular value, the first digits would be sufficient (and I've taken to this exact approach in a few cases where really minute detail is not critical, truncating the number at a certain number of digits).
The big challenge though is not what we're seeing TODAY, it's what we might encounter in the future. I know that doesn't really lend itself to hard-sell use cases and prioritization of these issues. However, it's often the case that "we never needed to parse the …"
Not an easy task - but I hope we can shed some light on the potential importance for numerically handling the larger values. In time it will become core.
Howdy, I come from the DFIR world, but know Lucene well, so maybe I can bridge the divide a bit. ES is quickly becoming a handy tool: we deal with a variety of ever-changing structures, and it's usually not too hard to convert them to JSON and ingest them into ES for searching and sorting.
Many of these structures come from the filesystem. There we often encounter 64-bit and sometimes 128-bit integers. These can be inode-like file identifiers or file offsets. Most of the time these integers will be ordinal, so a varint encoding makes sense; the number may be bigger than 2^32 but rarely uses the full 64 bits (let alone 128 bits). Still, we are pedantic people and encounter really weird data from time to time that can make a mockery of attempts to treat 64-bit uints like 64-bit signed ints, etc. There may be other times the numbers could effectively be hashes/randomized, but those cases are more rare.
I am hard-pressed to think of a need for aggregations on such fields, with the exception of a "file size". However, the distribution would be heavily skewed to sizes < 2^32, so "> 4GB" could always be an option.
We also encounter weird timestamp formats (like the aforementioned NTFS FILETIME, that's 100ns increments since 1601; but there's Apple's Absolute Time, the number of seconds since 2001, usually stored as a double). With these timestamps, sometimes it's nice to be able to represent the timestamp as it was (an integer, a double, a string), but what is badly needed is an arbitrary precision timestamp type. This would ensure we could normalize the variety of timestamps we encounter so we could compare apples to apples, but without losing precision. In fact, when a high precision timestamp looks like it's been truncated (has had some fractional part zeroed), that is usually a mark of manipulation, which is very interesting to us.
The unfortunate news is that with an arbitrary precision timestamp, we'd need the usual array of aggregations. Other than knowing that a class of timestamps has low precision, I don't think we have much need for high-precision aggregations. We'll want to drill down by year/month/day/day of week/time of day, etc., but I doubt we need to break things down below a second.
It's only speculation on my part, but my guess is that the scientific world also has need for high-precision timestamps.
Our use case in DFIR is heavily batch-oriented, where we usually want to ingest a bulk data set as fast as possible, and then the indices are relatively static. I don't think anyone would notice if queries had to suffer in performance in exchange for improved precision: our other tools can take days to return results, so ES/Lucene's sub-second query times seem otherworldly. We wouldn't notice a few milliseconds lost there.
This could also solve the range issue with UUIDv4, which I would be a huge fan of.
This would be really appreciated because I need to be able to save arbitrary-precision timeseries values. I'm currently working around this using strings.
In which release of ES is the plan to support big decimals/big integers?
@Felk The specific area of higher-precision time is being tracked here: https://github.com/elastic/elasticsearch/issues/10005
@SKumarMN None for the moment; we are still trying to figure out the most common use-case for large numbers and to make sure we would be able to support them consistently. For instance, if this involves running numeric aggregations, then the best option would probably be to use doubles. On the other hand, we'd probably be able to support range queries and aggregations efficiently on big numbers. Also, how many bits are required is an interesting question. For instance, what some other databases call a big int is just an unsigned 64-bit long.
@jordansissel thanks, but I don't need high-precision timestamps, I need high-precision values. By "timeseries values" I meant tuples of a timestamp and a high-precision value.
For those in the Data Science community, it is very common to work with decimal values requiring very large (but not arbitrarily so) precision. If precision is lost (and done so without warning, as is the case today), the data is effectively useless. I care much more about the decimal than integer case, but it seems that the integer case is common in the DFIR realm. I would think those to be two compelling problem domains for ES to serve.
I just need an unsigned 64-bit long integer. This cannot be that hard to support since Java 8 supports it.
https://blogs.oracle.com/darcy/unsigned-integer-arithmetic-api-now-in-jdk-8
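For reference, the JDK 8 additions linked above look roughly like this (a minimal sketch; how they would surface in Elasticsearch mappings, scripts and aggregations is a separate question):

```java
public class UnsignedLongDemo {
    public static void main(String[] args) {
        long max = Long.parseUnsignedLong("18446744073709551615"); // 2^64 - 1, stored as -1L

        System.out.println(Long.toUnsignedString(max));            // prints the full unsigned value
        System.out.println(Long.compareUnsigned(max, 1L));         // > 0: ordered as unsigned
        System.out.println(Long.divideUnsigned(max, 2L));          // 9223372036854775807
    }
}
```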
Supporting unsigned integers should be relatively easy from a querying and sorting perspective I believe. Aggregations are a different story however.
I'm working with a use case where I have an index containing revisions of a piece of news content. Revisions are part of exactly one story, and the output of any query should only be the most recent revision within any given story group. It is simple enough to run queries against this index under the constraints, given that I can aggregate stories by their group id and return the most recently published revision. I use external versioning, and the value is constructed using the timestamp of when the revision was updated. Update events come in and if they are out of order only the most recent version is indexed.
Now I'm looking to remove the necessary aggregation stage and maintain a separate index which contains only the most recent revision of every story group. In theory it should be simple enough to make the document id the story group id and construct the version as the concatenation of publish time and update time. This way every event that comes in will only be indexed if it is for the most recently published revision and is the most recent update. However, the timestamps for the publish and update time require 64 bits because of the required precision (updates can come very rapidly). If the version could be a 128-bit integer this design could be accomplished.
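For what it's worth, the concatenation described above is simple to express once 128-bit values are available; a minimal sketch, with the timestamp units and field names assumed:

```java
import java.math.BigInteger;

public class CompositeVersion {
    /** Concatenate two non-negative 64-bit timestamps into one 128-bit version number. */
    static BigInteger version(long publishTimeNanos, long updateTimeNanos) {
        return BigInteger.valueOf(publishTimeNanos)
                .shiftLeft(64)
                .or(BigInteger.valueOf(updateTimeNanos));
    }

    public static void main(String[] args) {
        // Later publish time always wins; ties fall back to the later update time.
        BigInteger v1 = version(1_000_000_000L, 5L);
        BigInteger v2 = version(1_000_000_000L, 6L);
        System.out.println(v1.compareTo(v2) < 0);   // true
    }
}
```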
Some use-cases described on this issue do not need biginteger/bigdecimal:

- If you only need the value to be returned as part of the _source document all the time, you can map it as a disabled object to tell Elasticsearch to store it but not try to index or add doc values for it (meaning it will be returned but can't be searched or aggregated).
- Mapping it as a keyword will allow supporting exact queries and terms aggregations.

In general it looks like there is more interest in big integers than decimals. In particular, some use-cases look like they could benefit from unsigned 128-bit integers because they need ordering, which keyword cannot provide. It's still unknown whether range queries would be needed on such a field, however.
There seems to be less traction for big decimals. @jeffknupp Can you clarify what operations you would like to run on these big decimal fields (exact queries? range queries? sorting? aggregations?).
cc @elastic/es-search-aggs
For Ethereum contracts, integers default to 256 bits, so this is an issue. Lucene doesn't support integers that large, so that seems out of the question, but 128 bits would cover a far larger set of values for aggregation, analysis, querying, etc.
@tyre What kind of aggregations and querying would you perform on such a field?
@jpountz off the top of my head: sum, average, moving average, percentile, percentile rank, filter
sum, average, moving average, percentile, percentile rank
I don't think we will ever support these aggregations on large integers. Numeric aggregations use doubles internally, so either we support big integers but still use doubles internally, in which case having big integers is pointless since they could just be indexed as doubles instead, or we try to make aggregations support wider data types, which would make them slower, which is also something we want to avoid. So I don't see this happening. The only aggregation that we could support on big integers would be the range aggregation.
Some data: Beats are interested in supporting uint64, which they typically need for OS-level counters, and they would be fine with the accuracy loss due to the fact that these numbers would be converted to doubles for aggregations.
Do we have any update on this? Is there still no plan to support BigInteger and BigDecimal officially?
@insukcho No, no support for BigInteger and BigDecimal. Note that the naming may be a bit confusing due to the fact that what some datastores call a bigint maps to our longs. For instance, both MySQL's BIGINT and PostgreSQL's bigint are 64-bit integers, just like Elasticsearch's long.
We discussed this issue in FixitFriday and agreed to implement 64-bit unsigned integers. I opened #32434. Thanks all for the feedback.
@jpountz Thanks for taking the time to keep this on the radar.
Initially, my interest in this issue was not to have a custom/new datatype per se, but to have support for BigDecimal/BigInteger (the Java objects) in the Elasticsearch API (TransportClient using BulkProcessor, to be specific). I had to implement a generic number normalization to bring everything to its pure, non-scientific-notation representation in order to send data properly to Elasticsearch, because when I tried to simply proxy my ETL input to the Elasticsearch client, I'd get an error because BigDecimal/BigInteger don't have a mapped type in the translating API. To be honest, I first hit that issue on a 2.4.x cluster/API, and I'm on the way to finishing a migration to 6.3.x and have not tried removing the numeric normalization to see if the limitation still exists (please feel free to point me to any obscure point in the changelogs or a commit that would make me happy).
Although I'm sure a 64-bit uint will solve most issues for people who wanted a new datatype for really long numbers, this issue of mine doesn't get attention by proxy with that implementation. Are there plans to support in any way the translation of BigDecimal/BigInteger from the Java client's perspective (even if it means an error/warning when the value would incur precision loss)?
I would expect this issue to be specific to the transport client, which we want to replace with a new REST client: the high-level REST client, as opposed to the low-level REST client, which doesn't try to understand requests and responses and only works with maps of maps. With a REST client, BigIntegers wouldn't be transported any differently from shorts, ints and longs, so I would expect things to work as long as the values that your big integers store are in the acceptable range of the mapping type, e.g. -2^63 to 2^63-1 for long.
Lucene now has sandbox support for BigInteger (LUCENE-7043), and hopefully BigDecimal will follow soon. We should look at what needs to be done to support them in Elasticsearch.
I propose adding big_integer and big_decimal types which have to be specified explicitly - they shouldn't be a type which can be detected by dynamic mapping.

Many languages don't support big int/decimal. JavaScript will convert to floats or throw an exception if a number is out of range. This can be worked around by always rendering these numbers in JSON as strings. We can possibly accept known bigints/bigdecimals as numbers, but there are a few places where this could be a problem:
The above could be worked around by telling Jackson to parse floats and ints as BIG* (USE_BIG_DECIMAL_FOR_FLOATS and USE_BIG_INTEGER_FOR_INTS), but this may well generate a lot of garbage for what is an infrequent use case.

Alternatively, we could just say that Big* should always be passed in as strings if they are to maintain their precision.
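As a concrete sketch of the Jackson option mentioned above (configuration only; whether doing this globally is worth the extra allocations is exactly the open question):

```java
import java.util.Map;

import com.fasterxml.jackson.databind.DeserializationFeature;
import com.fasterxml.jackson.databind.ObjectMapper;

public class JacksonBigNumbers {
    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper()
                .enable(DeserializationFeature.USE_BIG_DECIMAL_FOR_FLOATS)
                .enable(DeserializationFeature.USE_BIG_INTEGER_FOR_INTS);

        Map<?, ?> doc = mapper.readValue("{\"n\": 3.141592653589793238462643383279}", Map.class);
        System.out.println(doc.get("n"));              // full-precision value, not a rounded double
        System.out.println(doc.get("n").getClass());   // class java.math.BigDecimal
    }
}
```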