Open imotov opened 5 years ago
Pinging @elastic/es-search
We discussed this internally and we are not sure of the overall gain since only one uuid or checksum is used per index in general. Yet we have another proposal that would be easier for users:
Today the _id field is able to detect automatically that the input is a numeric or a base64 string and performs a conversion to optimize the indexed form (see here). That's a nice feature since users don't need to set the type of _id they use to optimize the indexation. We could have the same reasoning here and add automatic detection of uuid or hex strings in a dedicated field type (_id is reserved). The closest type that we have is the binary field which accepts base64 strings only so an idea that popped up is that we could extend the format that the binary field accepts and implement automatic detection to convert the binary string into an indexed form like we do for the _id field. The definition of such field would be very simple:
"properties": {
  "file_md5": {
    "type": "binary",
    "index": true
  }
}
and multiple formats (base64, uuid) could be used on the same field since detection would be per value. Does that make sense @imotov ?
We can definitely reuse binary. I would prefer to specify the format explicitly instead of automatically detecting it, though. Distinguishing between base64 and hex encoding can be impossible in some cases and quite costly in others (for example, we can have a long hex-encoded string that will fail to decode as base64 only at the end). I also think that users are unlikely to store checksums, uuids and base64-encoded data in the same field. It could happen accidentally, though, and having a strict format check would be quite beneficial here. We are reducing leniency in parsing data in most other areas, so I think it would be a step backwards to introduce it here.
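The ambiguity is easy to reproduce. The following is a minimal Python sketch using only the standard library; the sample values are arbitrary illustrations and are not taken from the discussion. It shows one string that decodes successfully both as hex and as base64 (yielding different bytes), and another valid hex string that is rejected as base64 only because of its length:

```python
import base64
import binascii

# Arbitrary sample: every character is valid in both the hex and base64 alphabets.
value = "deadbeef"

as_hex = bytes.fromhex(value)                       # 4 bytes
as_base64 = base64.b64decode(value, validate=True)  # 6 bytes

# Both decodings succeed, so the input alone cannot say which one was intended.
ambiguous = as_hex != as_base64

# A valid hex string whose length is not a multiple of 4 is rejected as base64.
long_hex = "ab" * 17  # 34 characters
try:
    base64.b64decode(long_hex, validate=True)
    rejected_as_base64 = False
except binascii.Error:
    rejected_as_base64 = True

print(len(as_hex), len(as_base64), ambiguous, rejected_as_base64)
```

An explicit format setting avoids both problems, since each value is checked against exactly one decoder.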
My use case is data deduplication. I'm expecting several objects with the same sha1 value to collapse.
"properties": {
  "file_md5": {
    "type": "binary",
    "format": "base16", # Normal hex string (e.g. sha1)
    "index": true # Effective search is needed
  }
}
Pinging @elastic/es-search (Team:Search)
Pinging @elastic/es-search-foundations (Team:Search Foundations)
Machine-generated data frequently contains data elements that can be described as fixed-size arrays of random bytes. Two main examples of such data are UUIDs and checksums such as md5.
An important common attribute of such data is that the generated strings don't have any affinity by design. Think of the md5 checksums of files in the same directory. Most likely two consecutive values will either be completely the same or completely different, with no common prefixes or suffixes. Therefore they are not easily compressible.
The common usage patterns for such data are aggregations (to analyze all events for the object with the same id or find related events that share an object with the same hash) and filtering (narrow down a search to a particular container or find all instances of the same file by checksum).
The current official approach is to index these fields as keywords: https://www.elastic.co/guide/en/ecs/1.1/ecs-hash.html
When we index a set of UUIDs as a keyword field, we most likely end up indexing each UUID as a 36-byte string, since most compression methods will fail unless the UUIDs were generated on the same machine at approximately the same time. With md5 it is even worse, although we are dealing with slightly smaller 32-byte strings (saving 4 bytes on the '-' characters present in a uuid). That does not seem optimal, since both uuid and md5 values are actually 16 bytes long. So by knowing that we are dealing with a hexadecimal number, we can halve the term size as well as add index-time validation for input data.
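The size arithmetic above can be checked with a short Python sketch. It uses only the standard library; the md5 value is the well-known digest of empty input, used here purely as a sample:

```python
import uuid

u = uuid.uuid4()
md5_hex = "d41d8cd98f00b204e9800998ecf8427e"  # md5 of empty input, sample value only

uuid_text_len = len(str(u))                # textual form, including 4 dashes
uuid_raw_len = len(u.bytes)                # raw form
md5_text_len = len(md5_hex)                # hex-encoded form
md5_raw_len = len(bytes.fromhex(md5_hex))  # raw form

print(uuid_text_len, uuid_raw_len)  # 36 16
print(md5_text_len, md5_raw_len)    # 32 16
```

This only compares the length of each term in its text and raw forms; the actual on-disk saving in an index would depend on how terms are stored and compressed.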
We could achieve that by adding a new data type that could look something like this:
The format can be hex, base64, or uuid and we can limit size to some reasonable number. Underneath, the type will behave as keyword except on indexing we will parse the string into bytes and on the output we will convert the bytes into the string representation according to the specified format.
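The parse-on-index and format-on-output behavior described above could be sketched roughly as follows. This is a Python illustration only, not Elasticsearch code: the function names `parse` and `render` and the `MAX_SIZE` limit are hypothetical, and the standard-library decoders used here are somewhat more lenient than a strict field type would probably want to be.

```python
import base64
import uuid

MAX_SIZE = 64  # hypothetical limit in bytes; the proposal only says "some reasonable number"


def parse(value: str, fmt: str) -> bytes:
    """Convert the string form into raw bytes (what would be indexed)."""
    if fmt == "hex":
        raw = bytes.fromhex(value)  # note: tolerates whitespace between byte pairs
    elif fmt == "base64":
        raw = base64.b64decode(value, validate=True)
    elif fmt == "uuid":
        raw = uuid.UUID(value).bytes  # note: also accepts forms without dashes
    else:
        raise ValueError(f"unknown format: {fmt}")
    if len(raw) > MAX_SIZE:
        raise ValueError(f"value is {len(raw)} bytes, limit is {MAX_SIZE}")
    return raw


def render(raw: bytes, fmt: str) -> str:
    """Convert raw bytes back into the string form for the configured format."""
    if fmt == "hex":
        return raw.hex()
    if fmt == "base64":
        return base64.b64encode(raw).decode("ascii")
    if fmt == "uuid":
        return str(uuid.UUID(bytes=raw))
    raise ValueError(f"unknown format: {fmt}")


sample = "123e4567-e89b-12d3-a456-426614174000"  # arbitrary sample UUID
stored = parse(sample, "uuid")
print(len(stored), render(stored, "uuid") == sample)
```

Malformed input raises `ValueError` in all three branches (`binascii.Error` is a subclass of it), which corresponds to the index-time validation mentioned earlier.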