elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Indexing of checksums and UUIDs #46197

Open imotov opened 5 years ago

imotov commented 5 years ago

Machine-generated data frequently contains data elements that can be described as fixed-size arrays of random bytes. Two main examples of such data are checksums (such as md5 hashes of files) and UUIDs.

An important common attribute of such data is that the generated strings have no affinity by design. Think of the md5 checksums of files in the same directory: two consecutive values will most likely either be exactly the same or be completely different, with no common prefixes or suffixes. Therefore they are not easily compressible.

The common use patterns for such data are aggregations (to analyze all events for the object with the same id, or to find related events that share an object with the same hash) and filtering (to narrow down a search to a particular container, or to find all instances of the same file by checksum).

The current official approach is to index these fields as keywords: https://www.elastic.co/guide/en/ecs/1.1/ecs-hash.html

When we index a set of UUIDs as a keyword field, we will most likely end up indexing each UUID as a 36-byte string, since most compression methods will fail unless the UUIDs were generated on the same machine at approximately the same time. With md5 the situation is even worse for compression, although we are dealing with slightly smaller 32-byte strings (saving the 4 bytes of '-' separators present in a uuid). That is not an optimal solution, since both uuid and md5 values are actually 16 bytes long. So it looks like, by knowing that we are dealing with a hexadecimal representation, we can halve the term size as well as add index-time validation for input data.
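
As a quick illustration of the size difference, here is a minimal Java sketch (purely illustrative, not part of any proposal):

import java.nio.ByteBuffer;
import java.util.UUID;

public class UuidSize {
    public static void main(String[] args) {
        UUID uuid = UUID.randomUUID();
        // Text form: 32 hex digits plus 4 '-' separators = 36 bytes.
        System.out.println(uuid.toString().length()); // 36
        // Raw form: two 64-bit halves = 16 bytes.
        byte[] raw = ByteBuffer.allocate(16)
            .putLong(uuid.getMostSignificantBits())
            .putLong(uuid.getLeastSignificantBits())
            .array();
        System.out.println(raw.length); // 16
    }
}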

We could achieve that by adding a new data type that could look something like this:

  "properties": {
    "file_md5": {
      "type": "byte_array",
      "size": 16,
      "format": "hex"
    }
  }

The format can be hex, base64, or uuid, and we can limit size to some reasonable number. Underneath, the type will behave as keyword, except that on indexing we will parse the string into bytes, and on output we will convert the bytes back into the string representation according to the specified format.
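
For illustration, a minimal Java sketch of the parse/validate/format round trip such a type would perform, assuming JDK 17's HexFormat; the class and method names are hypothetical, not Elasticsearch code:

import java.util.HexFormat;

// Hypothetical helper illustrating the behaviour described above for
// "format": "hex" with "size": 16; not the actual implementation.
public class ByteArrayFieldSketch {
    private static final int SIZE = 16;

    // Index time: decode the hex string and validate its length.
    static byte[] parse(String value) {
        byte[] bytes = HexFormat.of().parseHex(value); // throws on non-hex input
        if (bytes.length != SIZE) {
            throw new IllegalArgumentException("expected " + SIZE + " bytes, got " + bytes.length);
        }
        return bytes; // 16 bytes indexed instead of a 32-byte string
    }

    // Output: render the stored bytes back in the declared format.
    static String format(byte[] stored) {
        return HexFormat.of().formatHex(stored);
    }
}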

elasticmachine commented 5 years ago

Pinging @elastic/es-search

jimczi commented 5 years ago

We discussed this internally and we are not sure of the overall gain, since in general only one uuid or checksum is used per index. However, we have another proposal that would be easier for users:

Today the _id field is able to detect automatically that the input is a numeric or a base64 string and converts it to an optimized indexed form (see here). That's a nice feature since users don't need to declare the type of their _id in order to optimize its indexing. We could apply the same reasoning here and add automatic detection of uuid or hex strings in a dedicated field type (_id is reserved). The closest type that we have is the binary field, which accepts base64 strings only, so an idea that popped up is to extend the formats that the binary field accepts and implement automatic detection to convert the binary string into an indexed form, as we do for the _id field. The definition of such a field would be very simple:

"properties": {
    "file_md5": {
      "type": "binary",
      "index": true
    }
  }

and multiple formats (base64, uuid) could be used on the same field since detection would be per value. Does that make sense @imotov ?
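
A rough Java sketch of what such per-value detection could look like (purely illustrative; the heuristics and names are assumptions, not the binary field's actual code):

import java.nio.ByteBuffer;
import java.util.Base64;
import java.util.HexFormat;
import java.util.UUID;

// Illustrative only: try the most restrictive format first, fall back to base64.
public class DetectFormatSketch {
    static byte[] toIndexedForm(String value) {
        // 36 characters with '-' separators: treat as a uuid.
        if (value.length() == 36 && value.charAt(8) == '-') {
            UUID uuid = UUID.fromString(value);
            return ByteBuffer.allocate(16)
                .putLong(uuid.getMostSignificantBits())
                .putLong(uuid.getLeastSignificantBits())
                .array();
        }
        // Even-length string of hex digits: treat as hex.
        if (value.length() % 2 == 0
                && value.chars().allMatch(c -> Character.digit(c, 16) >= 0)) {
            return HexFormat.of().parseHex(value);
        }
        // Otherwise fall back to base64, which binary already accepts.
        return Base64.getDecoder().decode(value);
    }
}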

imotov commented 5 years ago

We can definitely reuse binary. I would prefer to specify the format explicitly instead of automatically detecting it, though. Distinguishing between base64 and hex encoding can be impossible in some cases and quite costly in others (for example, a long hex-encoded string may fail to decode as base64 only at the very end). I also think that users are unlikely to store checksums, uuids, and base64-encoded data in the same field. It could happen accidentally though, and having a strict format check would be quite beneficial here. We are reducing leniency in parsing data in most other areas; I think it would be a step backwards to introduce it here.
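
For example (a small Java illustration of that ambiguity, not from the discussion itself), the same string can decode successfully as both hex and base64, producing different bytes:

import java.util.Base64;
import java.util.HexFormat;

public class AmbiguousEncoding {
    public static void main(String[] args) {
        String value = "deadbeef";
        byte[] asHex = HexFormat.of().parseHex(value);       // 4 bytes: de ad be ef
        byte[] asBase64 = Base64.getDecoder().decode(value); // 6 unrelated bytes
        System.out.println(asHex.length + " vs " + asBase64.length); // 4 vs 6
    }
}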

eugene-bright commented 2 years ago

My use case is data deduplication. I'm expecting several objects with the same sha1 value to collapse.

"properties": {
    "file_md5": {
      "type": "binary",
      "format": "base16",  # Normal hex string (e.g. sha1)
      "index": true  # Effective search is needed
    }
  }

elasticsearchmachine commented 3 months ago

Pinging @elastic/es-search (Team:Search)

elasticsearchmachine commented 2 months ago

Pinging @elastic/es-search-foundations (Team:Search Foundations)