elastic / beats

:tropical_fish: Beats - Lightweight shippers for Elasticsearch & Logstash
https://www.elastic.co/products/beats

[SecOps][Discuss] Shorten entity IDs #11348

Closed cwurm closed 5 years ago

cwurm commented 5 years ago

The entity IDs for users, processes, etc. introduced with https://github.com/elastic/beats/pull/10500 following a discussion in https://github.com/elastic/beats/issues/10463 are pretty long SHA-256 hashes, represented as 64 hexadecimal characters.

I'm concerned about how much storage space they need, esp. when there are several in a document. For example, a socket is going to have three IDs: user, process, socket (and a shorter fourth, host.id). Furthermore, they are random values, so hard to compress or store efficiently otherwise.

A value collision is very unlikely for us, for a few reasons:

The impact of a collision does not seem fatal either:

The question would be what length to truncate to:

  1. 32 characters / 128 bits: This is the length of a UUID, which we already have as the host.id. It is actually used to compute the entity IDs. Being more unique than the most unique input seems unnecessary.
  2. 16 characters / 64 bits: roughly a 1 in 37 million (2.7 × 10^-8) chance of a collision among 1 million IDs, 2.7% chance among 1 billion IDs.
  3. Something between the two above.

What do people think?

@elastic/secops @tsg @andrewkroh

andrewkroh commented 5 years ago

+1 to shortening them. I thought they seemed long.

What about shortening the hash and using a different encoding? Base64 packs more information into smaller strings.

https://play.golang.org/p/VHo_7ZIlAja

len(hash): 32 bytes (full sha256 hash)
hex:       a948904f2f0f479b8f8197694b30184b0d2ed1c1cd2a1ec0fb85d299a192a447 ( 64 chars )
base32:    VFEJATZPB5DZXD4BS5UUWMAYJMGS5UOBZUVB5QH3QXJJTIMSURDQ==== ( 56 chars )
base64:    qUiQTy8PR5uPgZdpSzAYSw0u0cHNKh7A-4XSmaGSpEc= ( 44 chars )

len(hash): 16 bytes (truncated at half size)
hex:       a948904f2f0f479b8f8197694b30184b ( 32 chars )
base32:    VFEJATZPB5DZXD4BS5UUWMAYJM====== ( 32 chars )
base64:    qUiQTy8PR5uPgZdpSzAYSw== ( 24 chars )

len(hash): 12 bytes
hex:       a948904f2f0f479b8f819769 ( 24 chars )
base32:    VFEJATZPB5DZXD4BS5UQ==== ( 24 chars )
base64:    qUiQTy8PR5uPgZdp ( 16 chars )
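
For readers who cannot open the playground link, a comparison like the one above could be produced along these lines. This is only a sketch: the input string and the `encodings` helper are made up for illustration, so the printed values will differ from the output above, and the URL-safe base64 alphabet is assumed because the output contains a `-`.

```go
package main

import (
	"crypto/sha256"
	"encoding/base32"
	"encoding/base64"
	"encoding/hex"
	"fmt"
)

// encodings returns the hex, base32, and URL-safe base64 forms of the
// first n bytes of sum. (Hypothetical helper, not taken from Beats.)
func encodings(sum []byte, n int) (string, string, string) {
	b := sum[:n]
	return hex.EncodeToString(b),
		base32.StdEncoding.EncodeToString(b),
		base64.URLEncoding.EncodeToString(b)
}

func main() {
	// Hypothetical input; the real entity IDs hash other fields.
	sum := sha256.Sum256([]byte("example entity"))

	for _, n := range []int{32, 16, 12} {
		h, b32, b64 := encodings(sum[:], n)
		fmt.Printf("len(hash): %d bytes\n", n)
		fmt.Printf("hex:       %s ( %d chars )\n", h, len(h))
		fmt.Printf("base32:    %s ( %d chars )\n", b32, len(b32))
		fmt.Printf("base64:    %s ( %d chars )\n", b64, len(b64))
	}
}
```

The character counts depend only on the number of input bytes, so they match the figures above regardless of the input.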
cwurm commented 5 years ago

@andrewkroh base64 is an excellent idea!

As to the length, I would probably be comfortable with 12 bytes. The chance of a collision among 1 billion IDs is about 6 in a trillion (6.3 × 10^-12) according to p = (n(n-1)/2) * (1/2^96) from Pro Git.
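
The formula is easy to evaluate for the candidate lengths. The sketch below assumes the IDs behave like uniformly random values, and the function name `collisionProbability` is just for illustration. The approximation is only meaningful while the result stays well below 1.

```go
package main

import (
	"fmt"
	"math"
)

// collisionProbability approximates the chance of at least one collision
// among n uniformly random IDs of the given bit length, using
// p ≈ n(n-1)/2 * 1/2^bits.
func collisionProbability(n float64, bits int) float64 {
	return n * (n - 1) / 2 / math.Pow(2, float64(bits))
}

func main() {
	for _, bits := range []int{64, 96, 128} {
		for _, n := range []float64{1e6, 1e9} {
			fmt.Printf("%3d bits, n=%.0e: p ≈ %.2e\n",
				bits, n, collisionProbability(n, bits))
		}
	}
}
```

For 96 bits (12 bytes) and 1 billion IDs this comes out at about 6.3e-12; for 64 bits and 1 billion IDs it is about 0.027.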

What do you think?

andrewkroh commented 5 years ago

SGTM

webmat commented 5 years ago

So the current proposal is therefore to pack more information in the string with base64, and then truncate to 12?

I agree with your analysis of the impact of dupes, I think it's an acceptable risk. So the proposal sounds good to me.

Now where should this change take place, and when (a.k.a. for what stack version)?

webmat commented 5 years ago

Just saw the tag. This is limited to Auditbeat?

cwurm commented 5 years ago

> So the current proposal is therefore to pack more information in the string with base64, and then truncate to 12?

Yes.

> Now where should this change take place, and when (a.k.a. for what stack version)?

With the 6.7 release underway as we speak this is a breaking change, so I would say 7.0.

> This is limited to Auditbeat?

Only the Auditbeat system module fills these values at the moment, so yes.

cwurm commented 5 years ago

> So the current proposal is therefore to pack more information in the string with base64, and then truncate to 12?

Yes.

The other way around though: Truncate to 12 bytes, then base64 it.

webmat commented 5 years ago

But wouldn't you rather start by packing more information into the shorter string, and only then truncating? This ensures the maximum amount of possible permutations.

cwurm commented 5 years ago

> But wouldn't you rather start by packing more information into the shorter string, and only then truncating? This ensures the maximum amount of possible permutations.

Doesn't the same information end up in the resulting string?

  1. Truncate 32 bytes down to 12 (losing 62.5% of information), then base64 to 16 characters.
  2. Base64 32 bytes to 44 characters, then truncate to 16 (losing 63.6% of information).

If we always end up with 16 characters that are completely random, then there's no difference between the two - they're both utilizing the complete space.