[SecOps][Discuss] Shorten entity IDs

cwurm commented 5 years ago

The entity IDs for users, processes, etc. introduced with https://github.com/elastic/beats/pull/10500 following a discussion in https://github.com/elastic/beats/issues/10463 are pretty long SHA-256 hashes, represented as 64 hexadecimal characters.

I'm concerned about how much storage space they need, esp. when there are several in a document. For example, a socket is going to have three IDs: user, process, socket (and a shorter fourth, host.id). Furthermore, they are random values, so hard to compress or store efficiently otherwise.

A value collision is very unlikely for us, for a few reasons:

With time-based data, a user would not be looking at all data in an ES cluster, but only ever a subset of it, e.g. a few days at most. Only collisions in the searched time frame are really relevant.
For host-based data, users are most likely going to look at data from one host only, which significantly reduces the searched scope.
We have several entity IDs in different fields, so only collisions in the same field are relevant. To be fair, some entity IDs are going to be pretty common, e.g. for a user.

The impact of a collision does not seem fatal either:

We would not be missing any events, it would only really affect aggregation results on the entity ID. Some of these aggregations are approximate at large cardinalities anyway, e.g. Elasticsearch's cardinality agg itself. And since the entity ID values are calculated from other fields, it would always be possible to distinguish documents by looking at these other fields. The entity ID is just a shortcut to not have to do that all the time.

The question would be what length to truncate to:

32 characters / 128 bits: This is the length of a UUID, which we already have as the host.id. It is actually used to compute the entity IDs. Being more unique than the most unique input seems unnecessary.
16 characters / 64 bits: roughly 1 in 37 million (2.7 × 10^-8) chance of a collision among 1 million IDs, 2.7% chance among 1 billion IDs.
Something between the two above.

What do people think?

@elastic/secops @tsg @andrewkroh

andrewkroh commented 5 years ago

+1 to shortening them. I thought they seemed long.

What about shortening the hash and using a different encoding? Base64 packs more information into smaller strings.

https://play.golang.org/p/VHo_7ZIlAja

len(hash): 32 bytes (full sha256 hash)
hex:       a948904f2f0f479b8f8197694b30184b0d2ed1c1cd2a1ec0fb85d299a192a447 ( 64 chars )
base32:    VFEJATZPB5DZXD4BS5UUWMAYJMGS5UOBZUVB5QH3QXJJTIMSURDQ==== ( 56 chars )
base64:    qUiQTy8PR5uPgZdpSzAYSw0u0cHNKh7A-4XSmaGSpEc= ( 44 chars )

len(hash): 16 bytes (truncated at half size)
hex:       a948904f2f0f479b8f8197694b30184b ( 32 chars )
base32:    VFEJATZPB5DZXD4BS5UUWMAYJM====== ( 32 chars )
base64:    qUiQTy8PR5uPgZdpSzAYSw== ( 24 chars )

len(hash): 12 bytes
hex:       a948904f2f0f479b8f819769 ( 24 chars )
base32:    VFEJATZPB5DZXD4BS5UQ==== ( 24 chars )
base64:    qUiQTy8PR5uPgZdp ( 16 chars )

cwurm commented 5 years ago

@andrewkroh base64 is an excellent idea!

As to the length, I would probably be comfortable with 12 bytes. The chance of a collision among 1 billion IDs is about 6 in a trillion (6.3 × 10^-12, or roughly 1 in 160 billion), according to p = (n(n-1)/2) * (1/2^96) from Pro Git.
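
To check figures like this for different lengths, here is a minimal Go sketch of the approximation. `collisionProbability` is a hypothetical helper, it assumes the IDs are uniformly random, and the formula only holds while p is small.

```go
package main

import (
	"fmt"
	"math"
)

// collisionProbability approximates the chance of at least one collision
// among n uniformly random IDs of the given bit length, using
// p = (n(n-1)/2) * (1/2^bits). The approximation is only valid for small p.
func collisionProbability(n float64, bits int) float64 {
	return n * (n - 1) / 2 / math.Pow(2, float64(bits))
}

func main() {
	for _, bits := range []int{64, 96, 128} {
		for _, n := range []float64{1e6, 1e9} {
			fmt.Printf("%3d bits, %.0e IDs: p = %.2e\n",
				bits, n, collisionProbability(n, bits))
		}
	}
}
```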

What do you think?

andrewkroh commented 5 years ago

SGTM

webmat commented 5 years ago

So the current proposal is therefore to pack more information in the string with base64, and then truncate to 12?

I agree with your analysis of the impact of dupes, I think it's an acceptable risk. So the proposal sounds good to me.

Now where should this change take place, and when (a.k.a. for what stack version)?

webmat commented 5 years ago

Just saw the tag. This is limited to Auditbeat?

cwurm commented 5 years ago

> So the current proposal is therefore to pack more information in the string with base64, and then truncate to 12?

Yes.

> Now where should this change take place, and when (a.k.a. for what stack version)?

With the 6.7 release underway as we speak, and this being a breaking change, I would say 7.0.

> This is limited to Auditbeat?

Only the Auditbeat system module fills these values at the moment, so yes.

cwurm commented 5 years ago

> So the current proposal is therefore to pack more information in the string with base64, and then truncate to 12?

Yes.

The other way around though: Truncate to 12 bytes, then base64 it.

webmat commented 5 years ago

But wouldn't you rather start by packing more information into the shorter string, and only then truncating? This ensures the maximum amount of possible permutations.

cwurm commented 5 years ago

> But wouldn't you rather start by packing more information into the shorter string, and only then truncating? This ensures the maximum amount of possible permutations.

Doesn't the same information end up in the resulting string?

Truncate 32 bytes down to 12 (losing 62.5% of information), then base64 them into 16 characters.
Base64 32 bytes into 44 characters (43 without padding), then truncate to 16 characters, which also keeps 96 of the 256 bits (losing 62.5%).

If we always end up with 16 characters that are completely random, then there's no difference between the two - they're both utilizing the complete space.

elastic / beats

[SecOps][Discuss] Shorten entity IDs #11348