cwurm closed this 5 years ago
+1 to shortening them. I thought they seemed long.
What about shortening the hash and using a different encoding? Base64 packs more information into smaller strings.
https://play.golang.org/p/VHo_7ZIlAja
len(hash): 32 bytes (full sha256 hash)
hex: a948904f2f0f479b8f8197694b30184b0d2ed1c1cd2a1ec0fb85d299a192a447 ( 64 chars )
base32: VFEJATZPB5DZXD4BS5UUWMAYJMGS5UOBZUVB5QH3QXJJTIMSURDQ==== ( 56 chars )
base64: qUiQTy8PR5uPgZdpSzAYSw0u0cHNKh7A-4XSmaGSpEc= ( 44 chars )
len(hash): 16 bytes (truncated at half size)
hex: a948904f2f0f479b8f8197694b30184b ( 32 chars )
base32: VFEJATZPB5DZXD4BS5UUWMAYJM====== ( 32 chars )
base64: qUiQTy8PR5uPgZdpSzAYSw== ( 24 chars )
len(hash): 12 bytes
hex: a948904f2f0f479b8f819769 ( 24 chars )
base32: VFEJATZPB5DZXD4BS5UQ==== ( 24 chars )
base64: qUiQTy8PR5uPgZdp ( 16 chars )
@andrewkroh base64 is an excellent idea!
As to the length, I would probably be comfortable with 12 bytes. The chance of a collision among 1 billion IDs is about 6 in a trillion (roughly 6.3 × 10^-12) according to p = (n(n-1)/2) * (1/2^96) from Pro Git.
What do you think?
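To compare candidate lengths, the formula quoted above can be evaluated directly. The following is a rough Go sketch, assuming the IDs behave like uniformly random values. The function name `collisionProbability` is an illustration, and the result is the usual birthday approximation, which is only meaningful while it stays well below 1.

```go
package main

import (
	"fmt"
	"math"
)

// collisionProbability approximates the chance that at least two of n
// uniformly random IDs with the given number of bits collide, using
// p = (n(n-1)/2) * (1/2^bits).
func collisionProbability(n float64, bits int) float64 {
	pairs := n * (n - 1) / 2
	return pairs * math.Pow(2, -float64(bits))
}

func main() {
	n := 1e9 // 1 billion IDs, as in the estimate above
	for _, nBytes := range []int{8, 12, 16, 32} {
		p := collisionProbability(n, nBytes*8)
		fmt.Printf("%2d bytes: p ~ %.3g\n", nBytes, p)
	}
}
```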
SGTM
So the current proposal is therefore to pack more information in the string with base64, and then truncate to 12?
I agree with your analysis of the impact of dupes, I think it's an acceptable risk. So the proposal sounds good to me.
Now where should this change take place, and when (a.k.a. for what stack version)?
Just saw the tag. This is limited to Auditbeat?
> So the current proposal is therefore to pack more information in the string with base64, and then truncate to 12?
Yes.
> Now where should this change take place, and when (a.k.a. for what stack version)?
The 6.7 release is underway as we speak and this is a breaking change, so I would say 7.0.
> This is limited to Auditbeat?
Only the Auditbeat system module fills these values at the moment, so yes.
> So the current proposal is therefore to pack more information in the string with base64, and then truncate to 12?
Yes.
The other way around though: Truncate to 12 bytes, then base64 it.
But wouldn't you rather start by packing more information into the shorter string, and only then truncating? This ensures the maximum amount of possible permutations.
> But wouldn't you rather start by packing more information into the shorter string, and only then truncating? This ensures the maximum amount of possible permutations.
Doesn't the same information end up in the resulting string?
If we always end up with 8 bytes that are completely random, then there's no difference between the two - they're both utilizing the complete space.
The entity IDs for users, processes, etc. introduced with https://github.com/elastic/beats/pull/10500 following a discussion in https://github.com/elastic/beats/issues/10463 are pretty long SHA-256 hashes, represented as 64 hexadecimal characters.
I'm concerned about how much storage space they need, especially when there are several in a document. For example, a socket is going to have three IDs: user, process, socket (and a shorter fourth, host.id). Furthermore, they are random values, so they are hard to compress or otherwise store efficiently.
A value collision is very unlikely for us, for a few reasons:
The impact of a collision does not seem fatal either:
cardinality agg itself. And since the entity ID values are calculated from other fields, it would always be possible to distinguish documents by looking at these other fields. The entity ID is just a shortcut to avoid having to do that all the time.
The question would be what length to truncate to:
host.id. It is actually used to compute the entity IDs. Being more unique than the most unique input seems unnecessary.
What do people think?
@elastic/secops @tsg @andrewkroh