AlexZeDim / cmnw

17 Nest.js mircroservices within one monorepo which serves a common-goal purpose of indexing information from various sources within the community.
https://cmnw.me
2 stars 0 forks source link

OSINT: crc32 has poor hash quality #27

Closed RedGateway closed 3 years ago

RedGateway commented 3 years ago
  1. For db of all EU characters, there will be at least one hash collision in each large guild roster.
  2. CRC codes are not designed for hashing.

https://en.wikipedia.org/wiki/Birthday_problem https://eklitzke.org/crcs-vs-hash-functions

AlexZeDim commented 3 years ago

I'll be honest, I am humbled. Never thought, that this repo will be reviewed by someone, except me and my friends.

I understand both problems.

Well, let me clarify it, I am using CRC32 for evaluating so called 'hash' not for real hashing the values themselves. It's easy to saw, that the initial values are pets, mount names (and levels) & achievements which will give us a collision by default. Especially for low-level characters. So having a small collision chance on every million characters I consider as a small price.

The real reason why I choose checksum algorithms (the first version was Adler32) instead of real hashing, is their performance over sha1, md5, and so on. Since I accept the collision as a necessary evil, I don't want to waste my process time on calculating real-unique-with-avalanche-effect and so on.

But if you have candidates over crc32 feel free to propose. I could integrate them aswell.

RedGateway commented 3 years ago

Unfortunately, there aren't many repositories on github with code that searches for alts in WOW. I have my own solution in the form of an addon and scripts (not on github), so I will keep an eye on your code :-)

Of course, pets, mounts and achievements give collisions, but is there so much? Let's figure it out.

For achievements. Having, for example, 20 independent achievements for each character, and counting at least the day of completion as random, we will get the following number: 7 days ^ 20 ~= 2^56. We didn't take into account the number of permutations in the sequence of achievements, distributions, and much more, but the number already looks large.

For mounts, without distributions, you can't get plausible numbers, but here's a very rough estimate. Number of combinations of 30 equally probable mounts, with at least 10 such mounts: 30045015

Now back to crc32. For simplicity, ignore the "biased output". Then the probability of one collision for db with 70k characters is already ~= 50%. For 20 million characters, taking into account the quality of the function, the order of the number of collisions will already be about one for every thousandth character, or more often. This is a lot!

Conclusion: the quality of the function is not so important, but the bit depth should be increased to 48 or 52 bits (more than 53 does not fit into the double type).

As functions, you can take any fast one, such as FNV-1a or from the Murmur family, by combining their hashes.

RedGateway commented 3 years ago

Made a test for collisions https://gist.github.com/RedGateway/3dfab8a93df66cff8bcebea1cac60919 The results are below. So if there are a lot of characters - in each guild wait for the black sheep.

CRC32 run: Added 16000000 values, one collision for 538 elements Total values: 16000000, collisions: 29791 in 34.002 seconds

murmur2 run: Added 16000000 values, one collision for 540 elements Total values: 16000000, collisions: 29610 in 25.854 seconds

CRC32 + murmur2 run: Total values: 16000000, collisions: 0 in 35.601 seconds

SipHash run: Total values: 16000000, collisions: 0 in 55.739 seconds

AlexZeDim commented 3 years ago

The character's DB itself has only ~12M values, That number included all characters across the EU region (238 realms) and growing (very) slowly. I would say, that this number. or less is something like all real-active game characters during the last 4 months. First levels twinks don't count

The real problem is multiple look-a-like characters, like freshly boosted 60-levels, and low-levels ~10. Usually, they have almost the same pets & mount collections and BoA achievements. Which has made them almost impossible to identify, if their owners won't like to do it by themselves, via the OAuth endpoint.

I heard about murmur family functions. But actually, SipHash deserves, to take a look.

P.S. I remember that day when I run almost the same test between Adler-32 and CRC-32c. Functions with small output, >9 symbols. Having me this information... well the decision that was made, could be much better.

RedGateway commented 3 years ago

The real problem is multiple look-a-like characters

Yep, some chars are undistinguishable (hope so is their behavior). In general it's about managing good false positive / FN ratio, tuned by your needs, and combining all sources of entropy (for me, FP is much worse, and i prefer to say: it's a fresh char, we can't identify it). The good news: with enought hash depth bad chars are bounded to <10% of all char base (1M in your case), and collide only with each other (with a highly reduced chance). For the rest, we can have ~0 FPs/FNs.

Indeed, there are many questions, for example: whether to use Locality Sensitive Hashing for volatile data such as a list of pets / mounts, or just wait for the characters to be indexed eventually when changes occur; or use additional data for matching (achievements). And what to do with the fact that the dates of achievements occasionally differ in (now) one account, for example: Шэдоуфлэйм-gordunni and Мастершот-gordunni, achievements 5749, 2142, 2078 etc. A lot of fun :)

AlexZeDim commented 3 years ago

@RedGateway I try to take the best from your advice and take a close look at SipHash & MurMurHash vector. As for now, I implement Hash64 function from Google's FarmHash.

Since it gives me BigNumber values from this operation. And hex(hash) values like: 19AB A81D 8E69 2397 I guess it is much safer than before, and have a pretty cool look.

So, since crc-32c no more, I guess this Issue could be closed for now. But don't be shy and feel free to comment or create more issues, if you find anything useful to the case.

I guess you know, that I really appreciate it.

...and one more thing, to which character should I mail in-game gold as a thankful gift? You could mention the nickname to the Discord, or I will find way.