ageron / handson-ml

⛔️ DEPRECATED – See https://github.com/ageron/handson-ml3 instead.
Apache License 2.0

Need help understanding crc hash used to explain test train split in Chapter 2 #655

Open paliwalvarad opened 2 years ago

paliwalvarad commented 2 years ago

Hi there, in the following code, I fail to understand how the hash function works and how it makes sure that the test set is randomly selected. Can you shed some light on this, please? Thank you.

```python
from zlib import crc32
import numpy as np

def test_set_check(identifier, test_ratio):
    return crc32(np.int64(identifier)) & 0xffffffff < test_ratio * 2**32
```
ageron commented 2 years ago

Hi @paliwalvarad ,

Sure. The CRC-32 algorithm takes the identifier and converts it to a (somewhat) pseudo-random 32-bit number. In other words, we're using the crc32() function as a hash function: the same identifier will always produce the same CRC-32 "hash", but two different identifiers will generally produce very different (seemingly random) hashes.
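You can verify both properties in a couple of lines (a minimal sketch; the sample identifiers 42 and 0–4 are arbitrary):

```python
from zlib import crc32
import numpy as np

# The same identifier always maps to the same 32-bit hash:
h1 = crc32(np.int64(42))
h2 = crc32(np.int64(42))
print(h1 == h2)  # deterministic

# Nearby identifiers scatter across the full 32-bit range:
hashes = [crc32(np.int64(i)) for i in range(5)]
print(hashes)
```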

The & 0xffffffff can be removed: it doesn't hurt, but it was only needed in Python 2 to convert the output of crc32() from a signed 32-bit integer to an unsigned 32-bit integer.
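A quick check that the mask is a no-op in Python 3 (the identifier 123 is just an example):

```python
from zlib import crc32
import numpy as np

h = crc32(np.int64(123))
# In Python 3, crc32() already returns an unsigned 32-bit value,
# so masking with 0xffffffff changes nothing:
print(h & 0xffffffff == h)
```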

So now we have a 32-bit unsigned integer. We check whether it's smaller than test_ratio * 2**32. The largest possible CRC-32 value is 0xffffffff, which is 2**32 - 1. So if test_ratio is equal to, say, 0.2, then the pseudo-random CRC-32 will only be smaller than 0.2 * 2**32 about 20% of the time, on average over all possible identifiers.

In short, this function takes an identifier and a test_ratio, and it computes a CRC-32 of the identifier to determine whether or not it belongs to the test set. Each identifier has a probability equal to the test_ratio of being assigned to the test set.

Since each identifier is assigned randomly with probability test_ratio, the actual test set size may not be exactly test_ratio times the dataset size, especially for small datasets. If that's a problem, then you can use one of the other approaches, such as shuffling the dataset and using the first 20% for testing (for example).

Hope this helps.