ageron / handson-ml

⛔️ DEPRECATED – See https://github.com/ageron/handson-ml3 instead.
Apache License 2.0

Need help understanding crc hash used to explain test train split in Chapter 2 #655

Open paliwalvarad opened 2 years ago

paliwalvarad commented 2 years ago

Hi there, in the following code, I fail to understand how the hash function works and how it makes sure that the test set is randomly selected. Can you shed some light on this, please? Thank you.

```python
from zlib import crc32
import numpy as np

def test_set_check(identifier, test_ratio):
    return crc32(np.int64(identifier)) & 0xffffffff < test_ratio * 2**32
```
ageron commented 2 years ago

Hi @paliwalvarad ,

Sure. The CRC-32 algorithm takes the identifier and converts it to a (somewhat) pseudo-random 32-bit number. In other words, we're using the crc32() function as a hash function: the same identifier will always produce the same CRC-32 "hash", but two different identifiers will generally produce very different (seemingly random) hashes.
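You can verify both properties in a couple of lines (a minimal sketch; the sample identifiers 42 and 0–4 are arbitrary):

```python
from zlib import crc32
import numpy as np

# The same identifier always maps to the same 32-bit hash:
h1 = crc32(np.int64(42))
h2 = crc32(np.int64(42))
print(h1 == h2)  # deterministic

# Nearby identifiers scatter across the full 32-bit range:
hashes = [crc32(np.int64(i)) for i in range(5)]
print(hashes)
```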

The & 0xffffffff can be removed: it doesn't hurt, but it was only needed in Python 2 to convert the output of crc32() from a signed 32-bit integer to an unsigned 32-bit integer.
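A quick check that the mask is a no-op in Python 3 (the identifier 123 is just an example):

```python
from zlib import crc32
import numpy as np

h = crc32(np.int64(123))
# In Python 3, crc32() already returns an unsigned 32-bit value,
# so masking with 0xffffffff changes nothing:
print(h & 0xffffffff == h)
```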

So now we have a 32-bit unsigned integer. We check whether it's smaller than test_ratio * 2**32. The largest possible CRC-32 value is 0xffffffff, which is 2**32 - 1. So if test_ratio is equal to, say, 0.2, then the pseudo-random CRC-32 will only be smaller than 0.2 * 2**32 about 20% of the time, on average over all possible identifiers.

In short, this function takes an identifier and a test_ratio, and it computes a CRC-32 of the identifier to determine whether or not it belongs to the test set. Each identifier has a probability equal to the test_ratio of being assigned to the test set.

Since each identifier is assigned randomly with probability test_ratio, the actual test set size may not be exactly test_ratio times the dataset size, especially for small datasets. If that's a problem, then you can use one of the other approaches, such as shuffling the dataset and using the first 20% for testing (for example).

Hope this helps.