Open paliwalvarad opened 2 years ago
Hi @paliwalvarad ,
Sure. The CRC-32 algorithm takes the identifier and converts it to a (somewhat) pseudo-random 32-bit number. In other words, we're using the crc32()
function as a hash function: the same identifier will always produce the same CRC-32 "hash", but two different identifiers will generally produce very different (seemingly random) hashes.
The & 0xffffffff
can be removed: it doesn't hurt, but it was only needed in Python 2 to convert the output of crc32()
from a signed 32-bit integer to an unsigned 32-bit integer.
So now we have a 32-bit signed integer. We check whether it's smaller than test_ratio * 2**32
. The largest possible CRC-32 value is 0xffffffff, which is 232 - 1. So if test_ratio
is equal to, say, 0.2, then the pseudo-random CRC-32 will only be smaller than 232 about 20% of the time, on average over all possible identifiers.
In short, this function takes an identifier and a test_ratio, and it computes a CRC-32 of the identifier to determine whether or not it belongs to the test set. Each identifier has a probability equal to the test_ratio
of being assigned to the test set.
Since each identifier is assigned randomly with probability test_ratio
, it is possible that the actual test set size will not be exactly equal to the test_ratio
, especially for small datasets. If that's a problem, then you can choose to use the other approaches, such as shuffling the dataset and using the first 20% for testing (for example).
Hope this helps.
Hi there, in the following code, I fail to understand how the hash function is working here and how would it make sure that the test set is randomly selected. Can you shed some light on this please? Thank you.