Open anthol42 opened 3 years ago
crc32(np.int64(identifier)) returns a 64-bit signed int; the bitwise AND with 0xffffffff converts it to a 32-bit unsigned int.
Could someone please elaborate on this function? crc32(np.int64(identifier)) is used to compute a hash value, and doing & 0xffffffff normalises the value (source:). Why are we comparing with test_ratio*2**32? How will this condition ever become false?
The goal of this function is to shuffle the instances and put them in the test set (True) or train set (False) in a stable way (i.e., deterministic), based on their identifier.
The CRC32 algorithm is a simple way to achieve this. It's not really meant for generating pseudo-random numbers, to be honest, so it's not ideal for shuffling, but in general it will do fine, and the good thing about it is that it is super fast.
Python's crc32() function takes a bytes string (or an object which can be converted into a bytes string), and it outputs a pseudo-random 32-bit integer. In Python 2, it outputs a signed 32-bit integer:
# Python 2
>>> from zlib import crc32
>>> crc32("abcd")
-310194927
In Python 3, it outputs an unsigned 32-bit integer:
# Python 3
>>> from zlib import crc32
>>> crc32(b"abcd")
3984772369
If we want our code to work the same way in Python 2 and Python 3, we must convert the signed 32-bit integer into an unsigned 32-bit integer. This can be done by masking with & 0xffffffff:
# Python 2
>>> crc32("abcd") & 0xffffffff
3984772369
In Python 3, since the output of crc32() is already an unsigned 32-bit integer, the masking does not do anything at all:
# Python 3
>>> from zlib import crc32
>>> crc32(b"abcd")
3984772369
>>> crc32(b"abcd") & 0xffffffff
3984772369
OK, so this explains the masking: it's just a Python 2/3 compatibility hack. If you are only using Python 3 (and in 2021, you really should), you can safely get rid of & 0xffffffff.
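To see why the mask turns the Python 2 result into the Python 3 one, here is a small Python 3 sketch. The helper to_signed_32 is hypothetical (it is not part of zlib); it only reinterprets the unsigned value the way a signed 32-bit integer would read it, to imitate what Python 2 returned:

```python
from zlib import crc32

def to_signed_32(value):
    """Hypothetical helper: reinterpret an unsigned 32-bit int as signed."""
    return value - 2**32 if value >= 2**31 else value

unsigned = crc32(b"abcd")        # Python 3: already in [0, 2**32 - 1]
signed = to_signed_32(unsigned)  # imitates the value Python 2 printed
masked = signed & 0xffffffff     # the mask maps it back to the unsigned value

print(unsigned, signed, masked)
```

The mask keeps only the low 32 bits, so a negative value ends up shifted by 2**32, while a value that is already non-negative and below 2**32 is left unchanged.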
Note that we cannot call crc32(identifier) directly, because the identifier is a Python integer, which cannot automatically be converted to a bytes string. This is why we call crc32(np.int64(identifier)), since NumPy knows how to convert an int64 to a bytes string.
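A rough way to check this without NumPy is sketched below. It uses struct.pack("=q", ...) as a stand-in for np.int64, on the assumption that NumPy hands crc32() the 8 raw bytes of the integer in native byte order (I have not verified that the two give identical checksums). The identifier value is just an example:

```python
import struct
from zlib import crc32

identifier = 42  # hypothetical example identifier

# A plain Python int does not expose a bytes buffer, so crc32() rejects it.
try:
    crc32(identifier)
    int_accepted = True
except TypeError as error:
    int_accepted = False
    print("TypeError:", error)

# Stand-in for np.int64(identifier): 8 bytes, signed, native byte order.
as_bytes = struct.pack("=q", identifier)
checksum = crc32(as_bytes)
print(len(as_bytes), checksum)
```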
So far so good, we know why we use crc32(np.int64(identifier)) & 0xffffffff. Now what about < test_ratio*2**32?
Well, recall that crc32() will output a pseudo-random 32-bit unsigned integer. So basically a pseudo-random integer between 0 and 2**32-1. So there's roughly a 10% chance that this pseudo-random number will be lower than 10% of 2**32. So if test_ratio=0.1, then about 10% of all possible identifiers will have a CRC32 which is smaller than test_ratio*2**32.
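Here is a minimal sketch of the whole check, to make the comparison concrete. The function name is_in_test_set and the range of identifiers are made up for illustration, and struct.pack("=q", ...) stands in for np.int64 so that the snippet runs without NumPy (an assumption, not the book's exact code):

```python
import struct
from zlib import crc32

def is_in_test_set(identifier, test_ratio):
    """Return True if the identifier's CRC32 falls below test_ratio * 2**32."""
    data = struct.pack("=q", identifier)  # stand-in for np.int64(identifier)
    return crc32(data) < test_ratio * 2**32

identifiers = range(10_000)  # hypothetical identifiers
in_test = [i for i in identifiers if is_in_test_set(i, 0.1)]
fraction = len(in_test) / len(identifiers)
print(f"{fraction:.1%} of the identifiers land in the test set")
```

The condition is False whenever the CRC32 is at or above the threshold, which should be the case for roughly 90% of identifiers when test_ratio=0.1. Since the CRC32 depends only on the identifier, the same instance always lands in the same set.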
I hope this helps, Aurélien
Hi, I'm not a Python pro and I am wondering why we use the bitwise AND on page 52. Could someone explain the purpose of this? Thank you.