ageron / handson-ml2

A series of Jupyter notebooks that walk you through the fundamentals of Machine Learning and Deep Learning in Python using Scikit-Learn, Keras and TensorFlow 2.
Apache License 2.0

Chapter 2-Why use bitwise and with crc32 #360

Open anthol42 opened 3 years ago

anthol42 commented 3 years ago

Hi, I'm not a Python expert and I am wondering why we use a bitwise AND on page 52.

def test_set_check(identifier, test_ratio):
    return crc32(np.int64(identifier)) & 0xffffffff < test_ratio*2**32

Could someone explain the purpose of this? Thank you.

beam36 commented 3 years ago

In Python 2, crc32(np.int64(identifier)) may return a negative (signed) integer; a bitwise AND with 0xffffffff converts it to a 32-bit unsigned int.

mshsingh772 commented 3 years ago

Could someone please elaborate on this function? crc32(np.int64(identifier)) is used to compute a hash value, and & 0xffffffff normalises the value (source:). Why are we comparing with test_ratio*2**32? How will this condition ever become false?

ageron commented 3 years ago

The goal of this function is to shuffle the instances and put them in the test set (True) or train set (False) in a stable way (i.e., deterministic), based on their identifier.

The CRC32 algorithm is a simple way to achieve this. It's not really meant for generating pseudo-random numbers, to be honest, so it's not ideal for shuffling, but in general it will do fine, and the good thing about it is that it is super fast.

Python's crc32() function takes a bytes string (or any other bytes-like object), and it outputs a pseudo-random 32-bit integer. In Python 2, it outputs a signed 32-bit integer:

# Python 2
>>> from zlib import crc32
>>> crc32("abcd")
-310194927

In Python 3, it outputs an unsigned 32-bit integer:

# Python 3
>>> from zlib import crc32
>>> crc32(b"abcd")
3984772369

If we want our code to work the same way in Python 2 and Python 3, we must convert the signed 32-bit integer into an unsigned 32-bit integer. This can be done by masking with & 0xffffffff:

# Python 2
>>> crc32("abcd") & 0xffffffff
3984772369

In Python 3, since the output of crc32() is already an unsigned 32-bit integer, the masking does not do anything at all:

# Python 3
>>> from zlib import crc32
>>> crc32(b"abcd")
3984772369
>>> crc32(b"abcd") & 0xffffffff
3984772369

Ok, so this explains the masking: it's just a Python 2/3 compatibility hack. If you are only using Python 3 (and in 2021, you really should), you can safely get rid of & 0xffffffff.
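To make the masking concrete, here is a minimal sketch that runs on Python 3 only. It takes the signed value that Python 2 printed in the session above, hard-coded as an integer literal (the variable names are just for illustration), and shows that the mask maps it to the unsigned value, while leaving a Python 3 result unchanged:

```python
from zlib import crc32

# Signed 32-bit value that Python 2's crc32("abcd") printed in the session above.
signed_crc = -310194927

# Masking keeps only the low 32 bits, which reinterprets the value as unsigned.
unsigned_crc = signed_crc & 0xffffffff
print(unsigned_crc)  # 3984772369

# For a negative 32-bit value, this is the same as adding 2**32.
print(signed_crc + 2**32 == unsigned_crc)  # True

# In Python 3, crc32() already returns a value in [0, 2**32 - 1],
# so the mask changes nothing.
py3_crc = crc32(b"abcd")
print(py3_crc == (py3_crc & 0xffffffff))  # True
```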

Note that we cannot call crc32(identifier) directly, because the identifier is a Python integer, which is not a bytes-like object and is not converted to one automatically. This is why we call crc32(np.int64(identifier)): a NumPy int64 exposes its 8 raw bytes, which crc32() can read.
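Here is a small sketch of that point using only the standard library. Note that the int.to_bytes() conversion is my own stand-in for np.int64, not what the book uses, and id_to_bytes is a hypothetical helper name. On a little-endian machine it should produce the same 8 bytes as np.int64(identifier), but that equivalence is not checked here:

```python
from zlib import crc32

identifier = 5

# A plain Python int is not a bytes-like object, so crc32() rejects it.
try:
    crc32(identifier)
except TypeError as err:
    print("TypeError:", err)


def id_to_bytes(identifier):
    """Hypothetical helper: encode an int as 8 little-endian signed bytes."""
    return identifier.to_bytes(8, byteorder="little", signed=True)


as_bytes = id_to_bytes(identifier)
print(as_bytes)         # b'\x05\x00\x00\x00\x00\x00\x00\x00'
print(crc32(as_bytes))  # some integer between 0 and 2**32 - 1
```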

So far so good: we know why we use crc32(np.int64(identifier)) & 0xffffffff. Now what about < test_ratio*2**32?

Well, recall that crc32() outputs a pseudo-random 32-bit unsigned integer, i.e., an integer between 0 and 2**32-1. So there's roughly a 10% chance that this number will be lower than 10% of 2**32. In other words, if test_ratio=0.1, then about 10% of all possible identifiers will have a CRC32 that is smaller than test_ratio*2**32, and those are the ones that end up in the test set. For all the other identifiers the condition is False, so they go to the train set.
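If you want to check this yourself, here is a rough sketch. It is Python 3 only, so the mask is dropped, and it uses int.to_bytes() instead of np.int64 to avoid the NumPy dependency, which is an assumption of this example rather than the book's code. The name in_test_set is also just illustrative. The printed fraction should land roughly around test_ratio, though CRC32 is not a real random number generator, so don't expect it to be exact:

```python
from zlib import crc32


def in_test_set(identifier, test_ratio):
    """Return True if the identifier's CRC32 falls in the lowest test_ratio
    share of the 32-bit range (illustrative variant of test_set_check)."""
    # Assumption: 8 little-endian bytes stand in for np.int64(identifier).
    id_bytes = identifier.to_bytes(8, byteorder="little", signed=True)
    return crc32(id_bytes) < test_ratio * 2**32


ids = range(100_000)

# Fraction of identifiers assigned to the test set for test_ratio=0.1.
flags = [in_test_set(i, 0.1) for i in ids]
print(sum(flags) / len(flags))

# The split is stable: the same identifier always gets the same answer.
print(flags == [in_test_set(i, 0.1) for i in ids])  # True

# Edge cases: a ratio of 0 selects nothing, a ratio of 1 selects everything.
print(any(in_test_set(i, 0.0) for i in ids))  # False
print(all(in_test_set(i, 1.0) for i in ids))  # True
```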

I hope this helps, Aurélien