ageron / handson-ml

⛔️ DEPRECATED – See https://github.com/ageron/handson-ml3 instead.
Apache License 2.0

Couldn't understand the code in chapter 2 while separating the test set #567

Open sniray opened 4 years ago

sniray commented 4 years ago

Hi Mr. Aurélien Géron,

In your book, while separating the test set, you wrote:

    def test_set_check(identifier, test_ratio, hash):
        return hash(np.int64(identifier)).digest()[-1] < 256 * test_ratio

    def split_train_test_by_id(data, test_ratio, id_column, hash=hashlib.md5):
        ids = data[id_column]
        in_test_set = ids.apply(lambda id_: test_set_check(id_, test_ratio, hash))
        return data.loc[~in_test_set], data.loc[in_test_set]

Can you help me understand how the hash helps in separating the test set and avoids the problems mentioned earlier? In the second book you used crc32, and the code is as follows:

    from zlib import crc32

    def test_set_check(identifier, test_ratio):
        return crc32(np.int64(identifier)) & 0xffffffff < test_ratio * 2**32

How is this equivalent to the code above?

Rosseel commented 4 years ago

He covered it pretty extensively here: https://github.com/ageron/handson-ml/issues/71
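
For anyone landing here without following the link, here is a minimal sketch of the idea, not the book's exact code. Both versions map each identifier to a roughly uniform number and put the row in the test set when that number falls in the lowest `test_ratio` share of the range: the MD5 version looks at the last digest byte (0 to 255), the CRC32 version at the full 32-bit checksum (0 to 2**32 - 1). They will generally not pick the same rows, but both are deterministic, so a given id stays on the same side of the split after the dataset is refreshed or extended. The helper names `in_test_md5` and `in_test_crc32` are made up for this illustration, and `.tobytes()` is used here as an explicit way to hand the 8 bytes of the integer to the hash functions.

```python
import hashlib
from zlib import crc32

import numpy as np


def in_test_md5(identifier, test_ratio):
    # Hypothetical helper: the last MD5 digest byte is roughly uniform
    # over 0..255, so compare it with test_ratio * 256.
    raw = np.int64(identifier).tobytes()
    return hashlib.md5(raw).digest()[-1] < 256 * test_ratio


def in_test_crc32(identifier, test_ratio):
    # Hypothetical helper: CRC32 is roughly uniform over 0..2**32 - 1.
    # The mask only matters on Python 2, where crc32 could be negative.
    raw = np.int64(identifier).tobytes()
    return crc32(raw) & 0xffffffff < test_ratio * 2**32


test_ratio = 0.2
md5_flags = [in_test_md5(i, test_ratio) for i in range(10_000)]
crc_flags = [in_test_crc32(i, test_ratio) for i in range(10_000)]

# Both fractions should land near test_ratio.
md5_fraction = sum(md5_flags) / len(md5_flags)
crc_fraction = sum(crc_flags) / len(crc_flags)
print(f"md5: {md5_fraction:.3f}, crc32: {crc_fraction:.3f}")

# Appending new ids does not move the existing ones between sets.
extended_flags = [in_test_crc32(i, test_ratio) for i in range(12_000)]
stable = extended_flags[:10_000] == crc_flags
print("existing assignments unchanged:", stable)
```

This stability is what a plain random shuffle lacks: rerunning a shuffle on updated data gives a different split, so over time the model ends up seeing rows that were once in the test set.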