ageron / handson-ml

⛔️ DEPRECATED – See https://github.com/ageron/handson-ml3 instead.
Apache License 2.0

Creating a test set with a hash (Issue 71 was closed) #613

Open minertom opened 3 years ago

minertom commented 3 years ago

Hi, I did read issue #71 "Creating test set with hash" and I only had one question concerning your explanation.

During the hashing, only the last byte of the hash is used to decide whether an instance belongs to the test set. Yes, the whole hash is a unique value (unless a collision happens), but only the last byte (0-255) determines test set membership. So, are you saying that because the hashing algorithm produces a "uniform distribution", about 20% of the last-byte values will be less than 51 (roughly 20% of 256)?

Thank you, Tom

BTW, I purchased your book. Love it so far.

ageron commented 3 years ago

Hi @minertom,

Thanks for your question, and for your kind words (I'm very glad you enjoy my book!).

You guessed right: I'm assuming that the last byte of the hash follows a uniform distribution over all possible byte values, so about 20% will be lower than 51, since 20% is about 51/256. Note that 51/256=19.92%, while 52/256=20.31%, so there's no easy way to get precisely 20% with just one byte. If this granularity is not sufficient, you could convert the whole hash to a very large integer, and check whether it's smaller than 20% of the max possible value. I felt that the added complexity wasn't worth the effort, but as this code has confused quite a few readers, I'm not sure that was a good call.

Anyway, I hope it's clearer now?