ageron / handson-ml

⛔️ DEPRECATED – See https://github.com/ageron/handson-ml3 instead.

Chapter 2: Creating a test set, Stratify #689

Open hady42 opened 1 month ago

hady42 commented 1 month ago

I would appreciate clarification on a few points regarding Chapter 2.

  1. Why do we need to set a random seed? If its purpose is to produce the same train/test split across multiple runs, why do we need multiple runs in the first place?

  2. If using the hash function keeps the test set consistent, can new instances still be added to the test set whenever the hash of their id satisfies the condition crc32(np.int64(identifier)) & 0xffffffff < test_ratio * 2**32?

  3. What is the point of using stratified sampling in the first place?

  4. Why can't we just use the regular train_test_split function instead of StratifiedShuffleSplit?
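
To make point 2 concrete, here is a minimal sketch of the check as I understand it. It is not the book's exact code: the function name `is_in_test_set`, the made-up ids 0 to 1199, and the use of 8 little-endian bytes in place of `np.int64(identifier)` (so the snippet runs without NumPy) are all my own assumptions.

```python
from zlib import crc32


def is_in_test_set(identifier: int, test_ratio: float) -> bool:
    """Hypothetical helper: decide test-set membership from the id alone."""
    # Assumption: 8 little-endian signed bytes stand in for np.int64(identifier).
    raw = int(identifier).to_bytes(8, "little", signed=True)
    # The condition quoted in point 2: hash value below test_ratio * 2**32.
    return (crc32(raw) & 0xffffffff) < test_ratio * 2**32


# Made-up ids: 1000 original instances, then 200 instances added later.
test_old = {i for i in range(1000) if is_in_test_set(i, 0.2)}
test_all = {i for i in range(1200) if is_in_test_set(i, 0.2)}

# New ids that satisfy the condition and therefore land in the test set.
new_in_test = test_all - test_old

# Membership of the original ids does not change when the data grows.
old_unchanged = test_old == {i for i in test_all if i < 1000}
print(old_unchanged, len(test_old), len(new_in_test))
```

If I read this correctly, the membership of the old ids never changes, and the only additions to the test set are new ids whose hash happens to fall below the threshold. Is that the intended behaviour?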

Thank you for your kindness and your time.