gedeck / practical-statistics-for-data-scientists

Code repository for O'Reilly book
GNU General Public License v3.0

chi-square, resampling approach #60

Open frahimov opened 6 months ago

frahimov commented 6 months ago

Hi, I hope it is OK to ask this here. In chapter 3 I am stuck on this step: "3. Find the squared differences between the shuffled counts and expected counts then sum them." Do you mean calculating the chi-square statistic for each resample, which means computing the Pearson residuals first, or do you literally just sum the squared differences between the shuffled and expected counts? Thank you.

gedeck commented 6 months ago

Hello, thank you for your feedback. This is a good place for general questions. You could probably also have added this one to the errata page on the O'Reilly website.

We meant calculating the chi-square statistic in step 3, and that is what the code does. That said, you can also just use the sum of squared differences for each resample in step 3, provided that you then compare it to the observed sum of squared differences in step 5. The chi-square statistic was developed before the computer age, when it was convenient to have a standardized test statistic that could be compared to standard tables in textbooks. The chi-square statistic and the sum of squared differences are two different (but related) ways to measure the difference between observed and expected counts.
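To make the comparison concrete, here is a minimal sketch of the resampling test run once with each statistic. It is not the repository's code: the function names (`chi2_stat`, `sum_sq_diff`, `resampling_p_value`), the click counts, the group size of 1000, and the number of resamples are all assumptions chosen for illustration.

```python
import numpy as np


def chi2_stat(observed, expected):
    """Chi-square statistic: sum of squared Pearson residuals, (O - E)^2 / E."""
    observed = np.asarray(observed, dtype=float)
    expected = np.asarray(expected, dtype=float)
    return float(np.sum((observed - expected) ** 2 / expected))


def sum_sq_diff(observed, expected):
    """Plain sum of squared differences, (O - E)^2, with no scaling by E."""
    observed = np.asarray(observed, dtype=float)
    expected = np.asarray(expected, dtype=float)
    return float(np.sum((observed - expected) ** 2))


def resampling_p_value(clicks, n_per_group, statistic, n_resamples=2000, seed=0):
    """Resampling test for equal click rates across equally sized groups."""
    rng = np.random.default_rng(seed)
    clicks = np.asarray(clicks)
    k = len(clicks)
    total_clicks = int(clicks.sum())

    def table(click_counts):
        # 2 x k table: first row clicks, second row no-clicks.
        return np.vstack([click_counts, n_per_group - click_counts])

    # Under the null hypothesis every group expects the same share of clicks.
    expected = table(np.full(k, total_clicks / k))
    observed_stat = statistic(table(clicks), expected)

    # The "box": one entry per impression, 1 = click, 0 = no click.
    box = np.zeros(k * n_per_group, dtype=int)
    box[:total_clicks] = 1

    hits = 0
    for _ in range(n_resamples):
        # Shuffle the box and deal it out into k groups of equal size.
        shuffled_clicks = rng.permutation(box).reshape(k, n_per_group).sum(axis=1)
        # Step 3 from the thread: compute the statistic for this resample.
        resampled_stat = statistic(table(shuffled_clicks), expected)
        # Step 5: compare against the observed value of the SAME statistic.
        # The small tolerance guards against floating-point noise on ties.
        if resampled_stat >= observed_stat - 1e-9:
            hits += 1
    return observed_stat, hits / n_resamples


# Illustrative counts (an assumption): clicks on three headlines, 1000 views each.
clicks = [14, 8, 12]
for statistic in (chi2_stat, sum_sq_diff):
    observed_stat, p_value = resampling_p_value(clicks, 1000, statistic)
    print(f"{statistic.__name__}: observed = {observed_stat:.3f}, p-value = {p_value:.3f}")
```

In this balanced layout (equal group sizes, same expected counts in every group) the two statistics differ only by a constant factor, so with the same seed they rank the resamples identically and give the same p-value. With unequal group sizes the expected counts differ between cells and the two statistics need not agree, which is one reason to prefer the standardized chi-square form.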

frahimov commented 6 months ago

Hi. Thank you for your response. Actually, I was referring to your book and did not know that there was an errata page on the O'Reilly website. I will post there if I have more questions as I go through the chapters.