Count data: From basic probability theory to regression models

zeileis commented 2 years ago

Hi Alex @alexpghayes,

Moritz @mnlang asked me whether I knew a good real data set that could be utilized to illustrate going from basic probability theory to probabilistic regression models (as done with artificial data in https://github.com/alexpghayes/distributions3/pull/71#issuecomment-1032581188). Hence, I prepared a data set with goals from all matches in the last FIFA World Cup 2018.

In this pull request I included the data set itself and the corresponding documentation, including a worked example. The basic idea is that the number of goals each team scores in a match is fitted well by a Poisson distribution. In a first step, one can simply assume that the mean number of goals is constant across all matches, leading to a simple Poisson fit. In a second step, a Poisson regression model adds the ability difference of the two teams involved as a regressor. Based on this model, the expected goal probabilities for individual matches can be computed.
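For readers who want to see the two steps in code, here is a minimal sketch in base R. It runs on simulated data, not on the World Cup data from the PR, and the variable names (`goals`, `difference`), the sample size, and the coefficients are assumptions made up for illustration. Treating the two teams' goals as independent when computing match outcomes is also a simplifying assumption of this sketch.

```r
# Simulated stand-in for the match data (NOT the World Cup data from the PR).
set.seed(1)
n <- 128                                   # hypothetical number of team-match rows
difference <- rnorm(n, sd = 0.5)           # hypothetical ability difference
goals <- rpois(n, lambda = exp(0.2 + 0.4 * difference))

# Step 1: constant mean. The Poisson MLE of lambda is the sample mean.
lambda_hat <- mean(goals)
k <- 0:max(goals)
observed <- prop.table(table(factor(goals, levels = k)))
expected <- dpois(k, lambda = lambda_hat)
round(cbind(observed, expected), 3)        # observed vs. fitted frequencies

# Step 2: Poisson regression with the ability difference as regressor.
m <- glm(goals ~ difference, family = poisson)

# Expected goals for the two teams in one hypothetical match (+0.5 vs. -0.5).
mu <- predict(m, newdata = data.frame(difference = c(0.5, -0.5)),
              type = "response")

# Joint score probabilities, assuming the two teams score independently.
# Truncated at 10 goals per team; the remaining tail mass is negligible.
score <- outer(dpois(0:10, mu[1]), dpois(0:10, mu[2]))
result <- c(
  win  = sum(score[lower.tri(score)]),     # first team scores more
  draw = sum(diag(score)),
  lose = sum(score[upper.tri(score)])      # second team scores more
)
round(result, 3)
```

The point of the second step is that the model yields a full predictive distribution per team and match, so win/draw/lose probabilities follow directly from the fitted means rather than from the means alone.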

I like this example very much because a relatively simple model fits the data well - and it is also very clear that probabilistic predictions are of interest here and not just means. I would also be willing to write a vignette based on this case study, geared towards beginners that want to learn how to use the package. But I thought that before writing a full vignette, I would send you this PR to see whether you think this would be a useful addition to the package.

Thanks in advance for your consideration & best wishes, Achim

alexpghayes commented 2 years ago

This looks great, and a vignette sounds fantastic!

Looks like some of the examples break with older versions of R?

zeileis commented 2 years ago

Thanks for the quick feedback and sorry for not thinking about introducing a dependency on R >= 4.0.0 by using proportions() rather than the old prop.table(), changed now.
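As a small illustration of the compatibility point, the sketch below computes relative frequencies with `prop.table()`, which is also available in older R versions, and only calls `proportions()` if the running R session provides it. The `goals` vector is made-up example data, not taken from the PR.

```r
goals <- c(0, 1, 1, 2, 0, 3, 1, 2)   # made-up goal counts for illustration
tab <- table(goals)

# prop.table() works in old and new R versions alike.
rel_freq <- prop.table(tab)
rel_freq

# proportions() is only present in newer R versions, so check before calling it.
if (exists("proportions", mode = "function")) {
  stopifnot(isTRUE(all.equal(proportions(tab), rel_freq)))
}
```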

Finally regarding the vignette: Do you have any guidance on style/format and/or content? I would use the same YAML configuration as in the other vignettes and try to keep it very much hands-on. Any other guidelines or do's and don'ts?

alexpghayes commented 2 years ago

> Finally regarding the vignette: Do you have any guidance on style/format and/or content? I would use the same YAML configuration as in the other vignettes and try to keep it very much hands-on. Any other guidelines or do's and don'ts?

To be frank, the style of the vignettes is "reduce the number of questions I get in stat 101 office hours to an absolute minimum," and to accomplish this by walking through calculations and the logic behind them in as much detail as possible. The goal being partially to justify procedures but even more so to provide a template that can be easily followed for similar computations.

I'm not sure this is terribly helpful guidance, but hopefully it's a start. I am also happy to revise/suggest revisions once we have a vignette draft.

alexpghayes commented 2 years ago

Looks like we need to re-document to clear out the old example code from .Rd files still.

zeileis commented 2 years ago

Of course...sorry for missing this. I'm not a regular roxygen user.

alexpghayes commented 2 years ago

No worries! And thanks!

alexpghayes / distributions3

Count data: From basic probability theory to regression models #73