Data analysis project for DSCI 522 (Data Science workflows); a course in the Master of Data Science program at the University of British Columbia.
For this project we will attempt to make a hypothesis test to answer the question: is the proportion of athletes younger than 25 that win a medal greater than the proportion of athletes of age 25 or older that win a medal? The age threshold of 25 years old was chosen as this is the median age of the athletes in the data set. We chose this question and this topic since it is a pop culture subject for which we think strong domain is not really needed. It is important to note that the idea for this project is to be able to wrangle the data and test our hypothesis with the tools and techniques that we know how to use at the moment.
The data used in this project is a public domain data set of the olympics with information of athletes like nationality, sport/event, year, age, among others, extracted from the publicly available tidytuesday data sets. Each row in the data set represents information of an athlete competing in a certain event, including information of whether the athlete won a medal or not. The testing results and analysis will be presented in the final report.
To answer the question mentioned, we will perform a hypothesis test for the difference in proportions. First, we will perform an EDA (Exploratory Data Analysis) to get a general idea of how the data looks like and we will show this work in the EDA document for this project.
In the EDA, we will explore the age distribution of athletes, the relationship between athletes' age and medals and correlations of athletes's age and other features (e.g. height and weight).
Given that we are going to perform a hypothesis, we defined our null and alternative hypothesis as follows:
We will use the simulation/permutation technique, and our test statistic will be the difference in proportions. We will check both the p-value and the place where the observed test statistic falls on the null distribution to determine if we can reject our null hypothesis or not. We will use a significance level of alpha = 0.05, and this will be a one-sided test.
The EDA performed and reports for the data set can be found in the src folder in this repo.
The rendered final report is located here
Please install Docker if you have not already done so. Then clone this repository and run the following command at the command line/terminal from the root directory of this project:
docker run --rm --platform linux/amd64 -v $(PWD):/home/data_analysis squisty/olympic_medal_htest@sha256:693464b47beca5240bd52df7040030bfa1be4a66272900cb95b2636db5d27644 make -C /home/data_analysis all
The local repository can be reset to the initial cloned state by running the following from the root directory of this project:
docker run --rm --platform linux/amd64 -v $(PWD):/home/data_analysis squisty/olympic_medal_htest@sha256:693464b47beca5240bd52df7040030bfa1be4a66272900cb95b2636db5d27644 make -C /home/data_analysis clean
Please note that the Dockerfile
pins the versions of all the software packages listed under Dependencies except the R packages, which are instead based on the rocker/tidyverse:4.1
Docker image the R packages of which are the same as our dependencies. Since, however, we are not pinning the versions of those packages, the installed versions of the packages in the Docker image may change in the future. We have chosen not to pin to exact versions of the R packages to avoid bringing in some unnecessary packages and over-complicate this project.
The analysis can be reproduced by following these steps:
Makefile
file with the following command:make all
The local repository can be reset to the initial cloned state by running:
make clean
Please see below for our file dependencies:
The Python environment required to run every python script can be found here, or can be created using:
- Python 3.9.6 and packages:
- docopt=0.6.*
- pandas=1.3.*
- altair=4.1.*
- altair_viewer=0.4.*
- altair_saver=0.5.*
- R 4.1.* and libraries:
- readr=2.1.*
- tidyverse=1.3.*
- here=1.0.*
- knitr=1.36
- pracma=2.3.*
- broom=0.7.*
- infer=1.0.*
- docopt=0.7.*
- kableExtra=1.3.*
- pandoc=2.16.*
- GNU make 3.81
The Olympic Medal H-test project is available under the MIT license. See LICENSE for more information.
Bray, Andrew, Chester Ismay, Evgeni Chasnovski, Simon Couch, Ben Baumer, and Mine Cetinkaya-Rundel. 2021. Infer: Tidy Statistical Inference. https://CRAN.R-project.org/package=infer.
de Jonge, Edwin. 2020. Docopt: Command-Line Interface Specification Language. https://CRAN.R-project.org/package=docopt.
McKinney, Wes. 2010. “Data Structures for Statistical Computing in Python.” In Proceedings of the 9th Python in Science Conference, edited by Stéfan van der Walt and Jarrod Millman, 51–56.
Python Core Team. 2019a. Python: A dynamic, open source programming language. Python Software Foundation. https://www.python.org/.
———. 2019b. Python: A dynamic, open source programming language. Python Software Foundation. https://www.python.org/.
R Core Team. 2021. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
Robinson, David, Alex Hayes, and Simon Couch. 2021. Broom: Convert Statistical Objects into Tidy Tibbles. https://CRAN.R-project.org/package=broom.
VanderPlas, Jacob, Brian E Granger, Jeffrey Heer, Dominik Moritz, Kanit Wongsuphasawat, Arvind Satyanarayan, Eitan Lees, Ilia Timofeev, Ben Welsh, and Scott Sievert. 2018. “Altair: Interactive Statistical Visualizations for Python.” Journal of Open Source Software 3 (32): 1057.
Wickham, Hadley, Mara Averick, Jennifer Bryan, Winston Chang, Lucy D’Agostino McGowan, Romain François, Garrett Grolemund, et al. 2019. “Welcome to the tidyverse.” Journal of Open Source Software 4 (43): 1686. https://doi.org/10.21105/joss.01686.
Xie, Yihui. 2021. Knitr: A General-Purpose Package for Dynamic Report Generation in r. https://yihui.org/knitr/.
Zhu, Hao. 2021. kableExtra: Construct Complex Table with ’Kable’ and Pipe Syntax. https://CRAN.R-project.org/package=kableExtra.
Singh, Dr Preet Deep. 2021. “Olympic Medals: Matter of Nerves.” Available at SSRN 3901321. https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3901321
Elmenshawy, Ahmed R, Daniel R Machin, and Hirofumi Tanaka. 2015. “A Rise in Peak Performance Age in Female Athletes.” Age 37 (3): 1–8. https://link.springer.com/article/10.1007/s11357-015-9795-8