Authors: Joshua Sia, Morgan Rosenberg, Sufang Tan, Yinan Guo [Group 25]
Data analysis project for DSCI 522 (Data Science Workflows), a course in the Master of Data Science program at the University of British Columbia.
COVID-19 is a serious pandemic that has introduced a wide variety of challenges since 2019. By analysing the association of certain socioeconomic factors with COVID-19 prevalence, we hope to shed some light onto the societal features that may be associated with a high number of COVID-19 cases. Identifying the socioeconomic factors could also help policymakers and leaders make more informed decisions in combatting COVID-19.
Here, we build a multiple linear regression model to quantify the association between socioeconomic factors and COVID-19 prevalence (measured by cases per 100,000 population) among US counties. Factors such as percentage of smokers, income ratio, population density, and percent unemployed are explored. Our final regression model suggests that the percentage of smokers, teenage birth rates, unemployment rate, and several interaction terms are significantly associated with COVID-19 prevalence at the 0.05 level. However, the original data set contained over 200 features and only an arbitrarily chosen subset of them was used, so there is still room to explore other socioeconomic features that may be significantly associated with COVID-19 prevalence.
The final report can be read as a Markdown file here, or as an HTML file here.
To replicate the analysis, please have a kaggle.json file containing your Kaggle credentials at the project root. To obtain your Kaggle credentials, follow the instructions on Kaggle.
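As a quick sanity check before running the analysis, the credentials requirement above could be verified with a small shell function. This is only a sketch: the function name check_kaggle_credentials is hypothetical and not part of this project, and it assumes the standard Kaggle API token layout, a JSON file with "username" and "key" fields.

```shell
#!/bin/sh
# Hypothetical pre-flight check (not part of this repository).
# Succeeds only if kaggle.json exists in the given directory
# (default: current directory) and mentions both expected fields.
check_kaggle_credentials() {
    root="${1:-.}"
    file="$root/kaggle.json"
    if [ ! -f "$file" ]; then
        echo "Missing $file" >&2
        return 1
    fi
    # Assumption: a Kaggle API token is JSON with "username" and "key" fields.
    if grep -q '"username"' "$file" && grep -q '"key"' "$file"; then
        return 0
    fi
    echo "$file does not look like a Kaggle API token" >&2
    return 1
}
```

Calling check_kaggle_credentials from the project root before starting the analysis reports a missing or malformed file early, rather than partway through the run.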
There are two suggested ways to run this analysis:
Note: the instructions in this section also depend on being run in a Unix shell (e.g., terminal or Git Bash).
To replicate the analysis, install Docker. It may also be necessary to allocate more memory to the Docker container. To do this, open the Docker application, enter Settings, click on the Resources tab, and increase the allocated Memory using the slider. Please also refer to the Docker Desktop for Windows user manual and the Docker Desktop for Mac user manual for more information.
To pull the Docker image from Docker Hub, run the following command:
docker pull alexyinanguo/us_social_determinants_of_health_by_county
Clone this GitHub repository and run the following command at the command line/terminal from the root directory of this project (Mac M1 users should add the flag and value --platform linux/amd64; Windows users should use // in the path):
docker run --rm -v /$(pwd):/home/rstudio/determinants_of_health alexyinanguo/us_social_determinants_of_health_by_county make -C /home/rstudio/determinants_of_health all
To reset the project to a clean state with no intermediate files, run the following command at the command line/terminal from the root directory of this project (Mac M1 users should add the flag and value --platform linux/amd64; Windows users should use // in the path):
docker run --rm -v /$(pwd):/home/rstudio/determinants_of_health alexyinanguo/us_social_determinants_of_health_by_county make -C /home/rstudio/determinants_of_health clean
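The two Docker commands above differ only in the make target and, on a Mac M1, in the extra --platform flag. A minimal sketch of a wrapper that handles both is shown below. The function names docker_platform_args and run_analysis are hypothetical, and the sketch assumes that uname -m reports arm64 on a Mac M1. It does not handle the Windows path adjustment.

```shell
#!/bin/sh
# Hypothetical helper (not part of this repository): print the extra docker
# flag needed on a Mac M1. Takes the machine architecture as $1 and
# defaults to the output of `uname -m`.
docker_platform_args() {
    arch="${1:-$(uname -m)}"
    if [ "$arch" = "arm64" ]; then
        printf '%s\n' "--platform linux/amd64"
    fi
}

# Hypothetical wrapper around the documented command.
# $1 is the make target: "all" (default) or "clean".
run_analysis() {
    target="${1:-all}"
    # The command substitution is left unquoted on purpose so that the flag
    # and its value are passed to docker as two separate words.
    docker run --rm $(docker_platform_args) \
        -v /$(pwd):/home/rstudio/determinants_of_health \
        alexyinanguo/us_social_determinants_of_health_by_county \
        make -C /home/rstudio/determinants_of_health "$target"
}
```

With these definitions loaded, run_analysis all and run_analysis clean, called from the project root, would stand in for the two commands above.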
To replicate the analysis, clone this GitHub repository, install the dependencies listed below, and run the following command at the command line/terminal from the root directory of this project:
make all
To reset the project to a clean state, with no intermediate or results files, run the following command at the command line/terminal from the root directory of this project:
make clean
The dependency diagram of the Makefile is shown below.
The US social determinants of health by county data set is licensed under CC0 Public Domain.
Davis, J. (2020, December 5). US social determinants of health by county. Kaggle. Retrieved December 2, 2021, from https://www.kaggle.com/johnjdavisiv/us-counties-covid19-weather-sociohealth-data.