briatte / dsr

Introduction to Data Science with R (Sciences Po, Paris, 2023)
https://f.briatte.org/teaching/syllabus-dsr.pdf

> Introduction to Data Science with R

François Briatte
Spring 2023. Work in progress.

An introduction to data science with R, RStudio, and the {tidyverse} packages, aimed at social scientists with little to no training in statistical computing and related topics.

> Syllabus
> Readings (handbooks, videos, tutorials and more)

This folder contains the code, data and documentation of the examples used either during the practice sessions in class, or distributed as homework exercises. Slides and exercise solutions are not included.

Outline

  1. Software
  2. Workflow
  3. Data
  4. Visualization
  5. Description
  6. Association
  7. Correlation
  8. Regression
  9. Nonlinearity
  10. Surveys
  11. Classification
  12. Extensions

Bonus sections:

Part 1. Basics

Software setup, first steps with coding, handling data, and plotting things.

1. Software

> Readings
> Exercise 1: Generative art

2. Workflow

> Readings
> Demo 1: Cholera deaths in London, 1854 (John Snow)
> Demo 2: Industrial disputes and left-wing seat shares (CWS 2020)
> Exercise 2: Weird R syntax

3. Data

Data wrangling, mostly with the {dplyr} package.

> Readings
> Demo 1: Covid-19 and global income inequality (Deaton)
> Demo 2: Visualizing the 'EU mood' (Guinaudeau and Schnatterer)
> Exercise 3: Satisfaction with democracy in Hungary and Poland (Eurobarometer)
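
To give a rough idea of what data wrangling with {dplyr} looks like, here is a minimal sketch. It uses the built-in `mtcars` data as a stand-in (an assumption made so that the code runs anywhere); it is not one of the course datasets, and the object name `cars_by_cyl` is made up for the example.

```r
library(dplyr)

# mtcars ships with R; it only stands in for the course datasets
cars_by_cyl <- mtcars %>%
  filter(!is.na(mpg)) %>%                 # keep rows with a known mpg
  mutate(kpl = mpg * 0.425144) %>%        # miles per US gallon to km per litre
  group_by(cyl) %>%                       # one group per number of cylinders
  summarise(
    n = n(),                              # cars per group
    mean_kpl = mean(kpl),                 # average fuel economy per group
    .groups = "drop"
  ) %>%
  arrange(desc(mean_kpl))                 # most economical group first

print(cars_by_cyl)
```

The same filter, mutate, group, summarise sequence applies to most of the demos, whatever the data.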

4. Visualization

Plots, mostly with the {ggplot2} package.

> Readings
> Demo: Economic growth and public debt (Reinhart and Rogoff)
> Bonus 1: Mapping life expectancy worldwide
> Bonus 2: Anscombe's quartet
> Exercise 4: Life expectancy and GDP per capita (Preston curve)
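
As an illustration of the kind of plot built in this session, the sketch below draws Anscombe's quartet with {ggplot2}, using the `anscombe` data frame that ships with R. The reshaping step and the object names (`quartet`, `cors`, `p`) are choices made for this example, not necessarily how the bonus script does it.

```r
library(ggplot2)

# Reshape the built-in anscombe data (columns x1..x4, y1..y4) to long format
quartet <- data.frame(
  set = rep(paste("Set", 1:4), each = nrow(anscombe)),
  x   = unlist(anscombe[, paste0("x", 1:4)], use.names = FALSE),
  y   = unlist(anscombe[, paste0("y", 1:4)], use.names = FALSE)
)

# The four sets have nearly identical correlations...
cors <- sapply(split(quartet, quartet$set), function(d) cor(d$x, d$y))
print(round(cors, 3))

# ... but look very different once plotted
p <- ggplot(quartet, aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  facet_wrap(~ set) +
  labs(title = "Anscombe's quartet")

print(p)
```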

Part 2. Essentials

Descriptive and inferential statistics, the frequentist way (no time for Bayesian ones, I'm afraid). This section will briefly mention some more advanced topics related to regression models, statistical estimation and machine learning.

5. Description

Summary statistics and distributions. Also covering sampling, and possibly bootstrap resampling if time permits (which of course won't happen).

> Readings
> Demo: Colonialism, democracy, life expectancy and wealth, Part 1
> Exercise 5: Trust in Islamist parties (graded homework)
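
The ideas listed for this session (summary statistics, sampling, bootstrap resampling) can be sketched in a few lines of base R. The example uses `mtcars$mpg` as a stand-in variable, and the number of bootstrap replicates (2,000) is an arbitrary choice.

```r
set.seed(2023)  # make the random draws reproducible

x <- mtcars$mpg  # built-in data, standing in for a course variable

# Summary statistics
stats <- c(mean = mean(x), median = median(x), sd = sd(x), iqr = IQR(x))
print(round(stats, 2))

# Simple random sample of 10 observations, without replacement
one_sample <- sample(x, size = 10)

# Bootstrap: resample with replacement, recompute the mean each time
boot_means <- replicate(2000, mean(sample(x, replace = TRUE)))

# Percentile bootstrap 95% interval for the mean
ci <- quantile(boot_means, probs = c(0.025, 0.975))
print(ci)
```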

6. Association

Statistical tests to compare means and proportions.

> Readings
> Demo: Colonialism, democracy, life expectancy and wealth, Part 2
> No exercise this week -- catch up on all previously distributed material
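
A minimal sketch of the two kinds of tests covered here, again on the built-in `mtcars` data rather than on the demo data. The choice of variables is purely illustrative.

```r
# Compare means: fuel economy of automatic (am = 0) vs. manual (am = 1) cars
# t.test() uses the Welch (unequal variances) version by default
tt <- t.test(mpg ~ am, data = mtcars)
print(tt)

# Compare proportions: share of V-shaped engines (vs = 0) by transmission type
# prop.test() treats the first column of the 2 x 2 table as "successes"
tab <- table(transmission = mtcars$am, engine = mtcars$vs)
pt <- prop.test(tab)
print(pt)
```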

7. Correlation

Linear and nonlinear correlation, as an introduction to linear and nonlinear models, with some basic philosophy of statistical science.

> Readings
> Demo: Social democratic capitalism (Kenworthy)
> Exercise 7: US Republican vote shares and life expectancy (Case and Deaton)
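
To make the linear vs. nonlinear distinction concrete, here is a small sketch comparing Pearson's coefficient (linear association) with Spearman's rank coefficient (monotonic association). The variables come from the built-in `mtcars` data and are only an example.

```r
# Pearson: strength of the linear association
pearson <- cor.test(mtcars$mpg, mtcars$wt, method = "pearson")

# Spearman: rank-based, captures any monotonic association
# exact = FALSE avoids a warning about ties in the data
spearman <- cor.test(mtcars$mpg, mtcars$wt, method = "spearman", exact = FALSE)

print(c(pearson = unname(pearson$estimate),
        spearman = unname(spearman$estimate)))
```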

8. Regression

Linear regression, the full package: least squares, dummies, interactions, diagnostics, marginal effects. All in one session, if things go well, but this usually takes half of any introductory statistics course.

> Readings
> Demo: U.S. presidential election outcomes and income growth (Bartels)
> Exercise 8: Growth forecasts and fiscal consolidation (IMF/Giles)
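
A compact sketch of several items from that list (least squares, a dummy variable, an interaction, and a first look at residuals), using base R and the built-in `mtcars` data. The model specification is illustrative and is not taken from the demo.

```r
# Least squares with a dummy (am) and its interaction with weight (wt)
fit <- lm(mpg ~ wt * factor(am), data = mtcars)

# Coefficients: intercept, wt, the am dummy, and the wt-by-am interaction
print(summary(fit))

# Basic diagnostics: residuals should have mean zero (up to rounding)
res <- residuals(fit)
print(summary(res))

# Predicted mpg for a 3,000 lbs car, automatic vs. manual
new_cars <- data.frame(wt = 3, am = c(0, 1))
pred <- predict(fit, newdata = new_cars)
print(pred)
```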

9. Nonlinearity

Focusing exclusively on logistic regression, with no time to say more about other generalized linear models.

> Readings
> Demo: Opposition to abortion in Canada (CES 2021)
> Exercise 9: Predicting Covid-19 lockdowns (graded homework)
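
For reference, a logistic regression in base R looks roughly like the sketch below. The outcome (manual transmission, `am`) and the predictor (weight, `wt`) come from the built-in `mtcars` data and have nothing to do with the demo or the homework.

```r
# Logistic regression: probability of a manual transmission given weight
fit <- glm(am ~ wt, data = mtcars, family = binomial)
print(summary(fit))

# Coefficients are on the log-odds scale; exponentiate to get odds ratios
odds_ratios <- exp(coef(fit))
print(odds_ratios)

# Predicted probabilities for cars weighing 2,000, 3,000 and 4,000 lbs
pred <- predict(fit, newdata = data.frame(wt = c(2, 3, 4)), type = "response")
print(round(pred, 3))
```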

10. Surveys

Surveys, and how to handle survey weights, with the {survey} and {srvyr} packages. Not yet online, work in progress.

> Readings
> Demo: ..
> Exercise 10: Economic insecurity and religious reassurance (ESS)
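
Until this session is online, here is a toy illustration of why survey weights matter. It uses base R only and made-up numbers (the data frame `svy` and its columns are hypothetical). It does not use {survey} or {srvyr}, which additionally take the sampling design into account when computing standard errors.

```r
# Hypothetical respondents: a 0-10 trust score and a survey weight
svy <- data.frame(
  trust  = c(2, 4, 5, 7, 8, 9),
  weight = c(3, 3, 2, 1, 0.5, 0.5)
)

# Unweighted mean: every respondent counts the same
unweighted <- mean(svy$trust)

# Weighted mean: respondents count in proportion to their weight
weighted <- weighted.mean(svy$trust, svy$weight)

print(c(unweighted = unweighted, weighted = weighted))
```

Here the low-trust respondents carry the largest weights, so the weighted mean is lower than the unweighted one.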

Part 3. Extras

Statistical learning and machine learning could go here, as well as APIs and Web scraping, networks, big data and more things like JavaScript visualization libraries, but there are only two extra sessions.

11. Classification

Dimensionality reduction, principal components, clustering and partitioning, using {factoextra} and related packages to visualise the results.

> Readings
> Demo 1: Protein consumption in European countries, 1973
> Demo 2: Feelings towards politicians in France (CNEP 2017)
> No exercise this week -- catch up on all previously distributed material
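
A minimal base R sketch of two of the techniques named above, principal components and k-means clustering. It runs on the built-in `USArrests` data as a stand-in for the demo data, and the number of clusters (3) is an arbitrary choice. The {factoextra} plotting helpers are left out to keep the example dependency-free.

```r
# Principal components on standardised variables
pca <- prcomp(USArrests, scale. = TRUE)

# Share of variance explained by each component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(var_explained, 3))

# k-means on the same standardised data, with several random starts
set.seed(2023)
km <- kmeans(scale(USArrests), centers = 3, nstart = 25)

# Cluster sizes, and the first rows of component scores with cluster labels
print(table(km$cluster))
print(head(data.frame(pca$x[, 1:2], cluster = km$cluster)))
```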

12. Extensions

Students expressed an interest in maps and text, so let's cover both, before closing with mentions of other useful things.

> Readings
> Demo 1: Mapping support for fossil fuel taxation (ESS)
> Demo 2: Mining into Greta Thunberg's speeches
> Exercise 12: data science skills


Dependencies

The course runs on R 4.x and depends on the following packages:

install.packages("remotes")

# required for multiple sessions
pkgs <- c("broom", "countrycode", "e1071", "ggmosaic", "ggeffects", "ggrepel", 
          "moments", "performance", "sf", "texreg", "tidyverse", "WDI")
remotes::install_cran(pkgs)

# required for Session 11 only
s11 <- c("car", "corrr", "factoextra", "ggcorrplot", "ggfortify", "plotly")
remotes::install_cran(s11)

# required for Session 12 only
s12_maps <- c("gstat", "stars")
s12_text <- c("igraph", "ggraph", "pdftools", "tidytext")
remotes::install_cran(c(s12_maps, s12_text))

# optional (used to prepare the course datasets)
xtra <- c("rvest")
remotes::install_cran(xtra)
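
To check a setup before class, something like the following sketch might help. It only reports what is missing and installs nothing. The package names are copied from the lists above, and the object names (`needed`, `not_installed`, `r_ok`) are made up for this example.

```r
# Package names copied from the installation code above
needed <- c(
  "broom", "countrycode", "e1071", "ggmosaic", "ggeffects", "ggrepel",
  "moments", "performance", "sf", "texreg", "tidyverse", "WDI",
  "car", "corrr", "factoextra", "ggcorrplot", "ggfortify", "plotly",
  "gstat", "stars", "igraph", "ggraph", "pdftools", "tidytext"
)

# Is this R 4.x or later?
r_ok <- getRversion() >= "4.0.0"
if (!r_ok) {
  message("R ", getRversion(), " is older than 4.0, please upgrade.")
}

# Which of the course packages are not installed yet?
not_installed <- setdiff(needed, rownames(installed.packages()))

if (length(not_installed) > 0) {
  message("Missing packages: ", paste(not_installed, collapse = ", "))
} else {
  message("All course packages are installed.")
}
```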

Credits

The last time I had a chance to build such a course was ten years ago, with Ivaylo D. Petev. Some of the inspiration for this course dates back to that time.

In the meantime, I have taught a few other quantitative methods courses, including some tutorials and guest lectures for Jan Rovny's own courses. Some of the material for this course comes from those other ones.

Some thanks go to Kim Antunez, who will soon be teaching her own version of this course, and who suggested some of the readings that made it to my own list.

Some thanks also go to Joël Gombin and Timothée Gidoin, who inspired and helped with a first draft of this course, six years before it actually ran for the first time.

Last, this course and all the other ones mentioned above took place at Sciences Po in Paris, France, where some more inspiration has come from Emiliano Grossman and many others.

The ASCII art in some scripts is by Patrick Gillespie.

Elsewhere

Most of this course is available on GitHub, where a wiki page lists other similar courses. I would love it if the present course were as good as those listed there, but cannot guarantee it.