briatte / dsr

Introduction to Data Science with R (Sciences Po, Paris, 2023)
https://f.briatte.org/teaching/syllabus-dsr.pdf

Surveys - ESS #33

Closed briatte closed 1 year ago

briatte commented 1 year ago

This one is complex enough to be its own issue…

Weighting guide

https://www.europeansocialsurvey.org/methodology/ess_methodology/data_processing_archiving/weighting.html
https://www.europeansocialsurvey.org/docs/methodology/ESS_weighting_data_1_1.pdf

From the weighting guide, v1.1 (2020), page 7:

From round 9 onwards, all the necessary sample design indicators and weights are already included in the integrated (second release) data file, but if you are working with data from earlier rounds you will first need to merge the sample design indicators on to the main data file. For rounds 7 and 8, the sample design indicators are in the integrated SDDF (sample design data file), so you need to merge this file with the main integrated (questionnaire data) file. For rounds 1 to 6, sample design indicators are stored in a separate file for each country (and files are missing for some countries in some rounds), so you would need to merge several files. Furthermore, for these rounds the indicators psu and stratify have not been recoded in a manner suitable for cross-country analysis, so you will need to do this if you are analysing data from more than one country. Follow the guidance in section 2 of Kaminska & Lynn (2017) and ensure that each value is exclusive to one country.
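A minimal sketch of the merge described above, in base R and on made-up data. Only the column names (`cntry`, `idno`, `psu`, `stratum`, `prob`, `pspwght`, `pweight`) follow the ESS files. The data frames `main` and `sddf` and the new columns `psu_unique` and `stratum_unique` are hypothetical names used for illustration, and pasting the country code onto each value is just one way to make values exclusive to one country.

```r
# Toy stand-ins for the main (questionnaire) file and the SDDF;
# all values are invented, only the column names follow the ESS files.
main <- data.frame(
  cntry   = c("BE", "BE", "FR", "FR"),
  idno    = c(1, 2, 1, 2),
  pspwght = c(0.9, 1.1, 1.2, 0.8),
  pweight = c(0.5, 0.5, 2.0, 2.0)
)
sddf <- data.frame(
  cntry   = c("BE", "BE", "FR", "FR"),
  idno    = c(1, 2, 1, 2),
  psu     = c(10, 11, 10, 12),
  stratum = c(1, 1, 1, 2),
  prob    = c(0.002, 0.002, 0.001, 0.001)
)

# idno is reused across countries, so merge on country AND respondent id
merged <- merge(main, sddf, by = c("cntry", "idno"), all.x = TRUE)

# Make PSU and stratum codes exclusive to one country by prefixing the
# country code (hypothetical column names, one possible recoding)
merged$psu_unique     <- paste(merged$cntry, merged$psu, sep = "_")
merged$stratum_unique <- paste(merged$cntry, merged$stratum, sep = "_")

merged
```

In the toy data, PSU code 10 appears in both countries, so the recoded `psu_unique` column has one more distinct value than `psu`.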

The guide asks for the creation of anweight ('analytical weights') from the following variables:

# R, data.table syntax
data1[, anweight := pspwght * pweight * 10e3]
# Stata
# gen anweight=pspwght*pweight

Once anweight exists, the weighting guide instructs the following design:

# R
svydesign(ids = ~psu, strata = ~stratum, weights = ~anweight, data = data1)
# Stata
# svyset psu [pweight=anweight], strata(stratum)
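To see the two steps together, here is a small sketch on invented data, assuming the `survey` package is installed. The data frame `toy` and the outcome `y` are hypothetical. The anweight formula is copied from the R line above, and each toy stratum holds two PSUs so that variance estimation has something to work with.

```r
library(survey)

# Invented data: 2 strata x 2 PSUs x 2 respondents
toy <- data.frame(
  stratum = rep(c("s1", "s2"), each = 4),
  psu     = rep(c("a", "b", "c", "d"), each = 2),
  pspwght = c(0.8, 1.2, 1.0, 1.0, 0.9, 1.1, 1.3, 0.7),
  pweight = rep(c(0.5, 2.0), each = 4),
  y       = c(1, 0, 1, 1, 0, 0, 1, 0)
)

# Step 1: derive anweight, same formula as the data.table line above
toy$anweight <- toy$pspwght * toy$pweight * 10e3

# Step 2: declare the design as in the weighting guide
des <- svydesign(ids = ~psu, strata = ~stratum, weights = ~anweight,
                 data = toy)

# Weighted mean of y with a design-based standard error
est <- svymean(~y, des)
est
```

The point estimate is the anweight-weighted mean of `y`. The PSU and stratum information only affects the standard error.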

Details on analytical weights (ESS9+)

Quoting again from the weighting guide:

It is constructed by first deriving the design weight, then applying a post-stratification adjustment, and then a population size adjustment. Further details of how the weights are derived are documented in the round-specific report on the production of weights. Starting from Round 9, anweight is provided for you in the integrated data file. If you are using data from earlier ESS rounds, you can derive anweight yourself.

Full range of weighting variables, quoted from ESS9 codebook:

Notes:

Discussions

https://github.com/InductiveStep/R-notes/issues/1
https://github.com/ropensci/essurvey/issues/39
https://github.com/ropensci/essurvey/issues/9#issuecomment-502459202

Second link right above recommends the following for ESS4:

svydesign(
  ids = ~ psu + idno, # further comment at the link: specifying just `psu` would be enough
  strata = ~ stratify,
  weights = ~ dweight,
  nest = TRUE,
  data = ess4gb
)

Example: Andi Fugard, ESS9

Intermediate Quantitative Social Research, Birkbeck, University of London (2017-2020) https://inductivestep.github.io/R-notes/complex-surveys.html

Working on a multi-country example:

# using srvyr
as_survey_design(
  ids = idno, # instead of `psu` or `psu + idno` because `psu` is not in ESS9?
  strata = cntry,
  nest = TRUE,
  weights = pspwght
)

From the text:

The nest option takes account of the ids being nested within strata: in other words the same ID is used more than once across the dataset but only once in a country.
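A toy illustration of that point, assuming the `survey` package (which srvyr wraps) is installed. The data are invented and `toy`, `no_nest` and `with_nest` are hypothetical names. As far as I can tell, `svydesign()` refuses cluster ids that recur across strata unless `nest = TRUE` is set, so the first call is wrapped in `tryCatch()` to capture the error message instead of stopping.

```r
library(survey)

# Invented data: respondent ids restart at 1 in each country
toy <- data.frame(
  cntry   = rep(c("GB", "FR"), each = 3),
  idno    = rep(1:3, times = 2),
  pspwght = c(1.1, 0.9, 1.0, 1.2, 0.8, 1.0),
  y       = c(1, 0, 1, 0, 0, 1)
)

# Without nest = TRUE, the repeated ids look like clusters that span
# two strata; keep the error message rather than aborting
no_nest <- tryCatch(
  svydesign(ids = ~idno, strata = ~cntry, weights = ~pspwght, data = toy),
  error = function(e) conditionMessage(e)
)
no_nest

# With nest = TRUE, ids are treated as unique within each stratum
with_nest <- svydesign(ids = ~idno, strata = ~cntry, weights = ~pspwght,
                       data = toy, nest = TRUE)
svymean(~y, with_nest)
```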

Example: Federico Vegetti, ESS7

Introduction to Survey Statistics, University of Heidelberg, 2018 https://federicovegetti.github.io/teaching/heidelberg_2018/lab/sst_lab_day2.html

When working on countries separately:

# using srvyr
as_survey_design(weights = c(dweight, pspwght)) %>%
  group_by(cntry) %>%
  # etc.

# ... doesn't pspwght include dweight?
# ... what about stratum? psu?

When working on all countries together:

# using srvyr
as_survey(weights = c(dweight, pspwght, pweight))

Example: Daniel Oberski, ESS7

http://asdfree.com/european-social-survey-ess.html

Working on a single country (Belgium) after merging the data to the SDDF file:

svydesign(
  ids = ~psu ,
  strata = ~stratify,
  probs = ~prob,
  data = ess_df
)

briatte commented 1 year ago

ESS now featured in Session 12 via a spatial viz example.

briatte commented 1 year ago
library(tidyverse)

# list every zipped ESS download below the working directory
z <- fs::dir_ls(regexp = "\\.zip$", recurse = TRUE)
v <- tibble()
for (i in z) {

  cat(fs::path_file(i))
  d <- unzip(i, exdir = tempdir())
  f <- str_subset(d, "dta$")
  cat(" ->", fs::path_file(f), "...\n")
  d <- haven::read_dta(f)
  n <- names(d)
  n <- n[ n %in% c("essround", "cntry", "psu", "idno", "stratify", "stratum",
                   "dweight", "pspwght", "pweight", "prob", "anweight") ]
  v <- bind_rows(v, tibble(file = f, n))

}

v %>% 
  mutate(file = fs::path_file(file)) %>% 
  pivot_wider(values_from = n, names_from = n) %>% 
  mutate(essround = as.integer(str_extract(file, "\\d+"))) %>% 
  arrange(essround)
# A tibble: 14 × 11
   file          essround idno  cntry dweight pspwght pweight anweight prob  stratum psu  
   <chr>            <int> <chr> <chr> <chr>   <chr>   <chr>   <chr>    <chr> <chr>   <chr>
 1 ESS1e06_6.dta        1 idno  cntry dweight pspwght pweight NA       NA    NA      NA   
 2 ESS4AT.dta           4 idno  cntry dweight pspwght pweight NA       NA    NA      NA   
 3 ESS4LT.dta           4 idno  cntry dweight pspwght pweight NA       NA    NA      NA   
 4 ESS4e04_5.dta        4 idno  cntry dweight pspwght pweight NA       NA    NA      NA   
 5 ESS5ATe1_1.d…        5 idno  cntry dweight pspwght pweight NA       NA    NA      NA   
 6 ESS5e03_4.dta        5 idno  cntry dweight pspwght pweight NA       NA    NA      NA   
 7 ESS6e02_5.dta        6 idno  cntry dweight pspwght pweight anweight NA    NA      NA   
 8 ESS7SDDFe1_2…        7 idno  cntry NA      NA      NA      NA       prob  stratum psu  
 9 ESS7e02_2.dta        7 idno  cntry dweight pspwght pweight NA       NA    NA      NA   
10 ESS8SDDFe01_…        8 idno  cntry NA      NA      NA      NA       prob  stratum psu  
11 ESS8e02_2.dta        8 idno  cntry dweight pspwght pweight anweight NA    NA      NA   
12 ESS9ROe01.dta        9 idno  cntry dweight pspwght pweight anweight prob  stratum psu  
13 ESS9e03_1.dta        9 idno  cntry dweight pspwght pweight anweight prob  stratum psu  
14 ESS10.dta           10 idno  cntry dweight pspwght pweight anweight prob  stratum psu

briatte commented 1 year ago

Did some more tests, found weird things: https://github.com/gergness/srvyr/issues/157

Best guess, based on weighting guide:

as_survey_design(ids = psu,
                 strata = c(cntry, stratum),
                 nest = TRUE,
                 weights = anweight)

briatte commented 1 year ago

More tests with other designs. Conclusions:

library(srvyr)
library(tidyverse)

ess9_extract <- readr::read_rds("https://f.briatte.org/temp/ess9_extract.rds")

# Andi Fugard's design
ess9_af1 <- ess9_extract %>%
  as_survey_design(ids = idno, strata = cntry, nest = TRUE,
                   weights = pspwght)
# Fugard, using PSU
ess9_af2 <- ess9_extract %>%
  as_survey_design(ids = psu, strata = cntry, nest = TRUE,
                   weights = pspwght)

# weighting guide + cntry
ess9_wg1 <- ess9_extract %>%
  as_survey_design(ids = psu,
                   strata = c(cntry, stratum), # adding cntry
                   nest = TRUE,
                   weights = anweight)

# weighting guide, no cntry
ess9_wg2 <- ess9_extract %>%
  as_survey_design(ids = psu,
                   strata = stratum, # as recommended
                   nest = TRUE,
                   weights = anweight)

# Vegetti's design -- implicit `ids = idno`
ess9_mv1 <- ess9_extract %>%
  as_survey_design(weights = c(dweight, pspwght))
# Vegetti, using PSU
ess9_mv2 <- ess9_extract %>%
  as_survey_design(ids = psu, weights = c(dweight, pspwght))

# Oberski's design -- implicit `nest = TRUE`
# (note: the original example above passes `prob` via `probs`, not `weights`)
ess9_do <- ess9_extract %>%
  as_survey_design(ids = psu, strata = stratum, weights = prob)

# Stefan Zins' design
# https://github.com/ropensci/essurvey/issues/39#issuecomment-507855290
ess9_sz <- ess9_extract %>%
  as_survey_design(ids = psu, strata = stratum, weights = dweight)

# results -----------------------------------------------------------------

list("AF_idno" = ess9_af1, "AF_psu" = ess9_af2,
     "WG_cntry" = ess9_wg1, "WG_stratum" = ess9_wg2,
     "MV_idno" = ess9_mv1, "MV_psu" = ess9_mv2, "DO_psu" = ess9_do,
     "SZ_psu" = ess9_sz) %>%
  map_dfr(
    ~ .x %>%
      filter(cntry == "GB") %>%
      group_by(wltdffr_group) %>%
      summarise(prop = srvyr::survey_mean(vartype = "se")),
    .id = "design"
  ) %>%
  filter(wltdffr_group == "Fair") %>%
  arrange(-prop_se)
# A tibble: 8 × 4
  design     wltdffr_group  prop prop_se
  <chr>      <fct>         <dbl>   <dbl>
1 MV_psu     Fair          0.200  0.0204
2 MV_idno    Fair          0.200  0.0166
3 WG_cntry   Fair          0.196  0.0128
4 AF_psu     Fair          0.196  0.0128
5 WG_stratum Fair          0.196  0.0125
6 SZ_psu     Fair          0.190  0.0116
7 DO_psu     Fair          0.191  0.0104
8 AF_idno    Fair          0.196  0.0102

briatte commented 1 year ago

Availability of weighting vars:

… so, use ESS 9 or 10 in examples, or use 7 or 8 for one more example of a merge.