Dataset updates - GitHub issues

Closes #21, #22 and #23 (copied below), #27.

Update from 2023

Stop updating the data, really.

[ ] 'Freeze' as it is ~~(except for ESS, perhaps)~~
[ ] Archive the original frozen datasets/codebooks in data-raw/
[ ] Update srqm_data to use data-raw/
[ ] Slightly improve the _readme documents
- [ ] Document freezes
- [ ] Document codebook issues, e.g. #27
- [ ] Ideally, this would be in the Stata Guide…
[ ] Add WEP? #24

Detailed notes

QOG: ~~qog2023~~ -- since QOG 2023 is out
- freeze: qog2019
- would require rewriting code and looking at less clear results… see code at end of section
- only advantage would be lower codebook size → just downsample the 2019 one, it only loses the intra-doc links
- note the codebook issue! #27
- Perhaps simply drop the eu_* variables
GSS: ~~gss7221~~ -- since GSS has updated too
- freeze: gss7616 (but see below)
- not fun to keep only one year: keep ~~older years~~ one old year too
- ~~possibly break down single data into yearly ones?~~ restrict to 1976 and 2016
- would solve "max 2,048 vars" issue from #28
- ~~raises question as to how to zip it all (currently uses gss7616* to match files)~~
ESS: ~~ess2008~~ -- in order to continue using torture question?
- freeze: ess0816, or ess2008 and ess2016 (different codebooks, so it's fine)
- keep using Round 4 for both the torture example and the health services ones (results are not as clear-cut with Round 8)
- keep Round 8 to cover e.g. climate change
- problem: the DTA file is too large -- split it, to avoid the _merge problem
- document the existence of ess2016, even though it is not used anywhere in the course do-files
WVS: wvs9904 -- keep old version for sharia law question
- update to last version, check encoding
- possibly also include a more recent wave? (raises same question as ess2016)
NHIS: update to ~~nhis202* recent year~~ nhis1020?
- check if sampling frame and variables have changed first
- see below on how URL structure for fetching has changed
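
The GSS restriction to 1976 and 2016, together with the 2,048-variable ceiling mentioned above, could be checked along these lines. This is only a minimal sketch on a made-up stand-in dataset: the variables `year` and `age` and the six observations are assumptions for illustration, not the actual GSS file.

```stata
* Hypothetical stand-in for a cumulative survey file (NOT the real GSS data)
clear
set obs 6
generate year = 1976 + 20 * mod(_n, 3)   // cycles through 1996, 2016, 1976
generate age  = 20 + _n

* Keep only the two years that would be frozen
keep if inlist(year, 1976, 2016)

* describe stores the number of variables in r(k) and of observations in r(N)
quietly describe, short
local k = r(k)
local n = r(N)
display "variables: `k', observations: `n'"

* Fail loudly if the dataset would not open in Stata/IC
assert `k' <= 2048
```

Running the same `describe` and `assert` pair on each frozen dataset before zipping it would catch any file that exceeds the limit.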

Note on QOG -- the 2023 version offers only this as a replacement, which is not ideal:

// school life expectancy
sc wdi_fertility wef_lse, ms(i) mlab(ccodealp) || lfit wdi_fertility wef_lse, ///
    name(g1, replace)
// linear fit + SSA data points only, underpredicted
sc wdi_fertility wef_lse if ht_region == 4, ms(i) mlab(ccodealp) || ///
    lfit wdi_fertility wef_lse, ///
    name(g2, replace)
// all regions
forv i = 1/10 {
    sc wdi_fertility wef_lse if ht_region == `i', ms(i) mlab(ccodealp) || ///
    lfit wdi_fertility wef_lse, ///
    name("region`i'", replace)
}

The plan for 2021:

[ ] Redraw table of use in do-files, to check they are all used a fair number of times.
- Students actually need this to see the data in use.
[ ] Update QOG to January 2021 (2017 ± 3 years). This will also fix a codebook issue (#27).
- https://www.gu.se/en/quality-government/qog-data/data-downloads/standard-dataset
[ ] Update GSS to 2018.
- http://gss.norc.org/get-the-data/stata
- Nice example use: https://kieranhealy.org/blog/archives/2019/03/22/a-quick-and-tidy-look-at-the-2018-gss/
- Have only one year? Also include e.g. 2008?
- [ ] Rewrite week12.do.
[ ] Update ESS
- https://www.europeansocialsurvey.org/data/round-index.html
- [ ] Round 9 (2018) is out.
- [ ] Check results on week6.do (which uses Round 4 only right now, despite trrtort also existing for Round 8).
- [ ] Have only Round 4? (interview dates 2008–2010)
- [ ] Call it ess0810 — note: in previous course versions, ess0810 contained Rounds 4 (2008) and 5 (2010)
[ ] Update NHIS to 2010 + 2019 (?)
- https://www.cdc.gov/nchs/nhis/2019nhis.htm
- Year 2019 is out, BUT filenames differ:
- Names in 2019: ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/Program_Code/NHIS/2019/
- Names in 2018 (and before up to 2010): ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/Program_Code/NHIS/2018/
[ ] Update WVS Round 4 to 2020 version
- [ ] Check results and encoding issue in variable label (update: still there in Stata 13, not Stata 14+).

Additional things to consider:

Dataset names

I like the initial "acronym + year" convention, but it produces strange names for multiple-year survey datasets:

ess1214 (not used) and ess0816
wvs9904 (unavoidable)
nhis1017 (unavoidable, unless we use a single year, but that removes any demo of keep if year)
gss7616 (unavoidable, unless we separate the years)

Merged datasets

Is it still a good idea to do that for e.g. ESS? Probably not, esp. if we need to limit datasets to 2,048 variables for Stata/IC.

[ ] Keep NHIS with multiple years. Use it to demo keep if year.
[ ] Keep WVS with multiple years (country-dependent).
[ ] Break down GSS.
[ ] Break down ESS.

Both WVS and ESS are used to demo keep if inlist(country, …), the other subset we want to show.
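
For reference, the two subsetting demos could look roughly like the sketch below. The five-row dataset and its `country` and `year` variables are invented for illustration and do not come from the course data.

```stata
* Hypothetical toy data (NOT an actual course dataset)
clear
input str2 country int year
"FR" 2008
"DE" 2008
"GB" 2016
"FR" 2016
"PL" 2016
end

* Demo 1: subset by year, then bring the full data back
preserve
keep if year == 2016
count
restore

* Demo 2: subset by country
keep if inlist(country, "FR", "DE")
count
```

The `preserve` and `restore` pair keeps the first demo from discarding rows that the second one needs.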

Additional datasets

It would make a lot of sense to have more datasets for the students to use than those used in the do-files.

Currently, the do-files are selective anyway: we provide ESS 2016 (Round 8) but do not use the data, even though the dependent variable also exists in that round.

GSS has a single codebook, so bundling many years would duplicate the codebook in the ZIP archives. Not ideal.
ESS could be broken down to Rounds 4 (2008), 8 (2016) and 9 (2018).
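
Breaking a merged file down by round could be sketched as follows. The toy data and the use of temporary files are assumptions made for illustration; a real script would save to named files following the dataset naming convention instead.

```stata
* Hypothetical merged file with three rounds (NOT the real ESS data)
clear
input byte essround str2 cntry
4 "FR"
4 "DE"
8 "FR"
9 "DE"
9 "FR"
end

* Write one file per round; tempfiles stand in for real output names
levelsof essround, local(rounds)
foreach r of local rounds {
    preserve
    keep if essround == `r'
    tempfile round`r'
    save `round`r''
    restore
}

* Reload one of the pieces to check its content
use `round9', clear
count
```

Each piece then carries only its own round, which sidesteps the need to merge rounds with different codebooks.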

briatte / srqm

Dataset updates #30
