briatte / srqm

An introductory statistics course for social scientists, using Stata

Many QOG variables have low-N for cross-sectional analysis #29

Open briatte opened 3 years ago

briatte commented 3 years ago
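library(tidyverse)

d <- haven::read_dta('/Users/fr/Documents/Teaching/SRQM/data/qog2019.dta')

tibble(
  var = names(d),
  # data sources (variable-name prefix, up to the first underscore)
  src = str_extract(names(d), ".*?_"),
  # number of non-missing observations per variable
  n = apply(d, 2, function(x) sum(!is.na(x)))
) %>%
  group_by(src) %>%
  summarise(n_vars = n(), min_N = min(n), med_N = median(n), max_N = max(n)) %>%
  arrange(min_N) %>%
  # arbitrary threshold at N = 50; variables without a source prefix are dropped
  filter(!is.na(src), min_N < 50) %>%
  print(n = 100)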

PSI, EU, OECD, WWBI and a few others are particularly at fault:

# A tibble: 28 x 5
   src     n_vars min_N med_N max_N
   <chr>    <int> <int> <dbl> <int>
 1 psi_         6     1  10.5    20
 2 mad_         4    15  29     163
 3 eu_        277    16  34      48
 4 une_        47    16 146     193
 5 wwbi_       38    17  41      62
 6 oecd_      281    19  37      44
 7 wdi_       278    19 156     192
 8 dev_         4    20  20      20
 9 dpi_        70    26 160.    175
10 bs_          8    28  28      28
11 ess_         9    28  28      28
12 ideavt_      6    28 107     180
13 wel_        36    29  32     189
14 wvs_        42    29  34      34
15 aid_         6    31 139     139
16 cses_        2    31  31.5    32
17 gol_        20    33 127     129
18 wiid_       18    34  35      35
19 ucdp_        2    35  70     105
20 cpds_       49    36  36      36
21 h_          11    37 165     185
22 lis_        23    37  37      37
23 r_           5    40  98     144
24 sgi_        29    41  41      41
25 top_         2    41  41      41
26 nelda_      10    44  45      45
27 vi_         13    45  48      50
28 qs_          9    47 112     115

This is not a bug, but it leads students to build research designs with low sample sizes.
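
One possible way to help students spot the problem before they commit to a design is to run the same check at the variable level rather than the source level. The sketch below is only an illustration: `low_n_vars()` is a hypothetical helper that is not part of the course code, and `toy` is made-up data standing in for the QOG cross-section, with invented variable names.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data standing in for the QOG cross-section;
# the variable names are invented for illustration only.
toy <- tibble(
  ccode  = 1:6,
  wdi_a  = c(1.2, NA, 3.4, 4.1, NA, 6.0),
  psi_b  = c(NA, NA, NA, 0.5, NA, NA),
  oecd_c = c(10, 20, NA, NA, NA, NA)
)

# Hypothetical helper: count non-missing observations per variable
# and keep only the variables whose N falls below the threshold.
low_n_vars <- function(data, threshold = 50) {
  tibble(
    var = names(data),
    n   = unname(colSums(!is.na(data)))
  ) %>%
    filter(n < threshold) %>%
    arrange(n)
}

# With a threshold of 3, this flags psi_b (N = 1) and oecd_c (N = 2)
low_n_vars(toy, threshold = 3)
```

On the real dataset, the call would presumably be `low_n_vars(d)`, keeping the same arbitrary threshold of N = 50 used above.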