NickCH-K / causaldata

Packages of Example Data for The Effect

Missing columns #3

Closed grantmcdermott closed 2 years ago

grantmcdermott commented 3 years ago

I ran into trouble trying to replicate some of the DiD examples in Scott's book using the abortion dataset. Looks like the causaldata version is missing some columns.

abortion_orig = haven::read_dta("https://raw.github.com/scunning1975/mixtape/master/abortion.dta") 
data("abortion", package = "causaldata")

setdiff(names(abortion_orig), names(abortion))
#>  [1] "totcase"    "rate"       "totrate"    "id"         "black"     
#>  [6] "perc1519"   "perc"       "aids"       "aidscapita" "ac"        
#> [11] "trend"      "tsq"        "blk"        "female"     "t"         
#> [16] "wm15"       "wf15"       "bm15"

Created on 2021-11-15 by the reprex package (v2.0.1)

NickCH-K commented 3 years ago

Hmm, shoot. I definitely did drop some columns from Scott's data sets because they were too big, but I thought I kept all the ones necessary to run all the examples. Which example(s) were you unable to run?

grantmcdermott commented 2 years ago

Hey Nick, the missing t variable triggers an error for this example: https://github.com/scunning1975/mixtape/blob/master/R/abortion_ddd.R#L28-L29

(Between you and me, that model looks over-specified and I'd run it without the t, but it's an example of one that fails using the causaldata version.)

grantmcdermott commented 2 years ago

Update: the t variable is just a linear time trend, so it could be substituted with the existing year column. OTOH, the code Scott has in the book turns it into a factor, so that substitution would probably cause issues. You might also just want to drop it back in so people can run his code as-is.
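For what it's worth, here's a minimal sketch of that substitution on a toy stand-in for the panel (the column name year is from the dataset; the rescaling choice is just one reasonable convention):

```r
# Toy stand-in for the abortion panel: rebuild the linear trend `t`
# from `year`, then factor it the way the book's code does.
abortion <- data.frame(year = rep(1985:1990, 2))

# `t` as a trend starting at 1 in the first sample year
abortion$t <- abortion$year - min(abortion$year) + 1

# the book's code treats `t` as a factor
abortion$t_factor <- factor(abortion$t)
```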

Another one along those lines that I've just bumped into is the castle dataset. It's missing the post column that feeds in here. Again, it's easy enough to create yourself, but if your goal is for people to use this as a drop-in replacement...
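A rough sketch of rebuilding post by hand, on a toy stand-in and assuming the data carries a law-effective-year column (I'm calling it effyear here, with NA for never-treated states; check the actual column name before copying this):

```r
# Toy stand-in for the castle panel: two states, three years,
# state A's law takes effect in 2006, state B is never treated.
castle <- data.frame(
  state   = rep(c("A", "B"), each = 3),
  year    = rep(2005:2007, 2),
  effyear = rep(c(2006, NA), each = 3)
)

# post = 1 in treated-state years at or after the effective year
castle$post <- as.integer(!is.na(castle$effyear) & castle$year >= castle$effyear)
```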

P.S. There's a slightly different complication related to the fact that these .dta-sourced datasets all have haven_labelled columns. This means that various operations — e.g. demeaning — won't work unless you load the haven package too and expose the underlying haven methods. Would you like me to raise that in a separate GH issue?
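For reference, one user-side workaround is to strip the labels up front with haven::zap_labels(), which converts labelled columns back to plain vectors so arithmetic just works (a sketch on a made-up labelled column, not the package data itself):

```r
library(haven)

# A made-up haven_labelled column, standing in for a .dta-sourced one
df <- data.frame(x = labelled(c(1, 2, 3), labels = c(low = 1, high = 3)))

# zap_labels() drops the labelled class, leaving plain numeric columns,
# after which operations like demeaning behave normally
df <- zap_labels(df)
df$x_demeaned <- df$x - mean(df$x)
```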

NickCH-K commented 2 years ago

The columns have been added, and the haven_labelled designations removed, with a46ece5b2e2cb0d099ac72bf0c39755a6c2d0dc9

This is on the GitHub versions. Will work on pushing these to ssc/CRAN/PyPI soon.