IQSS / Zelig

A statistical framework that serves as a common interface to a large range of models
http://zeligproject.org
109 stars 43 forks source link

Setting a range of x #164

Open christophergandrud opened 7 years ago

christophergandrud commented 7 years ago

Before I go digging through the code, I was wondering if I'm missing something obvious.

I would like to setx for a range of values of a variable. This works fine using the Zelig4 wrappers, e.g.:

z4 <- zelig(Fertility ~ Education, data = swiss, model = 'ls')
z4_set <- setx(z4, Education = 5:15)
z4_sim <- sim(z4, z4_set)
ci.plot(z4_sim)

I've tried to do something similar with:

z5$setx(Education = 5:15)

but get the error:

 Error in model.frame.default(object, data, xlev = xlev) : 
  variable lengths differ (found for 'Education')

and with

z5$setrange(Education = 5:15)

This doesn't return any errors, but the summary is:

range:
  (Intercept) Education
1           1         5

Next step: Use 'sim' method

What am I missing here?

cchoirat commented 7 years ago

I can't replicate:

library(Zelig)
z5 <- zls$new()
z5$zelig(Fertility ~ Education, data = swiss)
z5$setrange(Education = 5:10)
z5$sim()
ci.plot(z5)

image

> z5$setx.out
$range
$range[[1]]
Source: local data frame [1 x 2]
Groups: <by row>

# A tibble: 1 × 2
     by            mm
* <dbl>        <list>
1     1 <dbl [1 × 2]>

$range[[2]]
Source: local data frame [1 x 2]
Groups: <by row>

# A tibble: 1 × 2
     by            mm
* <dbl>        <list>
1     1 <dbl [1 × 2]>

$range[[3]]
Source: local data frame [1 x 2]
Groups: <by row>

# A tibble: 1 × 2
     by            mm
* <dbl>        <list>
1     1 <dbl [1 × 2]>

$range[[4]]
Source: local data frame [1 x 2]
Groups: <by row>

# A tibble: 1 × 2
     by            mm
* <dbl>        <list>
1     1 <dbl [1 × 2]>

$range[[5]]
Source: local data frame [1 x 2]
Groups: <by row>

# A tibble: 1 × 2
     by            mm
* <dbl>        <list>
1     1 <dbl [1 × 2]>

$range[[6]]
Source: local data frame [1 x 2]
Groups: <by row>

# A tibble: 1 × 2
     by            mm
* <dbl>        <list>
1     1 <dbl [1 × 2]>
> sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: macOS Sierra 10.12.1

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] Zelig_5.0-13

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.9        nloptr_1.0.4       plyr_1.8.4         miscTools_0.6-22   tools_3.3.2       
 [6] lme4_1.1-13        MatchIt_2.4-21     jsonlite_1.2       tibble_1.2         nlme_3.1-128      
[11] lattice_0.20-34    mgcv_1.8-16        Matrix_1.2-7.1     DBI_0.5-1          maxLik_1.3-4      
[16] parallel_3.3.2     SparseM_1.74       coda_0.19-1        AER_1.2-5          dplyr_0.5.0       
[21] MatrixModels_0.4-1 stats4_3.3.2       lmtest_0.9-34      grid_3.3.2         nnet_7.3-12       
[26] R6_2.2.0           geepack_1.2-1      survival_2.40-1    VGAM_1.0-3         foreign_0.8-67    
[31] Amelia_1.7.4       minqa_1.2.4        Formula_1.2-1      car_2.1-4          magrittr_1.5      
[36] mcmc_0.9-4         MASS_7.3-45        splines_3.3.2      assertthat_0.1     pbkrtest_0.4-6    
[41] quantreg_5.29      sandwich_2.3-4     survey_3.31-5      MCMCpack_1.3-8     lazyeval_0.2.0    
[46] zoo_1.7-14    
christophergandrud commented 7 years ago

Ah, I see that object needs to go into the ci.plot function.

I was confused because the Zelig5 examples use the graph method rather than the ci.plot function.

I'm wondering if if this could be improved from a UX perspective in two ways:

christophergandrud commented 7 years ago

Added a Zelig5 setrange/ci.plot example to the README: 0b719a9bc4fd3fda85a5537091323b117806e909

christophergandrud commented 7 years ago

The graph method now can tell if a range of x have been provided and call ci.plot: 77ebffc7bd0f76e297527c6e1514bb808d1135d6

izahn commented 7 years ago

Here are my thoughts about this. They may not be coherent, but perhaps useful nonetheless.

There is as fundamental tension between setting a range of x values and providing first differences. If we use setx and setx1 then first differences make sense. If we specify a range of x values than first differences really don't make sense. Providing first differences is one of the "selling points" of Zelig, so setx and setx1 are emphasized.

Resolving this tension requires recognizing that we can't have our cake and eat it too. There are two different things that we want to do, and we need a clear distinction between them. I would argue that the only advantage of setx/setx1 over setxrange is that they allow for first differences. In all other cases setxrange is the way to go. If we agree that this is true (it may not be, these are just my ramblings) then I propose that we a) get rid of setxrange and setx1, b) use setx to mean what setxrange currently means, and c) write a new function firstdiff that does what setx/setx1 currently does.

I have more thoughts along these lines, but I'll stop here to see how this line of thought is received before proposing even more radical changes.

christophergandrud commented 7 years ago

@izahn that is a pretty intriguing idea. Making the substantive uses of setx and setrange functionality more explicit could definitely help users.

I guess one big thing to think through is backwards compatibility and minimising software complexity.

tercer commented 7 years ago

Two points: [1] About method and wrapper calls.

setrange() isn't necessary when using the wrappers, and this is one place where the wrapper and method behaviour could be brought closer together. Here's a working code example using just setx wrapper:

screen shot 2017-01-24 at 1 10 10 pm screen shot 2017-01-24 at 1 10 31 pm

[2] About Ista's point.

The range argument is most commonly used for showing predicted or expected values over a range of some variable of interest. The concept for this goes back all the way to figure 1 of King Tomz Wittenburg, and in my mind is pretty central to what Zelig should do. It didn't always work in Zelig 3.5 and when I first saw the Zelig 4 code had been removed.

While I agree that first differences are a great way to interpret models (especially nonlinear ones), predicted and expected values have their place. Without some adaptation of the range approach, one is left with tables of predicted and expected values at very static points in the space of the covariates. If I have a paper with a number of variables, but one of key interest, I'm probably going to tell illustrate a story of how things change as I move across my key variable. Knowing how to build such a graph yourself has always been a key assignment of any grad methods class I've taught, but it's a lot of leg work. Showing it can be done easily in Zelig is normally a moment when it clicks that Zelig can be useful for abstracting away a lot of the effort of model interpretation.

Anyway, in the way they have evolved in Zelig, setx has always been about giving predicted and expected values, and setx1 (which is a strange name, but seems stuck with us) has been the additional function if one also wanted first differences.

(Final point is, that it is also possible to use the ci.plot for creating first differences across two sets of ranges. What that means is very specific to how it is constructed, but is potentially a useful illustrative technique also.)

christophergandrud commented 7 years ago
christophergandrud commented 7 years ago

The workflow for setx with interactions also needs improvement.

For example, this runs fine:

  states <- as.data.frame(state.x77)
  z <- zelig(Murder ~ Income * Population, data = states, model = 'ls')
  s1 <- setx(z, Population = 1000:2000, Income = 3098)
  s2 <- setx(z, Population = 1000:2000, Income = 6315)
  z <- sim(z, x = s1, x1 = s2)
  z$graph()

But this doesn't:

states <- as.data.frame(state.x77)
z <- zelig(Murder ~ Income + Population, data = states, model = 'ls')
z <- z$setrange(Population = 1000:2000, Income = 3098)
z <- z$setrange1(Population = 1000:2000, Income = 6315)

## Error: attempt to apply non-function

More details:

states <- as.data.frame(state.x77)
z <- zelig(Murder ~ Income + Population, data = states, model = 'ls')
z <- z$setrange(Population = 1000:2000, Income = 3098)
summary(z)

## Length  Class   Mode 
##     0   NULL   NULL 
christophergandrud commented 7 years ago

Related to: https://github.com/IQSS/Zelig/issues/75