Multinomial logit with different variable lists by response category

christophergandrud commented 7 years ago

Reported by John Jackson via email:

Running a multinomial logit and specifying different variable lists for some categories returns the following error message, "Error in formula.default(object, env = baseenv()) : invalid formula", e.g.:

summary(churchpower.out <- zelig(list(
    id(churchpower,"0")~comprehend+interest+cooperate+age+interest+ed+female+retired+dpolecsit+churchatt+priests,
    id(churchpower,"1")~age+interest+ed+female+retired+dpolecsit+churchatt+priests,
    id(churchpower,"2")~age+interest+ed+female+retired+dpolecsit+churchatt+priests,
    id(churchpower,"3")~age+interest+ed+female+retired+dpolecsit+churchatt+priests,
    id(churchpower,"4")~age+interest+ed+female+retired+dpolecsit+churchatt+priests),
    model="mlogit", data=pgss02)) 

## Error in formula.default(object, env = baseenv()) : invalid formula

If list of right hand side variables is not varied:

summary(churchpower.out <- zelig(as.factor(churchpower)~age+ed+inc+female+dfinsit+dpolecsit+pnew, 
    model="mlogit", data=pgss02))

it works fine.

tercer commented 7 years ago

This is the same underlying issue as #128 #223 IQSS/ZeligChoice#14 IQSS/ZeligChoice#17 , that is, there is no presently enabled syntax for systems of equations.

The trajectory on this issue, was that it was considered an essential part of multilevel/hierarchical models, and so a syntax would be devised in hand with making those models, which are #35 #113 . There was a previous syntax in version 3.5.

These are things @cchoirat has been working on/thinking about for a while. Maybe she can give us a present view on what syntax would work best. I'm putting this up as background, that might be helpful if you're self assigning the issue, or thinking of taking over these tasks. If you decide to reassign this to yourself, I'd be happy to join in.

christophergandrud commented 7 years ago

If @cchoirat is already on this, then that's great.

I guess one issue is whether we want to pull back in what was in v3.5 to reestablish backwards compatibility (John uses this example in teaching and it used to work), before getting to the broader solution? Or should it all be rolled out at once?

christophergandrud commented 7 years ago

fyi: Multiple formulas are mentioned for mlogit in the current documentation: http://docs.zeligproject.org/en/latest/zeligchoice-mlogit.html

christophergandrud commented 7 years ago

If I understand correctly (using the source from the CRAN archive) in Zelig 3.5, mlogit was specified using:

zelig2mlogit <- function(formula, model, data, M, ...) {
        mf <- match.call(expand.dots = TRUE)
        mf[[1]] <- VGAM::vglm 
        mf$family <- VGAM::multinomial
        formula<-parse.formula(formula,model,data)
        tt<-terms(formula)
        fact<-attr(tt,"depFactors")$depFactorVar
        ndim<-length(attr(tt,"depFactors")$depLevels)
        tmp <- cmvglm(formula, mf$model, ndim,data,fact)
        mf$formula <- tmp$formula  
        mf$constraints <- tmp$constraints
        mf$model <- mf$M <- NULL
        as.call(mf)
}

with a lot of the work of parsing the formula being done by an internal function called parse.formula (As it's no longer in the Zelig source, I've re-hosted it here: https://gist.github.com/christophergandrud/42bccce985640c5a43948523f1ae46ad)

Also looks like we need cmvglm:

cmvglm <- function(formula, model, ndim,data=NULL, fact=NULL){

  toBuildFormula<-function(Xnames,sepp="+"){
    lng<-length(Xnames)
    rhs<-NULL
    if (lng!=0){
      if(lng==1){
        rhs=Xnames
      }else{
        for (j in 1:(lng-1)){
          rhs<-paste(rhs,as.name(Xnames[[j]]))
          rhs<-paste(rhs,sepp)
        }
        rhs<-paste(rhs,Xnames[[lng]])
      }
    }
    return (rhs)
  }
  tt<-terms(formula)
  attr(tt,"systEqns")<-names(formula)
  p<-make.parameters(tt,shape="matrix")
  vars<-rownames(p)
  cm<-vector("list", length(vars))
  names(cm)<-vars

    for(i in 1:length(cm))
      cm[[i]]<-diag(1, ndim)

  constrain<-attr(tt,"constraints")
  if(!is.logical(constrain)){
    tmp <- sort(colnames(constrain))
    for (i in 1:length(tmp)) {
      ci<-constrain[,i]
      if (is.null(na.omit(ci)) || length(unique(na.omit(ci)))!=1)
        stop("invalid input for constrain")
      minj <- match(FALSE, is.na(ci))
      whatvar <- pmatch(unique(na.omit(ci)), names(cm))
      for (j in 1:3)
        if (!is.na(ci[j])) {
          cm[[whatvar]][j,j]<-0
          cm[[whatvar]][j,minj]<-1
        }
    }
  }
  for(i in rownames(p)){
    for(j in 1:ncol(p)){
      if(is.na(p[i,j]))
        cm[[i]][j,j]<-0
    }
  }

 # if(!is.null(constant))
 #   for(i in 1:length(constant))
 #     for(j in 1:length(cm))
 #       if(names(cm)[j]!="(Intercept)")
 #         cm[[j]][constant[i],]<-matrix(0, ncol=ncol(cm[[j]]))

  for(i in 1:length(cm))
    cm[[i]]<-as.matrix(cm[[i]][,apply(cm[[i]], 2, sum)!=0])
  rhs<-toBuildFormula(attr(tt,"indVars"))
  if(!(is.null(rhs)))
    rhs<-(paste("~",rhs))
  else
    rhs<-"~1"
  Ynames<-unlist(attr(tt,"depVars"))
  if(!is.null(fact))
    lhs<-fact
  else{
    if(length(Ynames)>1){
      lhs<-toBuildFormula(Ynames,",")
      if (!(is.null(lhs))){
        lhs<-paste("cbind(",lhs)
        lhs<-paste(lhs,")")
      }
    }else{
      lhs=Ynames
    }
  }
  formula<-as.formula(paste(lhs,rhs))
  list("formula"=formula, "constraints"=cm)
}

christophergandrud commented 7 years ago

I guess my question is, what are the impediments to reintroducing this set up to at least establish backwards compatibility for this feature in mlogit (at least as a bandage while we work on a more general solution)?

tercer commented 7 years ago

(I removed the documentation issue for mlogit in zeligproject.org.)

christophergandrud commented 7 years ago

Here is a more easily reproducible example (from the old docs):

library(Zelig)
library(ZeligChoice)

data("mexico")
z.out2 <- zelig(list(id(vote88,"1") ~ pristr + othcok, 
                               id(vote88,"2") ~ othsocok), 
                model = "mlogit", data = mexico)

christophergandrud commented 7 years ago

One option, that I don't really like, would be to wrap mlogit rather than VGAM.

On the positive side, this uses a version of Formula that allows much more complex relationships than is allowed in Zelig. It doesn't look like it would break anything that isn't already broken. It would also broadly match a multiple equations syntax based on Formula more broadly.
On the negative side (at least) there is a non-Zelig-style mlogit.datastage to format the data pre-estimation.
On the plus and negative side: This could work for all models in ZeligChoice except for the bivariate logit/probit. Though using Formula, we might be able to reimplement the later models in a unified syntax, with perhaps measures taken for backwards compatibility (Formula allows us to specify y1 + y2 ~ x1, but the current implementation is cbind(y1, y2) ~ x1

tercer commented 7 years ago

Just as a point of information, in terms of ever thinking about standardizing, the EI models use the cbind(y1, y2)~ notation currently, while the survival models use `Surv(y, c)~' (for outcome y, censored indicator c). Both of those I believe were adhering to what was used in Zelig 3.5.

christophergandrud commented 7 years ago

Thanks for pointing those out!

christophergandrud commented 7 years ago

For Survival models . . . maybe we should keep the Surv(. . . .) syntax as this matches the syntax in the Survival package and really has a different set of needs than what Formula addresses.

christophergandrud commented 7 years ago

End of Max Kuhn's post: on R Formula's the bad parts

Some limitations of the current formula interface can be mitigated by writing your own or utilizing the Formula package.

However, there are a number of conceptual aspects (e.g., roles, sequential processing) that would require a completely different approach to defining a design matrix, and this will be the focus of an upcoming tidyverse package.

christophergandrud commented 6 years ago

Issue is also with bivariate probit. See: https://groups.google.com/forum/#!topic/zelig-statistical-software/n5CQnXeQvAM

IQSS / Zelig

Multinomial logit with different variable lists by response category #240