Open christophergandrud opened 7 years ago
This is the same underlying issue as #128, #223, IQSS/ZeligChoice#14, and IQSS/ZeligChoice#17: there is currently no enabled syntax for systems of equations.
The trajectory on this issue was that such a syntax was considered an essential part of multilevel/hierarchical models, so it would be devised hand in hand with building those models (#35, #113). There was a previous syntax in version 3.5.
These are things @cchoirat has been working on and thinking about for a while. Maybe she can give us a current view on what syntax would work best. I'm putting this up as background that might be helpful if you're self-assigning the issue or thinking of taking over these tasks. If you decide to reassign this to yourself, I'd be happy to join in.
If @cchoirat is already on this, then that's great.
I guess one issue is whether we want to pull back in what was in v3.5 to reestablish backwards compatibility (John uses this example in teaching and it used to work), before getting to the broader solution? Or should it all be rolled out at once?
FYI: multiple formulas are mentioned for mlogit in the current documentation: http://docs.zeligproject.org/en/latest/zeligchoice-mlogit.html
If I understand correctly (using the source from the CRAN archive), in Zelig 3.5 mlogit was specified using:
zelig2mlogit <- function(formula, model, data, M, ...) {
  mf <- match.call(expand.dots = TRUE)
  mf[[1]] <- VGAM::vglm
  mf$family <- VGAM::multinomial
  formula<-parse.formula(formula,model,data)
  tt<-terms(formula)
  fact<-attr(tt,"depFactors")$depFactorVar
  ndim<-length(attr(tt,"depFactors")$depLevels)
  tmp <- cmvglm(formula, mf$model, ndim,data,fact)
  mf$formula <- tmp$formula
  mf$constraints <- tmp$constraints
  mf$model <- mf$M <- NULL
  as.call(mf)
}
with a lot of the work of parsing the formula being done by an internal function called parse.formula.
(As it's no longer in the Zelig source, I've re-hosted it here: https://gist.github.com/christophergandrud/42bccce985640c5a43948523f1ae46ad)
It also looks like we need cmvglm:
cmvglm <- function(formula, model, ndim,data=NULL, fact=NULL){
  toBuildFormula<-function(Xnames,sepp="+"){
    lng<-length(Xnames)
    rhs<-NULL
    if (lng!=0){
      if(lng==1){
        rhs=Xnames
      }else{
        for (j in 1:(lng-1)){
          rhs<-paste(rhs,as.name(Xnames[[j]]))
          rhs<-paste(rhs,sepp)
        }
        rhs<-paste(rhs,Xnames[[lng]])
      }
    }
    return (rhs)
  }
  tt<-terms(formula)
  attr(tt,"systEqns")<-names(formula)
  p<-make.parameters(tt,shape="matrix")
  vars<-rownames(p)
  cm<-vector("list", length(vars))
  names(cm)<-vars
  for(i in 1:length(cm))
    cm[[i]]<-diag(1, ndim)
  constrain<-attr(tt,"constraints")
  if(!is.logical(constrain)){
    tmp <- sort(colnames(constrain))
    for (i in 1:length(tmp)) {
      ci<-constrain[,i]
      if (is.null(na.omit(ci)) || length(unique(na.omit(ci)))!=1)
        stop("invalid input for constrain")
      minj <- match(FALSE, is.na(ci))
      whatvar <- pmatch(unique(na.omit(ci)), names(cm))
      for (j in 1:3)
        if (!is.na(ci[j])) {
          cm[[whatvar]][j,j]<-0
          cm[[whatvar]][j,minj]<-1
        }
    }
  }
  for(i in rownames(p)){
    for(j in 1:ncol(p)){
      if(is.na(p[i,j]))
        cm[[i]][j,j]<-0
    }
  }
  # if(!is.null(constant))
  #   for(i in 1:length(constant))
  #     for(j in 1:length(cm))
  #       if(names(cm)[j]!="(Intercept)")
  #         cm[[j]][constant[i],]<-matrix(0, ncol=ncol(cm[[j]]))
  for(i in 1:length(cm))
    cm[[i]]<-as.matrix(cm[[i]][,apply(cm[[i]], 2, sum)!=0])
  rhs<-toBuildFormula(attr(tt,"indVars"))
  if(!(is.null(rhs)))
    rhs<-(paste("~",rhs))
  else
    rhs<-"~1"
  Ynames<-unlist(attr(tt,"depVars"))
  if(!is.null(fact))
    lhs<-fact
  else{
    if(length(Ynames)>1){
      lhs<-toBuildFormula(Ynames,",")
      if (!(is.null(lhs))){
        lhs<-paste("cbind(",lhs)
        lhs<-paste(lhs,")")
      }
    }else{
      lhs=Ynames
    }
  }
  formula<-as.formula(paste(lhs,rhs))
  list("formula"=formula, "constraints"=cm)
}
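To make the last part of cmvglm easier to follow, here is a minimal base-R sketch of the step that turns a table of "which term enters which equation" into one constraint matrix per term. The table `p` below is hypothetical (in Zelig 3.5 it came from make.parameters, whose exact output I'm not reproducing here), and the term names are just borrowed from the mexico example for illustration.

```r
# Hypothetical presence table: rows are terms, columns are equations.
# NA marks a term that does not enter that equation (an assumption about
# the shape of what make.parameters returned, not its actual output).
p <- matrix(c("b0.1", "b0.2",
              "b1.1", NA,
              "b2.1", NA,
              NA,     "b3.2"),
            ncol = 2, byrow = TRUE,
            dimnames = list(c("(Intercept)", "pristr", "othcok", "othsocok"),
                            c("eq1", "eq2")))
ndim <- ncol(p)  # number of linear predictors

cm <- lapply(rownames(p), function(term) {
  m <- diag(1, ndim)             # start from "term enters every equation"
  absent <- is.na(p[term, ])     # equations the term is excluded from
  diag(m)[absent] <- 0           # switch those equations off
  m[, colSums(m) != 0, drop = FALSE]  # drop all-zero columns
})
names(cm) <- rownames(p)

cm[["pristr"]]    # 2 x 1 matrix: only the first equation
cm[["othsocok"]]  # 2 x 1 matrix: only the second equation
```

A named list of this shape (one matrix per term, one row per linear predictor) is what the quoted code assigns to mf$constraints before the call to VGAM::vglm.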
I guess my question is, what are the impediments to reintroducing this setup to at least establish backwards compatibility for this feature in mlogit (at least as a bandage while we work on a more general solution)?
(I removed the documentation issue for mlogit in zeligproject.org.)
Here is a more easily reproducible example (from the old docs):
library(Zelig)
library(ZeligChoice)
data("mexico")
z.out2 <- zelig(list(id(vote88,"1") ~ pristr + othcok,
                     id(vote88,"2") ~ othsocok),
                model = "mlogit", data = mexico)
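For anyone picking this up, here is a rough base-R sketch of the first thing a replacement for parse.formula would need to do with that list: pull the response variable, the outcome level, and the right-hand-side terms out of each `id(var, "level") ~ ...` equation. `parse_id_equation` is a hypothetical helper name, not something in Zelig, and this only covers the simple case shown in the example.

```r
# The equations from the example above; building a formula does not
# evaluate id(), so this runs without Zelig loaded.
eqs <- list(id(vote88, "1") ~ pristr + othcok,
            id(vote88, "2") ~ othsocok)

# Hypothetical helper: split one equation into its pieces.
parse_id_equation <- function(f) {
  lhs <- f[[2]]
  stopifnot(is.call(lhs), identical(lhs[[1]], as.name("id")))
  list(response = as.character(lhs[[2]]),   # e.g. "vote88"
       level    = as.character(lhs[[3]]),   # e.g. "1"
       terms    = attr(terms(f), "term.labels"))
}

parsed <- lapply(eqs, parse_id_equation)

# Which term appears in which equation (rows: terms, columns: equations).
all_terms <- unique(unlist(lapply(parsed, `[[`, "terms")))
presence <- sapply(parsed, function(e) all_terms %in% e$terms)
rownames(presence) <- all_terms
presence
```

The `presence` table is the kind of information the old code used to decide which constraint matrices to pass on to VGAM.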
One option, which I don't really like, would be to wrap the mlogit package rather than VGAM.
On the positive side, this uses a version of Formula that allows much more complex relationships than Zelig currently allows. It doesn't look like it would break anything that isn't already broken. It would also fit with a more general multiple-equations syntax based on Formula.
On the negative side, (at least) there is a non-Zelig-style mlogit.data stage to format the data pre-estimation.
On both the plus and the negative side: this could work for all models in ZeligChoice except for the bivariate logit/probit. Though, using Formula, we might be able to reimplement the latter models in a unified syntax, with perhaps measures taken for backwards compatibility (Formula allows us to specify y1 + y2 ~ x1, but the current implementation is cbind(y1, y2) ~ x1).
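As a rough illustration of what a backwards-compatibility measure could look like, here is a base-R sketch that rewrites a `y1 + y2 ~ x1` style formula into the `cbind(y1, y2) ~ x1` form and leaves everything else alone. `normalize_lhs` is a hypothetical helper, this is not how the Formula package itself handles multiple responses, and it assumes the left-hand side is a plain sum of variable names (no transformations).

```r
# Hypothetical shim: accept the "y1 + y2 ~ x" style and hand back
# the cbind() style that the existing bivariate models expect.
normalize_lhs <- function(f) {
  stopifnot(inherits(f, "formula"), length(f) == 3)
  lhs <- f[[2]]
  is_sum <- is.call(lhs) && identical(lhs[[1]], as.name("+"))
  if (!is_sum) return(f)  # cbind(...), Surv(...), single response: untouched
  responses <- all.vars(lhs)  # assumes bare variable names on the LHS
  f[[2]] <- as.call(c(list(as.name("cbind")), lapply(responses, as.name)))
  f
}

normalize_lhs(y1 + y2 ~ x1)
normalize_lhs(cbind(y1, y2) ~ x1)
```

Because anything that is not a top-level sum is returned unchanged, existing cbind() and Surv() calls would keep working as they do now.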
Just as a point of information, in terms of ever thinking about standardizing, the EI models use the cbind(y1, y2) ~ notation currently, while the survival models use Surv(y, c) ~ (for outcome y, censored indicator c). Both of those, I believe, were adhering to what was used in Zelig 3.5.
Thanks for pointing those out!
For survival models, maybe we should keep the Surv(...) syntax, as this matches the syntax in the survival package and really has a different set of needs than what Formula addresses.
From the end of Max Kuhn's post on the bad parts of the R formula method:
Some limitations of the current formula interface can be mitigated by writing your own or utilizing the Formula package.
However, there are a number of conceptual aspects (e.g., roles, sequential processing) that would require a completely different approach to defining a design matrix, and this will be the focus of an upcoming tidyverse package.
The issue also affects bivariate probit. See: https://groups.google.com/forum/#!topic/zelig-statistical-software/n5CQnXeQvAM
Reported by John Jackson via email:
Running a multinomial logit and specifying different variable lists for some categories returns the following error message:
"Error in formula.default(object, env = baseenv()) : invalid formula"
If the list of right-hand-side variables is not varied, it works fine.