ajdamico opened this issue 7 years ago
@guilhermejacob could you:
(1) add your code here with maybe two easily-runnable test cases
(2) describe what you've done to djalma and see if he agrees with your math
(3) integrate your code into https://github.com/ajdamico/lodown/blob/master/R/survey_functions.R
(4) decide whether to propose it to lumley as an addition to library(mitools)
Ok, I was able to replicate this example (I don't think it's great, but it works). There are a lot of caveats to this method.
And I don't know how to integrate it so that it will run on all Wald statistics. Anyway, I think it's reasonably good.
So, answering the points in your previous comments:
(1) If you want to give it a try, run this:
# our test function: pools chi-square statistics across imputations
# using the D_2 statistic of Li, Meng, Raghunathan & Rubin (1991)
test_fun <- function (results, variances, call = sys.call(), df.complete = Inf, ...) {
  m <- length(results)
  if ( m != 3 ) { warning( "This test was designed for m = 3. Use it cautiously." ) }
  oldcall <- attr(results, "call")
  # numerator degrees of freedom, taken from the first imputation's chi-square test
  ndf <- results[[1]]$parameter[[1]]
  if (missing(variances)) {
    # keep only the chi-square statistics and their square roots
    results <- lapply(results, function(x) x$statistic )
    variances <- sqrt( unlist(results) )
  }
  # average chi-square statistic across the m imputations
  cbar <- results[[1]]
  for (i in 2:m) {
    cbar <- cbar + results[[i]]
  }
  cbar <- cbar/m
  # relative increase in variance, from the sample variance of the square roots
  evar <- var(variances)
  r <- (1 + 1/m) * evar
  # D_2 statistic and its denominator degrees of freedom
  D_2 <- ( cbar/ndf - r * ( m + 1 ) / ( m - 1) ) / ( 1 + r )
  v3 <- (ndf^(-3/m)) * (m-1) * ( 1 + 1/r)^2
  # upper-tail p-value from the F reference distribution
  pval <- stats::pf( D_2 , df1 = ndf , df2 = v3 , lower.tail = FALSE )[[1]]
  rval <- list(statistic = D_2 , ndf = ndf , ddf = v3 , p.value = pval )
  warning( "The real p-value could be half or twice the value presented below." )
  rval
}
library(mitools)
library(survey)
imp1 <- data.frame( col1 = sample( 1:5 , 100 , replace = TRUE ) , col2 = sample( 1:5 , 100 , replace = TRUE ) )
imp2 <- data.frame( col1 = sample( 1:5 , 100 , replace = TRUE ) , col2 = sample( 1:5 , 100 , replace = TRUE ) )
imp.design <- svydesign( ~1 , data = imputationList( list( imp1 , imp2 ) ) )
# imp.design <- as.svrepdesign( imp.design , replicates = 100 )
# svytotal works fine
MIcombine( with( imp.design , svytotal( ~col1 ) ) )
# svymean works fine
MIcombine( with( imp.design , svymean( ~col1 ) ) )
this_result <- with( imp.design , svychisq( ~col1+col2 , statistic = "Chisq" ) )
# svychisq fails inside of mitools:::MIcombine.default
MIcombine( this_result )
results <- with( imp.design , svychisq( ~col1+col2 ) )
test_fun( results = results )
(2) It takes the resulting statistics, calculates the variance of the statistics' square roots across the m imputations, then calculates the statistic in Li et al. (1991). From that we get the p-value using the degrees of freedom from the chi-square test and the formula's denominator degrees of freedom, as sketched just below.
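For reference, here is my reading of the formulas as implemented in test_fun above, where d_j is the chi-square statistic from imputation j and k its degrees of freedom; this is a sketch of what the code does, so please check it against Li et al. (1991) before relying on it:
\bar d = \frac{1}{m}\sum_{j=1}^{m} d_j, \qquad \hat r = \left(1+\frac{1}{m}\right)\widehat{\mathrm{Var}}\!\left(\sqrt{d_j}\right)
D_2 = \frac{\bar d / k - \hat r\,(m+1)/(m-1)}{1+\hat r}, \qquad \nu_3 = k^{-3/m}(m-1)\left(1+\frac{1}{\hat r}\right)^2
p = \Pr\{ F_{k,\nu_3} > D_2 \}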
(3) We have to think about that. The only real improvement of this function is minimal: it gets the degrees of freedom from the chi-square test. If you just supply it with the test results, you could use another package, like miceadds, to do it.
(4) Discussion ahead.
The test was meant for cases when m = 3. So, we have the following warnings:
The pooled Chi^2-test can be used when k is large, if U and B cannot be retrieved, or if only Chi^2-statistics are available. Compared to the other three methods, however, the results from the Chi^2-test are considerably less reliable. The results were optimized for m = 3 and, unlike the other tests, do not necessarily improve for larger m. According to Li et al. (1991a) the true result could be within a range of one half to twice the obtained p-value. This test should only be used as a rough guide. - Van Buuren (2012, p.159).
and
Li, Meng, et al. (1991) used Monte Carlo simulations to study the performance of D2 statistic under a variety of conditions. Their results suggest that type I error rates can either be too high or too low, depending on the fraction of missing information (e.g., when the fraction of missing information was less than 20%, type I errors dropped below the nominal 0.05 level). Their simulations also indicate that D2 has lower power than D1. Considered as a whole, these simulation results suggest that D2 does not yield accurate inferences, and the authors recommend using the procedure “primarily as a screening test statistic” (p. 83). - Enders (2010, p.240).
The original article can be found here: Li et al. (1991).
The following step is not clear to me:
variances <- sqrt( unlist(results))
Could you please cite the formula in Li et al. (1991) that you are using to combine the results?
I believe a better way of testing independence in a two-way table could be to fit a loglinear model using the function svyloglin and testing whether the interaction effect is null. It might be possible to use MIcombine for that.
I think it's equation 2.2 on page 74. I'm taking the square roots of every imputed result. The part inside the square brackets in 2.2 is the sample variance of the square roots of the results. Makes sense, right?
Are you using the formula in https://stats.stackexchange.com/questions/78479/how-to-run-chi-squared-test-on-imputed-data with the correction?
Where is the distribution of the test statistic in 2.1 defined? What are the degrees of freedom in the numerator and in the denominator of the F distribution?
Yes, I'm using the formula with the correction. It's also the same in the books above.
The numerator degrees of freedom k are those from the chi-square test (as in p. 67 of Li et al., 1991). The denominator degrees of freedom are defined in equations 2.16 and 2.17 of Li et al. (1991).
You mentioned that the test was meant for the cases when m = 3?
In your example k = (c-1) x (r-1) = 16; does the value of k have no influence on the applicability?
It doesn't seem to be a problem in the books I read, so I don't think it's a problem. Section 2.4 in the article might give some additional information about this.
It is good that you believe in your books... Multiple imputation is a bit mysterious and magical to me... For k = 16, if the missingness fraction is high, m = 3 sounds like magic!
The test statistic I got by applying the function test_fun in the example is -0.5042679, which is strange, since the reference distribution is F, whose support is the positive real line. Perhaps you should truncate the value when it is negative.
In the example you used the parameter statistic = "Chisq". Is it valid to use statistic = "Wald"?
The chi-square test for complex surveys has to be corrected to be valid. The Wald test is more natural because its definition already uses the covariance matrix estimated from the sample design.
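For concreteness, a minimal sketch of how the two statistics could be compared on a single completed dataset, reusing imp1 from the example above; this only illustrates the options svychisq already offers and is not part of the proposed pooling function:
library(survey)
# build a trivial design from one completed dataset
one.design <- svydesign( ~1 , data = imp1 )
# Rao-Scott adjusted Pearson chi-square
svychisq( ~col1 + col2 , one.design , statistic = "Chisq" )
# Wald test based on the design-based covariance matrix
svychisq( ~col1 + col2 , one.design , statistic = "Wald" )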
Sometimes it's a matter of faith, haha. It should work with Wald too. But X^2 uses the Rao-Scott adjustment, if that's what you're asking.
I just said I'd rather use Wald's test!
@DjalmaPessoa , you are absolutely right.
After rereading your comments, I decided to start from scratch.
Instead of writing it myself, I'll just borrow the miceadds::micombine.chisquare function, passing just two arguments: the svychisq statistics from each imputation and the chi-square degrees of freedom.
(1) add your code here with maybe two easily-runnable test cases
# pool svychisq results across imputations via miceadds::micombine.chisquare
MIsvychisq <- function( formula , design , statistic = "Chisq" , ... ) {
  if ( !( statistic %in% c( "Chisq" ) ) ) { stop( "This method is only implemented for `statistic = 'Chisq'`." ) }
  # run the chi-square test on each imputation
  m <- with( design , svychisq( formula , statistic = statistic ) )
  # extract the statistic and the degrees of freedom from each test
  dk <- as.numeric( sapply( m , FUN = function( x ) x[["statistic"]] ) )
  df <- as.numeric( sapply( m , FUN = function( x ) x[["parameter"]][ "df" ] ) )
  # combine with the D_2 pooling rule
  return( miceadds::micombine.chisquare( dk = dk , df = df[[1]] , display = TRUE , version = 1 ) )
}
library(mitools)
library(survey)
imp1 <- data.frame( col1 = sample( 1:5 , 100 , replace = TRUE ) , col2 = sample( 1:5 , 100 , replace = TRUE ) )
imp2 <- data.frame( col1 = sample( 1:5 , 100 , replace = TRUE ) , col2 = sample( 1:5 , 100 , replace = TRUE ) )
imp3 <- data.frame( col1 = sample( 1:5 , 100 , replace = TRUE ) , col2 = sample( 1:5 , 100 , replace = TRUE ) )
imp.design <- svydesign( ~1 , data = imputationList( list( imp1 , imp2 , imp3 ) ) )
MIsvychisq( formula = ~col1+col2 , imp.design )
It defaults to statistic = 'Chisq', as I'm not sure how it works with F-statistics.
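For reference, the same pooling can be reproduced by hand (assuming imp.design from the example above); this is essentially all that MIsvychisq does internally:
# run the chi-square test on each imputation, then pool the statistics
each_test <- with( imp.design , svychisq( ~col1 + col2 , statistic = "Chisq" ) )
dk <- sapply( each_test , function( x ) unname( x$statistic ) )
df <- each_test[[1]]$parameter[[ "df" ]]
miceadds::micombine.chisquare( dk = dk , df = df )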
(2) describe what you've done to djalma and see if he agrees with your math
Well, there's no math on my part here.
(3) integrate your code into https://github.com/ajdamico/lodown/blob/master/R/survey_functions.R
Will do.
(4) decide whether to propose it to lumley as an addition to library(mitools)
This function is too small to be a meaningful addition to survey.
For pooled F-statistics, this should help: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4029775/pdf/nihms560910.pdf
Example using Lumley's suggestion:
library(mitools)
data.dir <- system.file("dta", package = "mitools")
files.men <- list.files(data.dir, pattern = "m.\\.dta$", full = TRUE)
men <- imputationList(lapply(files.men, foreign::read.dta))
files.women <- list.files(data.dir, pattern = "f.\\.dta$", full = TRUE)
women <- imputationList(lapply(files.women, foreign::read.dta))
men <- update(men, sex = 1)
women <- update(women, sex = 0)
all <- rbind(men, women)
library(survey)
designs <- svydesign(id = ~id, strata = ~sex, data = all)
results_loglinear <- with(designs, svyloglin(~sex*alcdos))
MIcombine(results_loglinear)
But if we want to test whether the interaction effect between sex and alcdos is null, we need to get the p-values for the combined object. This would correspond to the test using svychisq with the Rao-Scott correction.
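One rough way to get such a p-value from the combined object could be a Wald test on the pooled interaction coefficients. This is only a sketch built on results_loglinear from the example above: it ignores the degrees-of-freedom adjustments for multiple imputation (the pooled-F reference above covers that properly), and it assumes the interaction coefficients can be picked out by ':' in their names:
# Wald test of the pooled sex:alcdos interaction coefficients
pooled <- MIcombine( results_loglinear )
b <- coef( pooled )
V <- vcov( pooled )
int <- grep( ":" , names( b ) )   # positions of the interaction terms
W <- t( b[ int ] ) %*% solve( V[ int , int ] ) %*% b[ int ]
pchisq( as.numeric( W ) , df = length( int ) , lower.tail = FALSE )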
https://stats.stackexchange.com/questions/78479/how-to-run-chi-squared-test-on-imputed-data
i am not totally sure how to do it (one commenter says there's a mistake). mitools:::MIcombine.default does not properly deal with a list of htest objects (the this_result object below). if you think it's something you can implement, we can temporarily put it in lodown (which has lots of other dataset-specific helper functions, like pnad_postStratify) and ask dr. lumley if he would prefer for it to go in library(mitools)? currently have two others implemented at: https://github.com/ajdamico/lodown/blob/master/R/survey_functions.R
thanks