ajdamico / convey

variance of distribution measures estimation of survey data
GNU General Public License v3.0
17 stars 7 forks source link

determine whether to compute variances off of full vs subsetted design #67

Closed ajdamico closed 8 years ago

ajdamico commented 8 years ago

RE paragraph below, if you are not sure which computation to use, then please add a parameter to all functions (not just svygini) that uses full_design or just the subset depending on whether the user specifies TRUE or FALSE for the function parameter?

please also include as much discussion in the ?help for the theoretical basis for one choice vs another?

After that, I decided to compute variances using the object "full_design" only for indicators involving the threshold apt which is computed for the whole population. For the other indicator it is used "design", which for domains is a subset design. Now for these indicators I'm getting values of the "se" quite different from vardpoor.

DjalmaPessoa commented 8 years ago

To make more clear what I mean, look for in all the functions of the library convey the design object used in the function svyrecvar. Only the functions: svyarpr svyartpt svypoormed svyrmpg involving the threshold arpt (based on the whole population) use the object full_design. The se estimates for these functions match the estimates from vardpoor.

All the other functions use design which is the subset design in the case of domain estimates.

In this case, at least for the functions contained in vardpoor, the results do not match.

The vardpoor library works all the time, by using indicators, only with variables having the full length and the full_design, even for domain estimation.

Note that this is issue is not relevant for svrep designs. By comparing se linearized estimates generated by convey and by vardpoor with those generated by the svrep design, we see that convey estimates are much closer.

I believe it would also be important to get examples from other sources to compare, such as published results that we would try to replicate.

I don't agree with the suggestion of including the two alternatives in each function, because certainly one of them must be wrong! So far I was not able to spot which one it is.

DjalmaPessoa commented 8 years ago

Here is a reproducible script to compare vardpoor and survey to estimate qsr:

  library(convey)
  library(vardpoor)
  data(eusilc) ; names( eusilc ) <- tolower( names( eusilc ) )
  library(survey)

  des_eusilc <- svydesign(ids = ~rb030, strata =~db040,  weights = ~rb050, data = eusilc)

  dati <- data.table(IDd=1:nrow(eusilc),eusilc)

  ## estimate qsr for the whole population using library vardpoor

  dd <- linqsr(Y="eqincome", id="IDd", weight="rb050",Dom=NULL, dataset= dati, alpha=20)

  ## estimate qsr for the whole population using library convey
  ddcon<-svyqsr( ~eqincome , design = des_eusilc, upper_tot = TRUE, lower_tot = TRUE )

  # compare the linearized variable obtained by the two libraries
  all.equal(attr(ddcon,"lin"), dd$lin$lin_qsr)

  ## Estimate qsr for domain using the library vardpoor

  dd.dom <- linqsr(Y="eqincome", id="IDd", weight="rb050",   Dom="db040", dataset= dati, alpha=20)

  ## the linearized qsr variable for the domain "Tyrol" using vardpoor

  linvardpoor <- dd.dom$lin$lin_qsr__db040.Tyrol

  ## estimate qsr for a the domain "Tyrol" using conevy

  ddconTyrol <- svyqsr( ~eqincome, ~db040, design =subset( des_eusilc,db040=="Tyrol" ),
    alpha=.20)

  ## the linearized qsr variable for the domain "Tyrol" using convey

  lindom.conv <-attr(ddconTyrol, "lin")

  ## function to estimate the  se of the total of a vector t using design des:

  SE_lin2 <- function(t,des){
    variance<-survey::svyrecvar(t/des$prob, des$cluster,des$strata, des$fpc,
    postStrata = des$postStrata)
    sqrt(variance)
  }

  # se esimate of qsr for domain Tyrol using vardpoor

  SE_lin2(linvardpoor,des_eusilc )

  # se esimate of qsr for domain Tyrol using vardpoor

  SE_lin2(lindom.conv, subset( des_eusilc,db040=="Tyrol" ) )

  # these are different!
DjalmaPessoa commented 8 years ago

Sorry I meant "Here is a reproducible script to compare vardpoor and convey to estimate qsr:"

ajdamico commented 8 years ago

hi djalma, i believe choosing between full vs subset depends on what you find in the texts you are attempting to match. it would be great if you can also provide the vardpoor people with a reproducible example that matches their output but does not match textbook published docs, since they have been helpful for us

ajdamico commented 8 years ago

note that this choice might be similar to the

options("survey.replicates.mse")

in the sense that it defaults to the setting that the author deems most appropriate but can be very easily toggled on/off to match other publications. thomas lumley defaults mse=FALSE but the us census bureau always uses mse=TRUE

DjalmaPessoa commented 8 years ago

To understand better what is done in vardpoor to estimate se, consider: library(convey) library(vardpoor) data(eusilc) ; names( eusilc ) <- tolower( names( eusilc ) ) library(survey)

compute the se estimate of the qsr for the domain Tyrol using vardpoor, use the function linQSRCalc

vardpoor:::linQSRCalc(eusilc$eqincome,ids = 1:nrow(eusilc), weights=eusilc$rb050, ind = (eusilc$db040=="Tyrol")*1, alpha = 20) -> est_Tyrol_vardpoor

Now create a new design with weights 0 where db040 =="Tyrol"

des_Tyrol <- svydesign(ids = ~rb030, strata =~db040,
weights = ~I(rb050* (eusilc$db040=="Tyrol")), data = eusilc)

function to compute the se estimate

SE_lin2 <- function(t,des){ variance<-survey::svyrecvar(t/des$prob, des$cluster,des$strata, des$fpc, postStrata = des$postStrata) sqrt(variance) }

compute se for vardpoor: SE_lin2(est_Tyrol_vardpoor$lin$lin, des_eusilc)

compute for convey SE_lin2(attr(lin_w0_Tyrol,"lin"), des_eusilc)

get the same results using the original design!

note also that:

svyqsr( ~eqincome , design = des_Tyrol )

and

svyqsr(~eqincome, subset(des_eusilc, db040=="Tyrol"))

give the same results

ajdamico commented 8 years ago

thanks for this code! while you continue deciding which way convey should behave by default, i will attempt a well-documented comparison example between convey and vardpoor that makes it easy for users to understand the difference

DjalmaPessoa commented 8 years ago

I'll be also working in examples starting from very basic estimators: total, ratio, quantile,... to see the difference. Tomorrow I'll be meeting experts from IBGE that could help to clarify this question.

DjalmaPessoa commented 8 years ago
   comparing se estimates for the domain "Tyrol"

  library(convey)
  library(vardpoor)
  data(eusilc) ; names( eusilc ) <- tolower( names( eusilc ) )
  library(survey)

  function to estimate the se
  SE_lin2 <- function(t,des){
    variance<-survey::svyrecvar(t/des$prob, des$cluster,des$strata, des$fpc,
      postStrata = des$postStrata)
    sqrt(variance)
  }

  des_eusilc <- svydesign(ids = ~rb030, strata =~db040,  weights = ~rb050, data = eusilc)

  des_eusilc_Tyrol <- subset(des_eusilc, db040=="Tyrol")

   total

  1. convey
  incvar <- des_eusilc_Tyrol$variables$eqincome
  lin.tot.conv <- incvar
  SE_lin2 (lin.tot.conv, des_eusilc_Tyrol)

   2 . vardpoor
  ncom<- names(des_eusilc$prob)
  ind <- names(des_eusilc_Tyrol$prob)
  ID <- rep(1, nrow(des_eusilc))* (ncom %in% ind)
  incvec <- des_eusilc$variables$eqincome
  lin.tot.vardpoor <- incvec*ID
  SE_lin2 (lin.tot.vardpoor, des_eusilc)

   check using survey

  SE(svytotal(~eqincome, des_eusilc_Tyrol))

  mean for domain or ratio:

  # 1. convey
  num <- incvar
  den <- rep(1, nrow(des_eusilc_Tyrol))
  Y <- sum(num* weights(des_eusilc_Tyrol))
  X <- sum(den* weights(des_eusilc_Tyrol))
  R <- Y/X
   linearized variable
  lin.ratio.conv <- (1/X)* (num-R*den )
  SE_lin2 (lin.ratio.conv, des_eusilc_Tyrol)

   2. vardpoor

  num <- incvec*ID
  den <- rep(1, nrow(des_eusilc))*ID
  Y <- sum(num* weights(des_eusilc))
  X <- sum(den* weights(des_eusilc))
  R <- Y/X
   linearized variable
  lin.ratio.vardpoor <- (1/X)* (num-R*den )
  SE_lin2 (lin.ratio.vardpoor, des_eusilc)

  check by survey

  SE(svymean(~eqincome,des_eusilc_Tyrol ))

  median

  convey
  alpha <- .5
  h <- h_fun(incvar, weights(des_eusilc_Tyrol))
  q_alpha <- survey::svyquantile(~eqincome, des_eusilc_Tyrol, .5,
    method = "constant")
  q_alpha <- as.vector(q_alpha)
  Fprime <- densfun( ~eqincome , des_eusilc_Tyrol, q_alpha,h, fun = "F")
  N <- sum ( weights(des_eusilc_Tyrol))
  lin.quant.conv <-  -(1/(N * Fprime)) * ((incvar <= q_alpha) - alpha)

  se estimate

  SE_lin2 (lin.quant.conv, des_eusilc_Tyrol)

  vardpoor

  lin.quant.vardpoor <-  -(1/(N * Fprime)) *ID* ((incvec <= q_alpha) - alpha)

  SE_lin2 (lin.quant.vardpoor, des_eusilc)

  Remarks: in all examples the results match. In these particular case, the linearized variable for vardpoor the ID always enter as a factor. There are formulas for the linearized variable in vardpoor where this is not the case. When this hapen, the se estimates are different!
ajdamico commented 8 years ago

djalma-- you should put this in a vignette so that we don't lose this great work!

On Mon, Jun 6, 2016 at 9:25 AM, DjalmaPessoa notifications@github.com wrote:

comparing se estimates for the domain "Tyrol"

library(convey) library(vardpoor) data(eusilc) ; names( eusilc ) <- tolower( names( eusilc ) ) library(survey)

function to estimate the se SE_lin2 <- function(t,des){ variance<-survey::svyrecvar(t/des$prob, des$cluster,des$strata, des$fpc, postStrata = des$postStrata) sqrt(variance) }

des_eusilc <- svydesign(ids = ~rb030, strata =~db040, weights = ~rb050, data = eusilc)

des_eusilc_Tyrol <- subset(des_eusilc, db040=="Tyrol")

total

  1. convey incvar <- des_eusilc_Tyrol$variables$eqincome lin.tot.conv <- incvar SE_lin2 (lin.tot.conv, des_eusilc_Tyrol)

    2 . vardpoor ncom<- names(des_eusilc$prob) ind <- names(des_eusilc_Tyrol$prob) ID <- rep(1, nrow(des_eusilc))* (ncom %in% ind) incvec <- des_eusilc$variables$eqincome lin.tot.vardpoor <- incvec*ID SE_lin2 (lin.tot.vardpoor, des_eusilc)

    check using survey

    SE(svytotal(~eqincome, des_eusilc_Tyrol))

    mean for domain or ratio:

    1. convey

    num <- incvar den <- rep(1, nrow(des_eusilc_Tyrol)) Y <- sum(num* weights(des_eusilc_Tyrol)) X <- sum(den* weights(des_eusilc_Tyrol)) R <- Y/X linearized variable lin.ratio.conv <- (1/X)* (num-R*den ) SE_lin2 (lin.ratio.conv, des_eusilc_Tyrol)

    1. vardpoor

    num <- incvec_ID den <- rep(1, nrow(des_eusilc))_ID Y <- sum(num* weights(des_eusilc)) X <- sum(den* weights(des_eusilc)) R <- Y/X linearized variable lin.ratio.vardpoor <- (1/X)* (num-R*den ) SE_lin2 (lin.ratio.vardpoor, des_eusilc)

    check by survey

    SE(svymean(~eqincome,des_eusilc_Tyrol ))

    median

    convey alpha <- .5 h <- h_fun(incvar, weights(des_eusilc_Tyrol)) q_alpha <- survey::svyquantile(~eqincome, des_eusilc_Tyrol, .5, method = "constant") q_alpha <- as.vector(q_alpha) Fprime <- densfun( ~eqincome , des_eusilc_Tyrol, q_alpha,h, fun = "F") N <- sum ( weights(des_eusilc_Tyrol)) lin.quant.conv <- -(1/(N * Fprime)) * ((incvar <= q_alpha) - alpha)

    se estimate

    SE_lin2 (lin.quant.conv, des_eusilc_Tyrol)

    vardpoor

    lin.quant.vardpoor <- -(1/(N * Fprime)) ID ((incvec <= q_alpha) - alpha)

    SE_lin2 (lin.quant.vardpoor, des_eusilc)

    Remarks: in all examples the results match. In these particular case, the linearized variable for vardpoor the ID always enter as a factor. There are formulas for the linearized variable in vardpoor where this is not the case. When this hapen, the se estimates are different!

— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub https://github.com/DjalmaPessoa/convey/issues/67#issuecomment-223957696, or mute the thread https://github.com/notifications/unsubscribe/AANO5xsfZPp-A678GBOFQKRhfpUoZd_fks5qJB_BgaJpZM4ItL4N .

DjalmaPessoa commented 8 years ago

the present situation is the following:

  1. the functions svyarpt, svyarpr, svypoormed and svyrmpg use the full_design and for domains use the indicator variable of the domain. The function complete is not used anymore in these functions;
  2. There are functions that use the subset design object and give the same results as vardpoor
  3. There are functions that use the subset design and give results diffrent from vardpoor.

The functions in 1. have to use the full_design because they work with a threshold computed for the whole population.

It is imediate to change the functions in 2. to use the full_design. Should I do it? This would use the same style as functions in 1.

for the functions in 3, we need further investigation.

ajdamico commented 8 years ago

hi, so long as you can explain the differences with vardpoor, then you can set whichever defaults you think are best. we don't have to match vardpoor, we just have to convincingly show the technical reason for not matching.

note that you are putting a lot of (very useful) text in this github issues window -- this is all good information, so it belongs in the package help pages or in a vignette or in a readme. the replication script you sent (above) definitely should be in a vignette.. thanks

DjalmaPessoa commented 8 years ago

Shall we add to the existing vignette or create a new one?

ajdamico commented 8 years ago

new one, maybe called "comparisons"

On Tuesday, June 7, 2016, DjalmaPessoa notifications@github.com wrote:

Shall we add to the existing vignette or create a new one?

— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub https://github.com/DjalmaPessoa/convey/issues/67#issuecomment-224274125, or mute the thread https://github.com/notifications/unsubscribe/AANO54Hc2WAoFIEMzt_iddzjK5HjpY-Dks5qJWzBgaJpZM4ItL4N .

DjalmaPessoa commented 8 years ago

Yesterday I had a meeting with Pedro Silva, whom I consider to be the best expert in sampling design in Brazil. We together reviewed the formulas used in the scripts of both convey and vardpoor. The formulas are in Osier's paper for the whole population. and we derived the correponding formulas for domains. This was also done in Breidaks's paper, mentioned by me before. The scripts in vardpoor are based on Breidaks's paper.

By comparing the formulas we got for qsr domain estimation with the one in Breidaks's paper we noticed he missed the indicator of the domain as a factor in linearization. This originated the mistake implemented in vardpoor and explains the difference we found between convey and vadpoor. So I believe this solves our question.

Pedro Silva is willing to help us to test the functions in convey. He has a graduate student who is using PNAD microdata to estimate Theil indicator along the years. He asked me to include Theil indicator in the convey library. This is probably among the indicators added by Guilherme? I believe this partnership would be very useful for convey.

ajdamico commented 8 years ago

this all sounds great to me

On Wed, Jun 8, 2016 at 10:12 AM, DjalmaPessoa notifications@github.com wrote:

Yesterday I had a meeting with Pedro Silva, whom I consider to be the best expert in sampling design in Brazil. We together reviewed the formulas used in the scripts of both convey and vardpoor. The formulas are in Osier's paper for the whole population. and we derived the correponding formulas for domains. This was also done in Breidaks's paper, mentioned by me before. The scripts in vardpoor are based on Breidaks's paper.

By comparing the formulas we got for qsr domain estimation with the one in Breidaks's paper we noticed he missed the indicator of the domain as a factor in linearization. This originated the mistake implemented in vardpoor and explains the difference we found between convey and vadpoor. So I believe this solves our question.

Pedro Silva is willing to help us to test the functions in convey. He has a graduate student who is using PNAD microdata to estimate Theil indicator along the years. He asked me to include Theil indicator in the convey library. This is probably among the indicators added by Guilherme? I believe this partnership would be very useful for convey.

— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub https://github.com/DjalmaPessoa/convey/issues/67#issuecomment-224601615, or mute the thread https://github.com/notifications/unsubscribe/AANO52MSslBJDUynJJKIQ7X_Bi9SoVZCks5qJs2-gaJpZM4ItL4N .

guilhermejacob commented 8 years ago

Yes. The Theil index is the Generalized Entropy index (or the Rényi divergence measure) with epsilon = 1. But all help is welcome!

ajdamico commented 8 years ago

hi djalma, why did you close this issue? i am re-opening (but maybe you have a good reason to close it?). i thought that discussion in the thread above is still unresolved and that you might still make edits to the computation based on conversations with IBGE? note that the computations do not need to be final until 1.0.0 so this should not slow down the CRAN release of 0.1.0

DjalmaPessoa commented 8 years ago

I'm entirely convinced that we just need to use the full_design when we work with a global poverty line. For the other cases, to work with the full design and use the domain indicator is equivalent to working with the subset design. Vardpoor always uses the full_design and when computing domain estimates introduces the domain indicator. Convey does that using svyby (subset design). In the case of global poverty threshold convey uses the full_design. So, I believe this issue should be closed.

On Sat, Jun 18, 2016 at 11:29 PM, Anthony Damico notifications@github.com wrote:

Reopened #67 https://github.com/DjalmaPessoa/convey/issues/67.

— You are receiving this because you modified the open/close state. Reply to this email directly, view it on GitHub https://github.com/DjalmaPessoa/convey/issues/67#event-696860083, or mute the thread https://github.com/notifications/unsubscribe/AFD-pyvwEA1n-6ADAs1VChFSl7c01y5Fks5qNKmGgaJpZM4ItL4N .

ajdamico commented 8 years ago

great :)