Closed ajdamico closed 8 years ago
To make more clear what I mean, look for in all the functions of the library convey the design object used in the function svyrecvar. Only the functions: svyarpr svyartpt svypoormed svyrmpg involving the threshold arpt (based on the whole population) use the object full_design. The se estimates for these functions match the estimates from vardpoor.
All the other functions use design which is the subset design in the case of domain estimates.
In this case, at least for the functions contained in vardpoor, the results do not match.
The vardpoor library works all the time, by using indicators, only with variables having the full length and the full_design, even for domain estimation.
Note that this is issue is not relevant for svrep designs. By comparing se linearized estimates generated by convey and by vardpoor with those generated by the svrep design, we see that convey estimates are much closer.
I believe it would also be important to get examples from other sources to compare, such as published results that we would try to replicate.
I don't agree with the suggestion of including the two alternatives in each function, because certainly one of them must be wrong! So far I was not able to spot which one it is.
Here is a reproducible script to compare vardpoor and survey to estimate qsr:
library(convey)
library(vardpoor)
data(eusilc) ; names( eusilc ) <- tolower( names( eusilc ) )
library(survey)
des_eusilc <- svydesign(ids = ~rb030, strata =~db040, weights = ~rb050, data = eusilc)
dati <- data.table(IDd=1:nrow(eusilc),eusilc)
## estimate qsr for the whole population using library vardpoor
dd <- linqsr(Y="eqincome", id="IDd", weight="rb050",Dom=NULL, dataset= dati, alpha=20)
## estimate qsr for the whole population using library convey
ddcon<-svyqsr( ~eqincome , design = des_eusilc, upper_tot = TRUE, lower_tot = TRUE )
# compare the linearized variable obtained by the two libraries
all.equal(attr(ddcon,"lin"), dd$lin$lin_qsr)
## Estimate qsr for domain using the library vardpoor
dd.dom <- linqsr(Y="eqincome", id="IDd", weight="rb050", Dom="db040", dataset= dati, alpha=20)
## the linearized qsr variable for the domain "Tyrol" using vardpoor
linvardpoor <- dd.dom$lin$lin_qsr__db040.Tyrol
## estimate qsr for a the domain "Tyrol" using conevy
ddconTyrol <- svyqsr( ~eqincome, ~db040, design =subset( des_eusilc,db040=="Tyrol" ),
alpha=.20)
## the linearized qsr variable for the domain "Tyrol" using convey
lindom.conv <-attr(ddconTyrol, "lin")
## function to estimate the se of the total of a vector t using design des:
SE_lin2 <- function(t,des){
variance<-survey::svyrecvar(t/des$prob, des$cluster,des$strata, des$fpc,
postStrata = des$postStrata)
sqrt(variance)
}
# se esimate of qsr for domain Tyrol using vardpoor
SE_lin2(linvardpoor,des_eusilc )
# se esimate of qsr for domain Tyrol using vardpoor
SE_lin2(lindom.conv, subset( des_eusilc,db040=="Tyrol" ) )
# these are different!
Sorry I meant "Here is a reproducible script to compare vardpoor and convey to estimate qsr:"
hi djalma, i believe choosing between full vs subset depends on what you find in the texts you are attempting to match. it would be great if you can also provide the vardpoor people with a reproducible example that matches their output but does not match textbook published docs, since they have been helpful for us
note that this choice might be similar to the
options("survey.replicates.mse")
in the sense that it defaults to the setting that the author deems most appropriate but can be very easily toggled on/off to match other publications. thomas lumley defaults mse=FALSE but the us census bureau always uses mse=TRUE
To understand better what is done in vardpoor to estimate se, consider: library(convey) library(vardpoor) data(eusilc) ; names( eusilc ) <- tolower( names( eusilc ) ) library(survey)
compute the se estimate of the qsr for the domain Tyrol using vardpoor, use the function linQSRCalc
vardpoor:::linQSRCalc(eusilc$eqincome,ids = 1:nrow(eusilc), weights=eusilc$rb050, ind = (eusilc$db040=="Tyrol")*1, alpha = 20) -> est_Tyrol_vardpoor
Now create a new design with weights 0 where db040 =="Tyrol"
des_Tyrol <- svydesign(ids = ~rb030, strata =~db040,
weights = ~I(rb050* (eusilc$db040=="Tyrol")), data = eusilc)
function to compute the se estimate
SE_lin2 <- function(t,des){ variance<-survey::svyrecvar(t/des$prob, des$cluster,des$strata, des$fpc, postStrata = des$postStrata) sqrt(variance) }
compute se for vardpoor: SE_lin2(est_Tyrol_vardpoor$lin$lin, des_eusilc)
compute for convey SE_lin2(attr(lin_w0_Tyrol,"lin"), des_eusilc)
get the same results using the original design!
note also that:
svyqsr( ~eqincome , design = des_Tyrol )
and
svyqsr(~eqincome, subset(des_eusilc, db040=="Tyrol"))
give the same results
thanks for this code! while you continue deciding which way convey
should behave by default, i will attempt a well-documented comparison example between convey
and vardpoor
that makes it easy for users to understand the difference
I'll be also working in examples starting from very basic estimators: total, ratio, quantile,... to see the difference. Tomorrow I'll be meeting experts from IBGE that could help to clarify this question.
comparing se estimates for the domain "Tyrol"
library(convey)
library(vardpoor)
data(eusilc) ; names( eusilc ) <- tolower( names( eusilc ) )
library(survey)
function to estimate the se
SE_lin2 <- function(t,des){
variance<-survey::svyrecvar(t/des$prob, des$cluster,des$strata, des$fpc,
postStrata = des$postStrata)
sqrt(variance)
}
des_eusilc <- svydesign(ids = ~rb030, strata =~db040, weights = ~rb050, data = eusilc)
des_eusilc_Tyrol <- subset(des_eusilc, db040=="Tyrol")
total
1. convey
incvar <- des_eusilc_Tyrol$variables$eqincome
lin.tot.conv <- incvar
SE_lin2 (lin.tot.conv, des_eusilc_Tyrol)
2 . vardpoor
ncom<- names(des_eusilc$prob)
ind <- names(des_eusilc_Tyrol$prob)
ID <- rep(1, nrow(des_eusilc))* (ncom %in% ind)
incvec <- des_eusilc$variables$eqincome
lin.tot.vardpoor <- incvec*ID
SE_lin2 (lin.tot.vardpoor, des_eusilc)
check using survey
SE(svytotal(~eqincome, des_eusilc_Tyrol))
mean for domain or ratio:
# 1. convey
num <- incvar
den <- rep(1, nrow(des_eusilc_Tyrol))
Y <- sum(num* weights(des_eusilc_Tyrol))
X <- sum(den* weights(des_eusilc_Tyrol))
R <- Y/X
linearized variable
lin.ratio.conv <- (1/X)* (num-R*den )
SE_lin2 (lin.ratio.conv, des_eusilc_Tyrol)
2. vardpoor
num <- incvec*ID
den <- rep(1, nrow(des_eusilc))*ID
Y <- sum(num* weights(des_eusilc))
X <- sum(den* weights(des_eusilc))
R <- Y/X
linearized variable
lin.ratio.vardpoor <- (1/X)* (num-R*den )
SE_lin2 (lin.ratio.vardpoor, des_eusilc)
check by survey
SE(svymean(~eqincome,des_eusilc_Tyrol ))
median
convey
alpha <- .5
h <- h_fun(incvar, weights(des_eusilc_Tyrol))
q_alpha <- survey::svyquantile(~eqincome, des_eusilc_Tyrol, .5,
method = "constant")
q_alpha <- as.vector(q_alpha)
Fprime <- densfun( ~eqincome , des_eusilc_Tyrol, q_alpha,h, fun = "F")
N <- sum ( weights(des_eusilc_Tyrol))
lin.quant.conv <- -(1/(N * Fprime)) * ((incvar <= q_alpha) - alpha)
se estimate
SE_lin2 (lin.quant.conv, des_eusilc_Tyrol)
vardpoor
lin.quant.vardpoor <- -(1/(N * Fprime)) *ID* ((incvec <= q_alpha) - alpha)
SE_lin2 (lin.quant.vardpoor, des_eusilc)
Remarks: in all examples the results match. In these particular case, the linearized variable for vardpoor the ID always enter as a factor. There are formulas for the linearized variable in vardpoor where this is not the case. When this hapen, the se estimates are different!
djalma-- you should put this in a vignette so that we don't lose this great work!
On Mon, Jun 6, 2016 at 9:25 AM, DjalmaPessoa notifications@github.com wrote:
comparing se estimates for the domain "Tyrol"
library(convey) library(vardpoor) data(eusilc) ; names( eusilc ) <- tolower( names( eusilc ) ) library(survey)
function to estimate the se SE_lin2 <- function(t,des){ variance<-survey::svyrecvar(t/des$prob, des$cluster,des$strata, des$fpc, postStrata = des$postStrata) sqrt(variance) }
des_eusilc <- svydesign(ids = ~rb030, strata =~db040, weights = ~rb050, data = eusilc)
des_eusilc_Tyrol <- subset(des_eusilc, db040=="Tyrol")
total
convey incvar <- des_eusilc_Tyrol$variables$eqincome lin.tot.conv <- incvar SE_lin2 (lin.tot.conv, des_eusilc_Tyrol)
2 . vardpoor ncom<- names(des_eusilc$prob) ind <- names(des_eusilc_Tyrol$prob) ID <- rep(1, nrow(des_eusilc))* (ncom %in% ind) incvec <- des_eusilc$variables$eqincome lin.tot.vardpoor <- incvec*ID SE_lin2 (lin.tot.vardpoor, des_eusilc)
check using survey
SE(svytotal(~eqincome, des_eusilc_Tyrol))
mean for domain or ratio:
1. convey
num <- incvar den <- rep(1, nrow(des_eusilc_Tyrol)) Y <- sum(num* weights(des_eusilc_Tyrol)) X <- sum(den* weights(des_eusilc_Tyrol)) R <- Y/X linearized variable lin.ratio.conv <- (1/X)* (num-R*den ) SE_lin2 (lin.ratio.conv, des_eusilc_Tyrol)
- vardpoor
num <- incvec_ID den <- rep(1, nrow(des_eusilc))_ID Y <- sum(num* weights(des_eusilc)) X <- sum(den* weights(des_eusilc)) R <- Y/X linearized variable lin.ratio.vardpoor <- (1/X)* (num-R*den ) SE_lin2 (lin.ratio.vardpoor, des_eusilc)
check by survey
SE(svymean(~eqincome,des_eusilc_Tyrol ))
median
convey alpha <- .5 h <- h_fun(incvar, weights(des_eusilc_Tyrol)) q_alpha <- survey::svyquantile(~eqincome, des_eusilc_Tyrol, .5, method = "constant") q_alpha <- as.vector(q_alpha) Fprime <- densfun( ~eqincome , des_eusilc_Tyrol, q_alpha,h, fun = "F") N <- sum ( weights(des_eusilc_Tyrol)) lin.quant.conv <- -(1/(N * Fprime)) * ((incvar <= q_alpha) - alpha)
se estimate
SE_lin2 (lin.quant.conv, des_eusilc_Tyrol)
vardpoor
lin.quant.vardpoor <- -(1/(N * Fprime)) ID ((incvec <= q_alpha) - alpha)
SE_lin2 (lin.quant.vardpoor, des_eusilc)
Remarks: in all examples the results match. In these particular case, the linearized variable for vardpoor the ID always enter as a factor. There are formulas for the linearized variable in vardpoor where this is not the case. When this hapen, the se estimates are different!
— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub https://github.com/DjalmaPessoa/convey/issues/67#issuecomment-223957696, or mute the thread https://github.com/notifications/unsubscribe/AANO5xsfZPp-A678GBOFQKRhfpUoZd_fks5qJB_BgaJpZM4ItL4N .
the present situation is the following:
The functions in 1. have to use the full_design because they work with a threshold computed for the whole population.
It is imediate to change the functions in 2. to use the full_design. Should I do it? This would use the same style as functions in 1.
for the functions in 3, we need further investigation.
hi, so long as you can explain the differences with vardpoor, then you can set whichever defaults you think are best. we don't have to match vardpoor, we just have to convincingly show the technical reason for not matching.
note that you are putting a lot of (very useful) text in this github issues window -- this is all good information, so it belongs in the package help pages or in a vignette or in a readme. the replication script you sent (above) definitely should be in a vignette.. thanks
Shall we add to the existing vignette or create a new one?
new one, maybe called "comparisons"
On Tuesday, June 7, 2016, DjalmaPessoa notifications@github.com wrote:
Shall we add to the existing vignette or create a new one?
— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub https://github.com/DjalmaPessoa/convey/issues/67#issuecomment-224274125, or mute the thread https://github.com/notifications/unsubscribe/AANO54Hc2WAoFIEMzt_iddzjK5HjpY-Dks5qJWzBgaJpZM4ItL4N .
Yesterday I had a meeting with Pedro Silva, whom I consider to be the best expert in sampling design in Brazil. We together reviewed the formulas used in the scripts of both convey and vardpoor. The formulas are in Osier's paper for the whole population. and we derived the correponding formulas for domains. This was also done in Breidaks's paper, mentioned by me before. The scripts in vardpoor are based on Breidaks's paper.
By comparing the formulas we got for qsr domain estimation with the one in Breidaks's paper we noticed he missed the indicator of the domain as a factor in linearization. This originated the mistake implemented in vardpoor and explains the difference we found between convey and vadpoor. So I believe this solves our question.
Pedro Silva is willing to help us to test the functions in convey. He has a graduate student who is using PNAD microdata to estimate Theil indicator along the years. He asked me to include Theil indicator in the convey library. This is probably among the indicators added by Guilherme? I believe this partnership would be very useful for convey.
this all sounds great to me
On Wed, Jun 8, 2016 at 10:12 AM, DjalmaPessoa notifications@github.com wrote:
Yesterday I had a meeting with Pedro Silva, whom I consider to be the best expert in sampling design in Brazil. We together reviewed the formulas used in the scripts of both convey and vardpoor. The formulas are in Osier's paper for the whole population. and we derived the correponding formulas for domains. This was also done in Breidaks's paper, mentioned by me before. The scripts in vardpoor are based on Breidaks's paper.
By comparing the formulas we got for qsr domain estimation with the one in Breidaks's paper we noticed he missed the indicator of the domain as a factor in linearization. This originated the mistake implemented in vardpoor and explains the difference we found between convey and vadpoor. So I believe this solves our question.
Pedro Silva is willing to help us to test the functions in convey. He has a graduate student who is using PNAD microdata to estimate Theil indicator along the years. He asked me to include Theil indicator in the convey library. This is probably among the indicators added by Guilherme? I believe this partnership would be very useful for convey.
— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub https://github.com/DjalmaPessoa/convey/issues/67#issuecomment-224601615, or mute the thread https://github.com/notifications/unsubscribe/AANO52MSslBJDUynJJKIQ7X_Bi9SoVZCks5qJs2-gaJpZM4ItL4N .
Yes. The Theil index is the Generalized Entropy index (or the Rényi divergence measure) with epsilon = 1
. But all help is welcome!
hi djalma, why did you close this issue? i am re-opening (but maybe you have a good reason to close it?). i thought that discussion in the thread above is still unresolved and that you might still make edits to the computation based on conversations with IBGE? note that the computations do not need to be final until 1.0.0 so this should not slow down the CRAN release of 0.1.0
I'm entirely convinced that we just need to use the full_design when we work with a global poverty line. For the other cases, to work with the full design and use the domain indicator is equivalent to working with the subset design. Vardpoor always uses the full_design and when computing domain estimates introduces the domain indicator. Convey does that using svyby (subset design). In the case of global poverty threshold convey uses the full_design. So, I believe this issue should be closed.
On Sat, Jun 18, 2016 at 11:29 PM, Anthony Damico notifications@github.com wrote:
Reopened #67 https://github.com/DjalmaPessoa/convey/issues/67.
— You are receiving this because you modified the open/close state. Reply to this email directly, view it on GitHub https://github.com/DjalmaPessoa/convey/issues/67#event-696860083, or mute the thread https://github.com/notifications/unsubscribe/AFD-pyvwEA1n-6ADAs1VChFSl7c01y5Fks5qNKmGgaJpZM4ItL4N .
great :)
RE paragraph below, if you are not sure which computation to use, then please add a parameter to all functions (not just svygini) that uses full_design or just the subset depending on whether the user specifies TRUE or FALSE for the function parameter?
please also include as much discussion in the ?help for the theoretical basis for one choice vs another?
After that, I decided to compute variances using the object "full_design" only for indicators involving the threshold apt which is computed for the whole population. For the other indicator it is used "design", which for domains is a subset design. Now for these indicators I'm getting values of the "se" quite different from vardpoor.