ajdamico / convey

variance of distribution measures estimation of survey data
GNU General Public License v3.0
17 stars 7 forks source link

tests comparing convey and vardpoor are not working #65

Closed DjalmaPessoa closed 7 years ago

DjalmaPessoa commented 8 years ago

some tests that are working, after lots of modifications introduced in convey.

Example: run

testthat::test_file("C:/Users/owner/Documents/GitHub/convey/tests/testthat/test-svyqsr.R")

the values of the se generated by convey and vardpoor are quite different.

computing the se by replicated weights design:

svyby(~eqincome, by = ~db040, design = des_eusilc_rep, FUN = svyqsr, deff = FALSE)

we get results much closer to the generated by convey

ajdamico commented 8 years ago

@DjalmaPessoa did this work previously? in case it's of any use, you can search the entire convey commit history by opening up the git powershell and typing these commands

http://stackoverflow.com/questions/1337320/how-to-grep-git-commit-diffs-or-contents-for-a-certain-word

DjalmaPessoa commented 8 years ago

I know how to change the functions in convey in order to make the results to match with vardpoor. But that's not the point.

In fact for the whole population they always match. The problem is when working with domains.

In the reference:

http://www.revistadestatistica.ro/wp-content/uploads/2015/04/RRS2_2015_A04.pdf

Juris shows the theory behind what is done in vardpoor. They use a mathematical expression that depends on the indicator of the domain ID. After that, they use the design for the whole population, full_design, to estimate the variance. There are other theoretical papers that refers to domain estimation for linearization of poverty and concentration measures and propose the same.

The point is: when in convey we use the subset design for the domain, the point estimates we get are the same as in vardpoor, but the linearized variable and the variance estimate are different .

This has to be checked for all the functions in convey, including for the recent functions added by Guilherme, that also use linearization.

Remember that at a certain point we reached complete aggreement with vardpoor, even for domains. I changed the functions after computing estimates for the PNAD and getting results for the cv's that people at IBGE found it were very odd.

We've got to figure it out which is the correct way to work for domain estimation! After that I will feel comfortable to include the library in CRAN.

ajdamico commented 8 years ago

changing this and one other issue to v0.1.0 then, but we should aim to put something on cran soon because you and guilherme have done a lot of work and because users might have useful comments

DjalmaPessoa commented 8 years ago

If we run testthat::test_file("C:/Users/owner/Documents/GitHub/convey/tests/testthat/test-svygini.R")

there a small difference between results from convey and vardpoor. This is not expected, since the functions seem quite similar. I'll check it out.

ajdamico commented 8 years ago

i have added two additional issues related to this so that we do not forget those checks -- but they need to be disabled to put version 0.1.0 on CRAN

ajdamico commented 8 years ago

i am unclear if this issue is related to issues #95 and #96 (referenced above and removed in the commit below) or if they are something else? thanks

https://github.com/DjalmaPessoa/convey/commit/5ebe70f84d1c384b8adec54e7e4f8e1864fa615b

ajdamico commented 8 years ago

@DjalmaPessoa now that vardpoor has been fixed, could you restore the comparisons between vardpoor and convey? i believe the expect_equal() calls should now all work? thanks

https://github.com/CSBLatvia/vardpoor/issues/4#issuecomment-228745404

DjalmaPessoa commented 8 years ago

There were two functions failing the comparisons tests: svyqsr and svygini. Yesterday I finally found out why the difference. There was a mistake in svygini from convey. We need to work with the sorted weights and income but when estimating the variance we need to use the original ordering. After I fixed this, the results from convey and vardpoor matched. The tests are now passing for svygini. It is important to check if Guilherme needed this kind of sorting in some function. Note that the original order has to be used in the function survey::svyrecvar to compute variance. As for svyqsr, the results are different because the linearization used in vardpoor seems to be wrong. Perhaps we should write to them about this.

The mentioned changes in vardpoor has no relation with this.

ajdamico commented 8 years ago

thanks djalma! i'm glad we are adding all of those checks. did you see that vardpoor recently released a fix of their software? the latest version is 0.7.0 but it is not finished building on cran (as of this moment, the windows binary is still the previous version) so in order to get 0.7.0 installed on your machine you should use

devtools::install_github("CSBLatvia/vardpoor")

it should be available on cran in the next few days, though.

cc @guilhermejacob because you asked me a similar question about update() and i mentioned the danger of differently-sorted data sets

ajdamico commented 8 years ago

if vardpoor's version 0.7.0 has still not fixed the problem, then yes, i agree that you should write them a note explaining the discrepancy

DjalmaPessoa commented 8 years ago

I'm having problems to install the new version of vardpoor:

devtools::install_github("CSBLatvia/vardpoor") Downloading GitHub repo CSBLatvia/vardpoor@master from URL https://api.github.com/repos/CSBLatvia/vardpoor/zipball/master Error: Does not appear to be an R package (no DESCRIPTION)

ajdamico commented 8 years ago

the cran version now has 0.7.0 right? just install from cran

On Tuesday, July 5, 2016, DjalmaPessoa notifications@github.com wrote:

I'm having problems to install the new version of vardpoor:

devtools::install_github("CSBLatvia/vardpoor") Downloading GitHub repo CSBLatvia/vardpoor@master from URL https://api.github.com/repos/CSBLatvia/vardpoor/zipball/master Error: Does not appear to be an R package (no DESCRIPTION)

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/DjalmaPessoa/convey/issues/65#issuecomment-230342556, or mute the thread https://github.com/notifications/unsubscribe/AANO5wAI_fOCDv3b2ncNDEyuT0vgUNYzks5qSVj4gaJpZM4IrkCG .

ajdamico commented 8 years ago

the installation from cran should be fine, but the github install line is actually

install_github("CSBLatvia/vardpoor/vardpoor")
DjalmaPessoa commented 8 years ago

I've checked with the new version of vardpoor and for the function svysqr the se estimates for domains are diffrent

ajdamico commented 8 years ago

if possible, it would be good to provide them with a minimal reproducible example explaining the discrepancy.. last time you did this, they agreed that it was a bug on their end, and they fixed it :)

DjalmaPessoa commented 8 years ago

They are using the linearization expressions in the enclosed paper. I believe the qsr linearization is not ok in the paper. In the equations (44) and (45) the domain indicator is missing as a factor. They just implemented this formula in vardpoor. My idea is to write to Juris calling his attention on that.

On Tue, Jul 5, 2016 at 9:23 AM, Anthony Damico notifications@github.com wrote:

if possible, it would be good to provide them with a minimal reproducible example explaining the discrepancy.. last time you did this, they agreed that it was a bug on their end, and they fixed it :)

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/DjalmaPessoa/convey/issues/65#issuecomment-230463401, or mute the thread https://github.com/notifications/unsubscribe/AFD-p6op3Kn7Z6AFk7rag3X2KiaBZEfKks5qSkzEgaJpZM4IrkCG .

ajdamico commented 8 years ago

great idea