UDST / synthpop

Synthetic populations from census data
BSD 3-Clause "New" or "Revised" License

Initial Quality Assessment #22

Open jiffyclub opened 9 years ago

jiffyclub commented 9 years ago

I recorded some of the quality data from Napa County, pasted below. A low chi-squared is better (ideally less than 1) and a high p-value is better; both indicate similarity between the expected and observed distributions. One thing that stands out is that some block groups turn out pretty well and others don't, and that this is repeatable between runs (it's not random chance). There seems to be something about those particular block groups that leads to a good or poor fit, which will require more investigation. I'm open to ideas on other ways of evaluating the final quality of the synthesis.

Geography: 06 055 201403 2
    num households:  202
    household chisq: 5.43088089045
    household p:     5.21073120377e-34
    people chisq:    16.0817647772
    people p:        9.93781526248e-86
Geography: 06 055 200706 2
    num households:  314
    household chisq: 0.598180266206
    household p:     0.991095451554
    people chisq:    1.5979216509
    people p:        0.0186412607763
Geography: 06 055 201102 2
    num households:  326
    household chisq: 1.55198810451
    household p:     0.00614775663362
    people chisq:    5.09426306818
    people p:        6.05271644872e-19
Geography: 06 055 200202 2
    num households:  151
    household chisq: 1.57294117642
    household p:     0.00488296278056
    people chisq:    9.81533427718
    people p:        1.22522219051e-46
Geography: 06 055 201601 1
    num households:  473
    household chisq: 6.02547998661
    household p:     1.03611171429e-39
    people chisq:    6.29655747411
    people p:        1.01216773952e-25
Geography: 06 055 201401 1
    num households:  341
    household chisq: 3.19792886587
    household p:     4.15912318716e-14
    people chisq:    5.49037386335
    people p:        3.8045168528e-21
Geography: 06 055 200802 1
    num households:  348
    household chisq: 1.61951419488
    household p:     0.00288634250822
    people chisq:    1.82795414481
    people p:        0.00326723157248
Geography: 06 055 201403 1
    num households:  93
    household chisq: 4997.26506248
    household p:     0.0
    people chisq:    13.7349343251
    people p:        6.40236917105e-71
Geography: 06 055 201200 1
    num households:  257
    household chisq: 1.94674866466
    household p:     4.48102332262e-05
    people chisq:    2.03213659632
    people p:        0.000589862312201
Geography: 06 055 200706 3
    num households:  343
    household chisq: 1.24396606265
    household p:     0.109385822386
    people chisq:    1.56084161759
    people p:        0.024157555149
Geography: 06 055 201102 1
    num households:  477
    household chisq: 2.7121000322
    household p:     2.62111652853e-10
    people chisq:    5.16074956589
    people p:        2.59917511571e-19
Geography: 06 055 200504 2
    num households:  1185
    household chisq: 3.39998356272
    household p:     9.15364032494e-16
    people chisq:    10.994263177
    people p:        7.27205255309e-54
Geography: 06 055 200804 2
    num households:  400
    household chisq: 0.81717537662
    household p:     0.826306619365
    people chisq:    0.914239917016
    people p:        0.603490408465
Geography: 06 055 200203 1
    num households:  420
    household chisq: 0.590906858823
    household p:     0.992300753267
    people chisq:    1.20234540617
    people p:        0.202724531156
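
For anyone who wants to triage output like the listing above, here is a rough sketch of a parser that pulls out the block groups whose reduced chi-squared exceeds 1. The function names (`parse_report`, `poor_fits`) are made up for illustration and are not part of synthpop; the sample text is just the first and last entries from the listing.

```python
SAMPLE = """\
Geography: 06 055 201403 2
    num households:  202
    household chisq: 5.43088089045
    household p:     5.21073120377e-34
    people chisq:    16.0817647772
    people p:        9.93781526248e-86
Geography: 06 055 200203 1
    num households:  420
    household chisq: 0.590906858823
    household p:     0.992300753267
    people chisq:    1.20234540617
    people p:        0.202724531156
"""


def parse_report(text):
    """Parse the 'Geography: ...' report format into a list of dicts."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "Geography":
            # A new geography starts a new record.
            records.append({"geography": value})
        elif records:
            # Household counts are integers; everything else is a float.
            convert = int if key == "num households" else float
            records[-1][key] = convert(value)
    return records


def poor_fits(records, key="household chisq", threshold=1.0):
    """Return geographies whose `key` value exceeds `threshold`, worst first."""
    flagged = [r for r in records if r[key] > threshold]
    flagged.sort(key=lambda r: r[key], reverse=True)
    return [r["geography"] for r in flagged]


records = parse_report(SAMPLE)
print(poor_fits(records, "household chisq"))
print(poor_fits(records, "people chisq"))
```

The threshold of 1.0 follows the "ideally less than 1" rule of thumb; it is a cutoff for triage, not a formal test.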

fscottfoti commented 9 years ago

Low p-value is better, right? Many of the p-values are very small, but three seem to be quite large. We might have to look at them individually to see what's going on. Keep in mind PUMS changed their sample this year, so we have a smaller sample than is typical, which could certainly make it harder to meet the marginals. It would be interesting to look at the marginals and joint distribution for the block groups that perform poorly.
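
One way to look at the marginals for a poorly performing block group would be to split the chi-squared into its per-category terms and see which categories dominate. This is only a sketch: `chisq_contributions` is a hypothetical helper, the category names are invented, and the counts are made-up numbers, not values from the Napa run.

```python
def chisq_contributions(expected, observed):
    """Per-category Pearson terms (obs - exp)^2 / exp, largest first.

    Categories with an expected count of zero are skipped, since the
    term is undefined there.
    """
    terms = {}
    for category, exp in expected.items():
        obs = observed.get(category, 0)
        if exp > 0:
            terms[category] = (obs - exp) ** 2 / exp
    return sorted(terms.items(), key=lambda kv: kv[1], reverse=True)


# Made-up marginals for one block group (illustrative only).
target = {"hh_size_1": 50, "hh_size_2": 80, "hh_size_3": 40, "hh_size_4plus": 32}
synth = {"hh_size_1": 48, "hh_size_2": 82, "hh_size_3": 20, "hh_size_4plus": 52}

for category, term in chisq_contributions(target, synth):
    print(f"{category:15s} {term:8.3f}")
```

Categories at the top of the list are the ones where the synthesized counts stray furthest from the marginal relative to its size, which is where I'd start looking.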

jiffyclub commented 9 years ago

A low p-value is good when you're trying to show that two sets are not from the same distribution. In this case we're hoping the sets do look like they come from the same distribution, so for a goodness-of-fit test you want a low chi-squared and a high p-value. For example, the last geography above indicates a pretty good match between the synthetic totals and the target constraints. The first geography is a poor match.
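
To make the direction of the two numbers concrete, here is a minimal standard-library sketch of a Pearson goodness-of-fit calculation. It is not necessarily how synthpop computes its statistics: the choice of degrees of freedom (categories minus one) is an assumption, `goodness_of_fit` and `chi2_sf` are illustrative names, and the counts are made up.

```python
import math


def chi2_sf(x, dof):
    """P(X >= x) for a chi-squared variable with `dof` degrees of freedom.

    Evaluated as the regularized upper incomplete gamma function
    Q(dof / 2, x / 2), using a series for small arguments and a
    continued fraction (modified Lentz) otherwise.
    """
    if x <= 0:
        return 1.0
    a, z = dof / 2.0, x / 2.0
    log_prefactor = -z + a * math.log(z) - math.lgamma(a)
    if z < a + 1.0:
        # Series for the lower function P(a, z); then Q = 1 - P.
        term = total = 1.0 / a
        n = a
        for _ in range(1000):
            n += 1.0
            term *= z / n
            total += term
            if abs(term) < abs(total) * 1e-15:
                break
        return max(0.0, 1.0 - total * math.exp(log_prefactor))
    # Continued fraction for Q(a, z).
    tiny = 1e-300
    b = z + 1.0 - a
    c = 1.0 / tiny
    d = 1.0 / b
    h = d
    for i in range(1, 1000):
        an = -i * (i - a)
        b += 2.0
        d = an * d + b
        if abs(d) < tiny:
            d = tiny
        c = b + an / c
        if abs(c) < tiny:
            c = tiny
        d = 1.0 / d
        delta = d * c
        h *= delta
        if abs(delta - 1.0) < 1e-15:
            break
    return h * math.exp(log_prefactor)


def goodness_of_fit(expected, observed):
    """Return (chi-squared, reduced chi-squared, p-value) for two count lists."""
    chisq = sum((o - e) ** 2 / e for e, o in zip(expected, observed))
    dof = len(expected) - 1  # assumption: totals are constrained to match
    return chisq, chisq / dof, chi2_sf(chisq, dof)


# Made-up counts: one synthesis close to the target, one far from it.
target = [50, 80, 40, 32]
close = [48, 82, 41, 31]
far = [30, 90, 20, 62]

print("close:", goodness_of_fit(target, close))
print("far:  ", goodness_of_fit(target, far))
```

The close case gives a small chi-squared and a p-value near 1, while the far case gives a large chi-squared and a p-value near 0, which is the same pattern as the last and first geographies in the listing.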

fscottfoti commented 9 years ago

Ahh - well it looks like we have a problem then ;)

(Sent by email on Tue, Sep 16, 2014, in reply to Matt Davis's 4:37 PM comment above: https://github.com/synthicity/synthpop/issues/22#issuecomment-55829857)

jiffyclub commented 9 years ago

Also note that this is a "reduced" chi-squared, where I think values less than one are "pretty good" and values more than one are "not good".
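
For reference, the usual textbook definition is below. I'm assuming here that the reported values are a Pearson statistic divided by the degrees of freedom; the exact computation isn't shown in this thread, so treat the symbols as illustrative.

```latex
\chi^2_\nu = \frac{\chi^2}{\nu},
\qquad
\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}
```

Here $O_i$ is the synthesized count in category $i$, $E_i$ is the target count, $k$ is the number of categories, and $\nu$ is the number of degrees of freedom. Since a chi-squared variable with $\nu$ degrees of freedom has mean $\nu$, a reduced value around 1 is what you'd expect from sampling noise alone, and values well above 1 point to a real mismatch.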