NYCPlanning / db-factfinder

data ETL for population fact finder (decennial + acs)
https://nycplanning.github.io/db-factfinder/factfinder/
MIT License

QAQC: 15-19 #182

Closed mgraber closed 3 years ago

mgraber commented 3 years ago

Discrepancies by percent difference

Remaining edge-cases (as of merge of #188)

e

m

p

z

EricaMaurer commented 3 years ago

e:

- mdvl (all three tracts): median = 5,000. All cases fall in first bin (1-9,999).
- mdnfinc (36081066404): small sample, calculator was bottom-coding the data as 9,999. Fine to actually calculate the median as 72,936. MOE will still be blank.

m:

- mntrvtm (36047054300): MOE = 113.098480964074 (113.1 when rounded)
  - agttmE = 150.489296636086
  - agttmM = 310.662307982163
  - wrkrnothmE = 4.34556574923548
  - wrkrnothmM = 10.9972045070823

  We may have to troubleshoot this one more if the entire column is coming up with issues. Do the inputs match?
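For reference, the MOE quoted above is what the standard ACS approximation for a derived ratio gives when applied to the four listed inputs. The sketch below only checks that arithmetic. `ratio_moe` is a hypothetical helper name, not the function used in db-factfinder, and it assumes mntrvtm is agttm divided by wrkrnothm.

```python
import math


def ratio_moe(num_e, num_m, den_e, den_m):
    """ACS approximation for the MOE of a ratio num/den (hypothetical helper)."""
    ratio = num_e / den_e
    # MOE(ratio) = sqrt(MOE_num^2 + ratio^2 * MOE_den^2) / den
    return math.sqrt(num_m ** 2 + (ratio * den_m) ** 2) / den_e


# Inputs copied from the comment above (tract 36047054300).
agttm_e = 150.489296636086
agttm_m = 310.662307982163
wrkrnothm_e = 4.34556574923548
wrkrnothm_m = 10.9972045070823

moe = ratio_moe(agttm_e, agttm_m, wrkrnothm_e, wrkrnothm_m)
print(round(moe, 1))  # 113.1
```

If the pipeline produces a different MOE for this tract, the difference is more likely in the inputs (for example, how wrkrnothm is derived) than in the final formula.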

z:

mgraber commented 3 years ago

@EricaMaurer thank you for this!

mdvl e (all three tracts)

Are these not cases of N/2 falling within an open-ended bottom-coded group, in which case we set the median as the highest value of the lowest group?

mdvl m for 36081124100

The moe in the sample data is 640612. Our values are:

DISTRIBUTION:
-----
- [0, 9999]: 0.0
- [10000, 14999]: 0.0
- [15000, 19999]: 0.0
- [20000, 24999]: 0.0
- [25000, 29999]: 0.0
- [30000, 34999]: 0.0
- [35000, 39999]: 0.0
- [40000, 49999]: 0.0
- [50000, 59999]: 0.0
- [60000, 69999]: 0.0
- [70000, 79999]: 0.0
- [80000, 89999]: 0.0
- [90000, 99999]: 0.0
- [100000, 124999]: 0.0
- [125000, 149999]: 0.0
- [150000, 174999]: 0.0
- [175000, 199999]: 0.0
- [200000, 249999]: 0.0
- [250000, 299999]: 0.0
- [300000, 399999]: 0.0
- [400000, 499999]: 0.5063291139240507
- [500000, 749999]: 52.911392405063296
- [750000, 999999]: 95.0632911392405
- [1000000, 1499999]: 97.59493670886076
- [1500000, 1999999]: 98.48101265822785
- [2000000, 5000000]: 100.0

mdfaminc m for 36005002400

We're getting both upper and lower in the same bin.

mntrvtm m for 36047054300

Our inputs do not completely match. You can see a walkthrough here, along with the function we're using for the special calculation. We're using more inputs than the ones listed, since we're first calculating wrkrnothm from wrkr16pl and cw_wrkdhm.

lgchilep1 z for 36085029105

This is an issue on our end. Our formulas get the same thing as you initially, seen here, but are somewhere getting overwritten with 0. @SPTKL any idea where this might be happening?

SPTKL commented 3 years ago

@mgraber

lgchilep1 z for 36085029105

For lgchilep1, everything is fine until we get to the get_z function. The condition of the third elif, elif m ** 2 - (e * agg_m / agg_e) ** 2 < 0, is not met, because m ** 2 - (e * agg_m / agg_e) ** 2 equals 0 rather than being less than 0. So we are correct here:

33.06894964327548 ** 2 - (14.580814354727398 * 49.603424464913225 / 21.8712215320911) ** 2 == 0

However, if we reduce the precision to 4 digits:

33.0689 ** 2 - (14.5808 * 49.6034 / 21.8712) ** 2 = -0.002204...

We are including more digits, so our calculation is more precise. There's no overwriting and no logic error here.
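The two evaluations above can be reproduced with a few lines of Python. This is a minimal sketch: `radicand` is a hypothetical helper that only evaluates the expression from the quoted elif condition; it is not the get_z function itself.

```python
def radicand(m, e, agg_m, agg_e):
    """Expression tested in the quoted elif: m^2 - (e * agg_m / agg_e)^2."""
    return m ** 2 - (e * agg_m / agg_e) ** 2


# Full-precision inputs from the comment above (lgchilep1, tract 36085029105).
full = radicand(
    33.06894964327548,
    14.580814354727398,
    49.603424464913225,
    21.8712215320911,
)

# The same inputs truncated to 4 decimal places.
rounded = radicand(33.0689, 14.5808, 49.6034, 21.8712)

print(full)     # at, or within floating-point noise of, zero
print(rounded)  # clearly negative, about -0.0022
```

With the rounded inputs the strict `< 0` test passes; with the full-precision inputs the value sits at zero, so the comparison can branch differently depending on how many digits are carried.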

mgraber commented 3 years ago

@EricaMaurer In the top comment, I posted several CSVs containing the remaining discrepancies grouped by percent difference to guide our remaining review. This also gives you a chance to see the extent to which we're matching on variables I haven't brought up, to make sure they are close enough for Population's sign-off.

Similar CSVs are in the 06-10 issue #178.

For now, maybe we focus on cases still in the over 10% category?

SPTKL commented 3 years ago

@EricaMaurer mntrvtm and avghhsooc were errors on our end; they're resolved now.

EricaMaurer commented 3 years ago

Over 10% difference (tracts and non-tract geogs):

SPTKL commented 3 years ago
mgraber commented 3 years ago

Investigations into non-median discrepancies