mgraber commented 3 years ago

Discrepancies by percent difference

More than 10% difference, defined as ((code value - sample value)/(sample value)) * 100:
- Tract-level: pct_diff_o10_2019.csv.zip
- Above tract: pct_diff_o10_2019_nontract.csv.zip
Between 1% and 10% difference:
- Tract-level: pct_diff_1t10_2019.csv.zip
- Above tract: pct_diff_1t10_2019_nontract.csv.zip
Differences in NULLS:
- Tract-level: null_diffs_2019.csv.zip
- Above tract: null_diff_2019_nontract.csv.zip
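
As a rough illustration of how the three groups above are split, here is a minimal sketch. `pct_diff` follows the definition given above; `bucket` and its labels are hypothetical, and bucketing on the absolute value and treating a one-sided NULL as its own group are assumptions, not the code that produced the CSVs.

```python
def pct_diff(code_value, sample_value):
    """Signed percent difference of our value relative to the sample value."""
    return (code_value - sample_value) / sample_value * 100


def bucket(code_value, sample_value):
    """Hypothetical grouping that mirrors the three sets of CSVs above."""
    # Assumption: a NULL on exactly one side is reported as a NULL difference.
    if code_value is None or sample_value is None:
        return "match" if code_value is sample_value else "null_diff"
    # Assumption: a zero sample value cannot be compared by percent.
    if sample_value == 0:
        return "match" if code_value == 0 else "over_10"
    size = abs(pct_diff(code_value, sample_value))
    if size > 10:
        return "over_10"
    if size >= 1:
        return "1_to_10"
    return "under_1"


# mdvl MOE for 36081124100: ours is 77123, the sample has 640612.
print(bucket(77123, 640612))
```
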

Remaining edge-cases (as of merge of #188)

e

[x] mdvl: 36005019700, 36061002400, 36061006200
[x] mdnfinc: 36081066404 (sample: 9999, code: 72937)

m

[ ] mdvl: 36081124100
[x] mdfaminc: 36005002400
[x] mntrvtm: 36047054300, 36047044902, 36081053502, 36081053902, 36081089202, 36085002700, 36061011900, etc. #199

p

z

[x] lgchilep1: 36085029105 null_diffs_2019.csv.zip

EricaMaurer commented 3 years ago

e:

- mdvl (all three tracts): median = 5,000. All cases fall in the first bin (1-9,999).
- mdnfinc (36081066404): small sample; the calculator was bottom-coding the data as 9,999. Fine to actually calculate the median as 72,936. MOE will still be blank.

m:

mdvl (36081124100):
- lower bound C1 = 0.506329113924051
- lower bound C2 = 52.9113924050633
- lower bound A1 = 0 because (lower bound C1-1)<0.0001
- lower bound A2 = 9999.9999 because (lower bound C1-1)<0.0001
- Lower bound = 7712.22177392625

mdfaminc (36005002400):
- lower bound C1 = 0 (bin 12)
- lower bound C2 = 100 (bin 13)
- lower bound A1 = 0 (C1 < 0.0001)
- lower bound A2 = 9999.9999 (C1 < 0.0001)
- upper bound C1 = 0 (bin 12)
- upper bound C2 = 100 (bin 13)
- upper bound A1 = 100,000 (bin 12)
- upper bound A2 = 124,999.9999 (bin 13)

mntrvtm (36047054300): MOE = 113.098480964074 (113.1 when rounded)
- agttmE = 150.489296636086
- agttmM = 310.662307982163
- wrkrnothmE = 4.34556574923548
- wrkrnothmM = 10.9972045070823

We may have to troubleshoot this one more if the entire column is coming up with issues. Do the inputs match?
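
For anyone checking these inputs by hand: the posted MOE is consistent with the standard ACS formula for the MOE of a ratio. The sketch below is only that check. The function name `ratio_moe` is hypothetical, and that the spreadsheet uses exactly this form is an assumption.

```python
import math


def ratio_moe(num_e, num_m, den_e, den_m):
    """MOE of num_e / den_e using the ACS ratio formula (assumed here)."""
    ratio = num_e / den_e
    return math.sqrt(num_m ** 2 + (ratio * den_m) ** 2) / den_e


moe = ratio_moe(
    num_e=150.489296636086,   # agttmE
    num_m=310.662307982163,   # agttmM
    den_e=4.34556574923548,   # wrkrnothmE
    den_m=10.9972045070823,   # wrkrnothmM
)
print(round(moe, 1))  # 113.1
```
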

z:

lgchilep1 (36085029105):
- lgchi1E = 21.8712215320911
- lgchi1M = 49.6034244649132
- lgchilep1E = 14.5808143547274
- lgchilep1M = 33.0689496432755

Using the same formula as the rest of the column, so we may have to dig into the inputs.

mgraber commented 3 years ago

@EricaMaurer thank you for this!

mdvl e (all three tracts)

Are these not cases of N/2 falling within an open-ended bottom-coded group, in which case we set the median as the highest value of the lowest group?

mdvl m for 36081124100

The moe in the sample data is 640612. Our values are:

Median = 736110.1666666666
Median_MOE = 77122.9769558513
B = 790.0
se_50 = 9.07772346412575
p_lower = 40.922276535874246
p_upper = 59.077723464125754
lower_bin = 21
upper_bin = 22
first_non_zero_bin = 20

DISTRIBUTION:
-----
- [0, 9999]: 0.0
- [10000, 14999]: 0.0
- [15000, 19999]: 0.0
- [20000, 24999]: 0.0
- [25000, 29999]: 0.0
- [30000, 34999]: 0.0
- [35000, 39999]: 0.0
- [40000, 49999]: 0.0
- [50000, 59999]: 0.0
- [60000, 69999]: 0.0
- [70000, 79999]: 0.0
- [80000, 89999]: 0.0
- [90000, 99999]: 0.0
- [100000, 124999]: 0.0
- [125000, 149999]: 0.0
- [150000, 174999]: 0.0
- [175000, 199999]: 0.0
- [200000, 249999]: 0.0
- [250000, 299999]: 0.0
- [300000, 399999]: 0.0
- [400000, 499999]: 0.5063291139240507
- [500000, 749999]: 52.911392405063296
- [750000, 999999]: 95.0632911392405
- [1000000, 1499999]: 97.59493670886076
- [1500000, 1999999]: 98.48101265822785
- [2000000, 5000000]: 100.0
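
The values above can be reproduced with a short sketch of the interpolation. This is not the factfinder code: the names `find_bin` and `interpolate` are hypothetical, the upper edge of each bin is assumed to be the next bin's lower edge, and the MOE is assumed to be 1.645 times half the width of the interval. It does not apply any first-non-zero-bin rule.

```python
Z_90 = 1.645  # 90% confidence multiplier (assumed)

# Lower edge of every bin, plus the top edge of the last bin.
EDGES = [0, 10000, 15000, 20000, 25000, 30000, 35000, 40000, 50000, 60000,
         70000, 80000, 90000, 100000, 125000, 150000, 175000, 200000,
         250000, 300000, 400000, 500000, 750000, 1000000, 1500000,
         2000000, 5000000]
# Cumulative percent at the top of each bin, copied from the listing above.
CUM = [0.0] * 20 + [0.5063291139240507, 52.911392405063296,
                    95.0632911392405, 97.59493670886076,
                    98.48101265822785, 100.0]


def find_bin(p):
    """Index of the first bin whose cumulative percent reaches p."""
    for i, c in enumerate(CUM):
        if c >= p:
            return i
    return len(CUM) - 1


def interpolate(p):
    """Linear interpolation of the value at cumulative percent p."""
    i = find_bin(p)
    c1 = CUM[i - 1] if i > 0 else 0.0
    c2 = CUM[i]
    a1, a2 = EDGES[i], EDGES[i + 1]
    return a1 + (p - c1) / (c2 - c1) * (a2 - a1)


p_lower, p_upper = 40.922276535874246, 59.077723464125754
lower, upper = interpolate(p_lower), interpolate(p_upper)
moe = Z_90 * (upper - lower) / 2
print(find_bin(p_lower), find_bin(p_upper), moe)
```

Under these assumptions the bins come out as 21 and 22 and the MOE lands on the Median_MOE posted above, which suggests the gap with the sample value comes from which bin bounds are used rather than from the interpolation itself.
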

mdfaminc m for 36005002400

We're getting both upper and lower in the same bin.

Median = 112499.5
Median_MOE = 7295.042420334079
B = 38.0
se_50 = 44.346762433641814
p_lower = 5.653237566358186
p_upper = 94.34676243364181
lower_bin = 12
upper_bin = 12
first_non_zero_bin = 12
upper_bin and lower_bin are in the first non-zero bin
UPPER_BOUND:
```
    A1=0, A2=10000, C1=0.0, C2=100.0
```
LOWER_BOUND:
```
    A1=0, A2=10000, C1=0.0, C2=100.0
```
CUMULATIVE DISTRIBUTION:
- 0, 9999: 0.0
- 10000, 14999: 0.0
- 15000, 19999: 0.0
- 20000, 24999: 0.0
- 25000, 29999: 0.0
- 30000, 34999: 0.0
- 35000, 39999: 0.0
- 40000, 44999: 0.0
- 45000, 49999: 0.0
- 50000, 59999: 0.0
- 60000, 74999: 0.0
- 75000, 99999: 0.0
- 100000, 124999: 100.0
- 125000, 149999: 100.0
- 150000, 199999: 100.0
- 200000, 9999999: 100.0
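
To make the difference concrete, here is a minimal sketch that plugs the A1/A2/C1/C2 values posted above into a linear interpolation. The function name `interpolate` and the 1.645 multiplier are assumptions. The alternative upper bound uses the upper-bound A1/A2 from EricaMaurer's comment and is shown only to illustrate where the two approaches diverge.

```python
def interpolate(p, a1, a2, c1, c2):
    """Value at cumulative percent p between bin edges a1 and a2."""
    return a1 + (p - c1) / (c2 - c1) * (a2 - a1)


p_lower, p_upper = 5.653237566358186, 94.34676243364181

# Our debug output: both bounds use the bottom-bin edges.
lower = interpolate(p_lower, a1=0, a2=10000, c1=0.0, c2=100.0)
upper = interpolate(p_upper, a1=0, a2=10000, c1=0.0, c2=100.0)
moe = 1.645 * (upper - lower) / 2  # assumed 90% multiplier

# Upper bound with the bin 12 edges listed in EricaMaurer's comment.
upper_alt = interpolate(p_upper, a1=100000, a2=124999.9999, c1=0.0, c2=100.0)

print(moe, upper, upper_alt)
```
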

mntrvtm m for 36047054300

Our inputs do not completely match. You can see a walkthrough here, along with the function we're using for the special calculation. We're using more inputs than the ones listed, since we're first calculating wrkrnothm from wrkr16pl and cw_wrkdhm.

lgchilep1 z for 36085029105

This is an issue on our end. Our formulas get the same thing as you initially, seen here, but are somewhere getting overwritten with 0. @SPTKL any idea where this might be happening?

SPTKL commented 3 years ago

@mgraber

lgchilep1 z for 36085029105

For lgchilep1, everything is fine until we get to the `get_z` function. The condition of the third `elif`, `elif m ** 2 - (e * agg_m / agg_e) ** 2 < 0`, is not met, because `m ** 2 - (e * agg_m / agg_e) ** 2` evaluates to 0 rather than to a negative number. So we are correct here: `33.06894964327548 ** 2 - (14.580814354727398 * 49.603424464913225 / 21.8712215320911) ** 2 == 0`. However, if we reduce the precision to 4 digits, `33.0689 ** 2 - (14.5808 * 49.6034 / 21.8712) ** 2 = -0.002204...`

We are including more digits, so our calculation is more precise. There's no overwriting and no logic error here.
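
A small sketch of the comparison described above, using only the numbers in this comment. The helper name `radicand` is hypothetical and this is not the `get_z` code itself.

```python
def radicand(e, m, agg_e, agg_m):
    """The expression tested in the third elif: m^2 - (e * agg_m / agg_e)^2."""
    return m ** 2 - (e * agg_m / agg_e) ** 2


# Full-precision inputs: the result is zero up to floating-point noise.
full = radicand(e=14.580814354727398, m=33.06894964327548,
                agg_e=21.8712215320911, agg_m=49.603424464913225)

# Same inputs rounded to 4 decimal places: the result turns negative.
rounded = radicand(e=14.5808, m=33.0689, agg_e=21.8712, agg_m=49.6034)

print(full, rounded)
```

Because the full-precision value sits right at zero, which branch is taken depends on how many digits the inputs carry, which is why a spreadsheet fed rounded inputs can land in a different branch.
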

mgraber commented 3 years ago

@EricaMaurer In the top comment, I posted several CSVs containing the remaining discrepancies grouped by percent difference to guide our remaining review. This also gives you a chance to see the extent to which we're matching on variables I haven't brought up, to make sure they are close enough for Population's sign-off.

Similar CSVs are in the 06-10 issue #178.

For now, maybe we focus on cases still in the over 10% category?

SPTKL commented 3 years ago

@EricaMaurer mntrvtm and avghhsooc were errors on our end; they're resolved.

EricaMaurer commented 3 years ago

Over 10% difference (tracts and non-tract geogs):

- Issues are noted in the attached spreadsheet.
- Many of the issues look like a general rounding problem: rounding to different digits, rounding up for .5 vs. not, etc. Cleaning this up will likely also clear some of the cases in the under-10% difference check. #200
- Was the latest economic upload incorporated? I saw a few issues that should have been resolved by the change to the median formula on my end.
- I noted a few blank MOEs in this sheet that should have values. pct_diff_o10_2019_allgeogs.xlsx

SPTKL commented 3 years ago

The only remaining 2019 mdrms/mdvl MOE calculation discrepancies are below. All of these are because p_lower and p_upper are both in the first non-zero bin.

| geoid | variable | e (code) | e (sample) | e diff | m (code) | m (sample) | m diff |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 36005002400 | mdrms | 6 | 6 | 0 | 1.1 | 0.7 | 0.5714285714285716 |
| 36061000500 | mdrms | 4 | 4 | 0 | 1.1 | 0.7 | 0.5714285714285716 |
| 36061011900 | mdrms | 1 | 1.2 | 0.16666666666666663 | 1.3 | 0.2 | 5.5 |
| 36085002100 | mdrms | 4.5 | 4.5 | 0 | 0.4 | 0.5 | 0.19999999999999996 |
| BX0991 | mdrms | 6 | 6 | 0 | 1.1 | 0.7 | 0.5714285714285716 |
| MN0191 | mdrms | 4 | 4 | 0 | 1.1 | 0.7 | 0.5714285714285716 |
| 36081124100 | mdvl | 736110 | 736110 | 0 | 77123 | 640612 | 0.8796104350215107 |

Referencing the following distribution, we are correct. It seems like Population is putting p_lower in the bottom bin, following the first non-zero bin rule.

mgraber commented 3 years ago

Investigations into non-median discrepancies

avghhsooc p is not NULL because of a data cleaning step where p for non-median base variables is set to 100. https://github.com/NYCPlanning/db-factfinder/blob/41a7f2f9283de40e367f0332704d74a69c1cfd5a/factfinder/calculate.py#L341-L345 Up until this last-minute cleaning exception, avghhsooc p is NULL (it is calculated in this if-statement). This is because the variable is a special variable, it serves as a base variable for another variable, and the geoid is not a city or borough. What other exceptions (beyond medians) do we need to include when setting p to 100 for base variables? PR #204 is our best guess, which also excludes special variables.
Cases of missing MOE seem to stem from the 2010 to 2020 conversion, where the 2020 output tract has more than one input tract. For example, the 2020 geoid 36085009702 comes from 2010 geoids 36085008100 and 36085009700, with ratios of roughly 0.000698 and 0.000000 respectively. When converting the MOEs for 36085008100 and 36085009700, one is NULL (36085009700, because its ratio is 0). As it is currently implemented, the sum-of-squares formula for combining a NULL moe with a non-NULL moe to roll up the converted input tracts into a single record for 2020 geoid 36085009702 results in NULL.
```
math.sqrt(sum([i ** 2 for i in [np.nan, 2.494853]]))  # evaluates to nan
```
Should we ignore NULLs in the sum-of-squares formula? PR #202 implements this change for all aggregated geographies. Should this only apply to combining 2010 tracts into the merged 2020 tracts?
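
For discussion, a minimal sketch of what ignoring NULLs in the sum of squares could look like. This is not the code in PR #202: the function name `combine_moe` and the `skip_null` flag are hypothetical, and returning NaN when every input is NULL is an assumption.

```python
import math


def combine_moe(moes, skip_null=True):
    """Root-sum-of-squares of MOEs, optionally ignoring NULL (None/NaN) inputs."""
    values = [m for m in moes if m is not None and not math.isnan(m)]
    # Current behaviour: any NULL input makes the combined MOE NULL.
    if not skip_null and len(values) != len(moes):
        return math.nan
    # Assumption: if nothing is left to combine, the result stays NULL.
    if not values:
        return math.nan
    return math.sqrt(sum(m ** 2 for m in values))


print(combine_moe([math.nan, 2.494853]))                   # 2.494853
print(combine_moe([math.nan, 2.494853], skip_null=False))  # nan
```

A flag like `skip_null` would let the behaviour be switched on only for the 2010-to-2020 tract conversion if it should not apply to all aggregated geographies.
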

NYCPlanning / db-factfinder

QAQC: 15-19 #182
