Closed mgraber closed 3 years ago
e:
- mdvl (all three tracts): median = 5,000. All cases fall in the first bin (1-9,999).
- mdnfinc (36081066404): small sample, calculator was bottom-coding the data as 9,999. Fine to actually calculate the median as 72,936. MOE will still be blank.
m:
- mdvl (36081124100):
  - lower bound C1 = 0.506329113924051
  - lower bound C2 = 52.9113924050633
  - lower bound A1 = 0 because (lower bound C1-1)<0.0001
  - lower bound A2 = 9999.9999 because (lower bound C1-1)<0.0001
  - Lower bound = 7712.22177392625
- mdfaminc (36005002400):
  - lower bound C1 = 0 (bin 12)
  - lower bound C2 = 100 (bin 13)
  - lower bound A1 = 0 (C1 < 0.0001)
  - lower bound A2 = 9999.9999 (C1 < 0.0001)
  - upper bound C1 = 0 (bin 12)
  - upper bound C2 = 100 (bin 13)
  - upper bound A1 = 100,000 (bin 12)
  - upper bound A2 = 124,999.9999 (bin 13)
- mntrvtm (36047054300): MOE = 113.098480964074 (113.1 when rounded)
  - agttmE = 150.489296636086
  - agttmM = 310.662307982163
  - wrkrnothmE = 4.34556574923548
  - wrkrnothmM = 10.9972045070823
  - We may have to troubleshoot this one more if the entire column is coming up with issues. Do the inputs match?
z:
@EricaMaurer thank you for this!
Are these not cases of N/2 falling within an open-ended bottom-coded group, in which case we set the median as the highest value of the lowest group?
The moe in the sample data is 640612. Our values are:
DISTRIBUTION:
-----
- [0, 9999]: 0.0
- [10000, 14999]: 0.0
- [15000, 19999]: 0.0
- [20000, 24999]: 0.0
- [25000, 29999]: 0.0
- [30000, 34999]: 0.0
- [35000, 39999]: 0.0
- [40000, 49999]: 0.0
- [50000, 59999]: 0.0
- [60000, 69999]: 0.0
- [70000, 79999]: 0.0
- [80000, 89999]: 0.0
- [90000, 99999]: 0.0
- [100000, 124999]: 0.0
- [125000, 149999]: 0.0
- [150000, 174999]: 0.0
- [175000, 199999]: 0.0
- [200000, 249999]: 0.0
- [250000, 299999]: 0.0
- [300000, 399999]: 0.0
- [400000, 499999]: 0.5063291139240507
- [500000, 749999]: 52.911392405063296
- [750000, 999999]: 95.0632911392405
- [1000000, 1499999]: 97.59493670886076
- [1500000, 1999999]: 98.48101265822785
- [2000000, 5000000]: 100.0
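The estimate for this distribution can be reproduced with a short sketch. It assumes the listed values are cumulative percentages, that the median bin is the first bin whose cumulative percentage reaches 50, and that the value is linearly interpolated inside that bin. The names `find_bin` and `interpolate` are hypothetical and not taken from db-factfinder.

```python
# Cumulative percent distribution for mdvl (36081124100), copied from the listing above.
# Each entry is (A1, A2, cumulative percent at the top of the bin).
BINS = [
    (0, 9999, 0.0), (10000, 14999, 0.0), (15000, 19999, 0.0),
    (20000, 24999, 0.0), (25000, 29999, 0.0), (30000, 34999, 0.0),
    (35000, 39999, 0.0), (40000, 49999, 0.0), (50000, 59999, 0.0),
    (60000, 69999, 0.0), (70000, 79999, 0.0), (80000, 89999, 0.0),
    (90000, 99999, 0.0), (100000, 124999, 0.0), (125000, 149999, 0.0),
    (150000, 174999, 0.0), (175000, 199999, 0.0), (200000, 249999, 0.0),
    (250000, 299999, 0.0), (300000, 399999, 0.0),
    (400000, 499999, 0.5063291139240507),
    (500000, 749999, 52.911392405063296),
    (750000, 999999, 95.0632911392405),
    (1000000, 1499999, 97.59493670886076),
    (1500000, 1999999, 98.48101265822785),
    (2000000, 5000000, 100.0),
]


def find_bin(bins, p):
    """Return the index of the first bin whose cumulative percent reaches p."""
    for i, (_, _, cum) in enumerate(bins):
        if cum >= p:
            return i
    return len(bins) - 1


def interpolate(bins, p):
    """Linearly interpolate the value at cumulative percent p (assumed method)."""
    i = find_bin(bins, p)
    a1, a2, c2 = bins[i]
    c1 = bins[i - 1][2] if i > 0 else 0.0  # cumulative percent below the bin
    if c2 == c1:
        return float(a1)
    return a1 + (p - c1) / (c2 - c1) * (a2 - a1)


median = interpolate(BINS, 50)
print(round(median))  # 736110
```

With these inputs the sketch lands on the mdvl estimate of 736110 reported for this tract. Population's worked example keeps the same C1 and C2 but uses A1 = 0 and A2 = 9999.9999, which is why its lower bound comes out near 7712.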
We're getting both upper and lower in the same bin.
A1=0, A2=10000, C1=0.0, C2=100.0
A1=0, A2=10000, C1=0.0, C2=100.0
Our inputs do not completely match. You can see a walkthrough here, along with the function we're using for the special calculation. We're using more inputs than the ones listed, since we're first calculating wrkrnothm from wrkr16pl and cw_wrkdhm.
This is an issue on our end. Our formulas initially produce the same value as yours, seen here, but it is getting overwritten with 0 somewhere. @SPTKL any idea where this might be happening?
@mgraber
For lgchilep1, everything is fine until we get to the get_z function. The condition of the third elif -> elif m ** 2 - (e * agg_m / agg_e) ** 2 < 0
is not met, because m ** 2 - (e * agg_m / agg_e) ** 2 evaluates to exactly 0
rather than something less than 0, so we are correct here:
33.06894964327548 ** 2 - (14.580814354727398 * 49.603424464913225 / 21.8712215320911) ** 2 == 0
However, if we reduce the precision to 4 digits:
33.0689 ** 2 - (14.5808 * 49.6034 / 21.8712) ** 2 = -0.002204...
We are including more digits, so our calculation is more precise. There is no overwriting and no logic error here.
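The effect of rounding on that condition can be checked directly. The sketch below only evaluates the expression from the elif with the numbers quoted above; the helper name `radicand` is hypothetical and get_z itself is not reproduced.

```python
def radicand(e, m, agg_e, agg_m):
    """Evaluate m ** 2 - (e * agg_m / agg_e) ** 2, the quantity tested in the elif."""
    return m ** 2 - (e * agg_m / agg_e) ** 2


# Full-precision inputs as quoted in this comment.
full = radicand(
    e=14.580814354727398,
    m=33.06894964327548,
    agg_e=21.8712215320911,
    agg_m=49.603424464913225,
)

# The same inputs rounded to 4 decimal places.
rounded = radicand(e=14.5808, m=33.0689, agg_e=21.8712, agg_m=49.6034)

print(full)     # zero to within floating-point error
print(rounded)  # about -0.0022, so the "< 0" branch would be taken
```

Here e / agg_e and m / agg_m are almost exactly the same ratio (about 2/3), so the true difference is zero and the sign of the computed result depends entirely on rounding. Comparing against a small tolerance instead of 0 would be one way to make the two calculations agree, if that is wanted.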
@EricaMaurer In the top comment, I posted several CSVs containing the remaining discrepancies grouped by percent difference to guide our remaining review. This also gives you a chance to see the extent to which we're matching on variables I haven't brought up, to make sure they are close enough for Population's sign-off.
Similar CSVs are in the 06-10 issue #178.
For now, maybe we focus on cases still in the over 10% category?
@EricaMaurer the mntrvtm and avghhsooc discrepancies were an error on our end, and they are now resolved
Over 10% difference (tracts and non-tract geogs):
These are the only remaining 2019 mdrms/mdvl MOE discrepancies.
All of them occur because p_lower and p_upper both fall in the first non-zero bin:
36005002400 mdrms 6 6 0 1.1 0.7 0.5714285714285716
36061000500 mdrms 4 4 0 1.1 0.7 0.5714285714285716
36061011900 mdrms 1 1.2 0.16666666666666663 1.3 0.2 5.5
36085002100 mdrms 4.5 4.5 0 0.4 0.5 0.19999999999999996
BX0991 mdrms 6 6 0 1.1 0.7 0.5714285714285716
MN0191 mdrms 4 4 0 1.1 0.7 0.5714285714285716
36081124100 mdvl 736110 736110 0 77123 640612 0.8796104350215107
Referencing the following distribution, we are correct.
It seems that Population is putting p_lower in the bottom bin, following the first non-zero bin rule.
avghhsooc p is not NULL because of a data cleaning step in which p for non-median base variables is set to 100: https://github.com/NYCPlanning/db-factfinder/blob/41a7f2f9283de40e367f0332704d74a69c1cfd5a/factfinder/calculate.py#L341-L345 Up until this last-minute cleaning exception, avghhsooc p is NULL (it is calculated in this if-statement). This is because the variable is a special variable, it serves as a base variable for another variable, and the geoid is not city or borough. What other exceptions (beyond medians) do we need to include when setting p to 100 for base variables? PR #204 is our best guess, which also excludes special variables.
Cases of missing MOE seem to stem from the 2010 to 2020 conversion, where the 2020 output tract has more than one input tract. For example, the 2020 geoid 36085009702 comes from 2010 geoids 36085008100 and 36085009700, with ratios of roughly 0.000698 and 0.000000 respectively. When converting the MOEs for 36085008100 and 36085009700, one is NULL (36085009700, because its ratio is 0). As it is currently implemented, the sum-of-squares formula for combining a NULL moe with a non-NULL moe to roll up the converted input tracts into a single record for 2020 geoid 36085009702 results in NULL.
math.sqrt(sum([i ** 2 for i in [np.nan, 2.494853]]))  # evaluates to nan
Should we ignore NULLs in the sum-of-squares formula? PR #202 implements this change for all aggregated geographies. Should this only apply to combining 2010 tracts into the merged 2020 tracts?
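One way to frame the question is as a flag on the roll-up function. The sketch below is an illustration only, not the change in PR #202: `combine_moe`, `is_null` and `ignore_null` are hypothetical names, and returning NaN when every input is NULL is an assumption.

```python
import math


def is_null(moe):
    """Treat both None and NaN as a NULL margin of error."""
    return moe is None or math.isnan(moe)


def combine_moe(moes, ignore_null=True):
    """Combine MOEs with the root-sum-of-squares formula.

    ignore_null=True drops NULL inputs before combining;
    ignore_null=False propagates NULL, as in the current behaviour described above.
    """
    if ignore_null:
        kept = [m for m in moes if not is_null(m)]
        if not kept:
            return math.nan  # assumption: nothing to combine stays NULL
    else:
        if any(is_null(m) for m in moes):
            return math.nan
        kept = list(moes)
    return math.sqrt(sum(m ** 2 for m in kept))


print(combine_moe([math.nan, 2.494853]))                     # about 2.494853
print(combine_moe([math.nan, 2.494853], ignore_null=False))  # nan
```

Keeping the behaviour behind a parameter would let the 2010-to-2020 tract merge ignore NULLs while other aggregated geographies keep propagating them, if that turns out to be the preferred scope.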
Discrepancies by percent difference
More than 10% difference, defined as ((code value - sample value)/(sample value)) * 100:
Between 1% and 10% difference:
Differences in NULLS:
Remaining edge-cases (as of merge of #188)
e
m
p
z