NYCPlanning / db-acs

American Community Survey data processing for Population Fact Finder
4 stars 1 forks source link

percentage estimate set to 100 #15

Closed SPTKL closed 4 years ago

SPTKL commented 4 years ago
With a as (
select variable, count(*) as count from pff_social."Y2013-2017-old" 
where p = 100
GROUP BY variable),
b as (select variable, count(*) as count from pff_social."Y2013-2017" 
where p = 100
GROUP BY variable)
SELECT a.variable, a.count as old_count, b.count as new_count 
FROM a,b
WHERE a.variable = b.variable;

From this query we can see that for some variables we have fairly different results for the number of records with 100 as percentage estimate. image

Need to figure out if we need to set P=100 for certain variables by definition, if so, how to identify those variables

SPTKL commented 4 years ago

Also note that for many records, we have P > 100, if we use the current defined way to identify the total, e.g. CNI1864_1 for SI28

>>> lst = ['1400000US36085020700', '1400000US36085021300', '1400000US36085024700', '1400000US36085025100']
>>> df.loc[df.GEO_ID.isin(lst), ['DP02_0074E', 'DP02_0001E']]
938      3521.0      1787.0
940      1917.0      1012.0
941      4294.0      2133.0
954      3192.0      1566.0

it's clear that for each of these tracts, DP02_0074E > DP02_0001E, so do we have a special calculation rule here that we need to apply?

EricaMaurer commented 4 years ago

anything labeled a base in the data dictionary will be set to 100 in the percent column.