AyOK-Code / oscn

1 stars 1 forks source link

Census Importer #247

Closed sgelbart closed 2 months ago

sgelbart commented 3 months ago

Work Done:

Census API Wrapper - https://app.asana.com/0/1206959369083189/1207304790278735/f Build out census Schema - https://app.asana.com/0/1206959369083189/1207304790278737/f Census Data Importers: Statistic - https://app.asana.com/0/1206959369083189/1207304790278739/f Census Data Importers: Data - https://app.asana.com/0/1206959369083189/1207304790278741/f

Sample calls

# import all the statistic details and data for all the statistics in the DP03 group for all the zips in Oklahoma County for 2021
# DP03 is a "profile" table meaning that they are for more general statistics
Importers::Census::Import.new(
          Census::Survey::ACS5,
          2021,
          group: 'DP03',
          zips: ZipCode::ZIPS_OKLAHOMA_COUNTY,
          table_type: 'profile'
)

#  import the statistic details and data for Oklahoma county for the 'B25039_002E' variable in 2021
Importers::Census::Import.new(
          Census::Survey::ACS5,
          2021,
          variables: ['B25039_002E'],
          county_names: [County::OKLAHOMA]
 )

Useful SQL for testing:

select distinct on (label) census_surveys.name,
                           census_surveys.year,
                           census_statistics.name,
                           label,
                           min_data.amount    as min_amount,
                           min_zip_codes.name as min_zip,
                           max_data.amount    as max_amount,
                           max_zip_codes.name as max_zip
from census_statistics
         join (select distinct on (statistic_id) *
               from census_data
               where amount::decimal > 1
               order by statistic_id, amount::decimal asc) as min_data on census_statistics.id = min_data.statistic_id
         join (select distinct on (statistic_id) *
               from census_data
               order by statistic_id, amount::decimal desc) as max_data on census_statistics.id = max_data.statistic_id
         join census_surveys on census_statistics.survey_id = census_surveys.id
         join zip_codes min_zip_codes on min_data.area_id = min_zip_codes.id
         join zip_codes max_zip_codes on max_data.area_id = max_zip_codes.id
order by label
github-actions[bot] commented 3 months ago

Simplecov Coverage

group name covered percent covered lines lines of code
Total 58.316 1974 3385
sgelbart commented 3 months ago

@holdenmitchell regarding zips - they're currently created using upsert. I did it this way because I planned on using the wildcard * to pull in all zips for whole counties at a time, in which case that table would just be self populating based on whichever counties Austin wanted zip data for long term. That list is also zip code tabulations (not might not be all zip codes).

I didn't use the wildcard in the end so I just have the list of OKC zips hardcoded as a constant (ZipCode:LZIPS_OKLAHOMA_COUNTY). It's not much harder to add the wildcard functionality, or I could add a column for counties and seed the DB with that, or both, whatever you prefer.

Also added a sample call to the PR. There's sample code in the import_spec.rb file as well (not the clearest spot). Can add to the readme too if you want.

holdenmitchell commented 2 months ago

@sgelbart Thanks! I will get this merged in after I run a couple of imports