WhiteHouse / ndoch-hackathon

Census & We the People Mashup #2

Closed jmandzik closed 11 years ago

jmandzik commented 11 years ago

I just watched the video presentations from the previous Hackathon and I'm beginning to think twice about Yet-Another-Map® as the only major feature. On that same trajectory, but not done before to my knowledge, is marrying petitions to geospatial and socioeconomic datasets. I'm looking into it today, but the general idea is that each signature's ZIP code maps to a ZCTA (ZIP Code Tabulation Area), which we should be able to query from the Census API. From there, I should be able to tell you race, median income, ages, etc.

I'm imagining a dashboard tool that has a map (with clickable areas?) that will pull up some dimensional charting about the area in question. By tonight I should have an idea of how this all fits together from a data perspective, ideally with a few charts to show metrics about ZIP codes that came from the WTP sig data.

I'm flexible on tech stack for anyone who wants to help, but left to my own devices I'm going to start with AngularJS, d3, and Leaflet/Google Maps on the client side with a node.js backend.

jmandzik commented 11 years ago

I'll be available via Google+ most nights if anyone wants to talk shop. Hit me up at justin.mandzik@gmail.com... don't be shy :)

jmandzik commented 11 years ago

I abandoned ZCTA for a few accuracy reasons, but mostly because I found a simpler path to Census info:

Signature ZIP -> Geocode to Lat/Long (via Google Maps?)
Lat/Long -> FCC Census Block conversion API, yielding a FIPS number
FIPS number -> Census API
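A minimal sketch of that chain, for anyone who wants to poke at it offline. The two network steps are passed in as plain functions, so nothing here claims to know the Google or FCC request formats. The names `split_block_fips`, `zip_to_census_geography`, `geocode`, and `block_lookup` are hypothetical. The only fixed fact it leans on is the standard layout of a 15-digit census block FIPS code: 2 digits of state, 3 of county, 6 of tract, 4 of block.

```python
from typing import Callable, NamedTuple, Tuple


class BlockFips(NamedTuple):
    state: str   # 2 digits
    county: str  # 3 digits
    tract: str   # 6 digits
    block: str   # 4 digits


def split_block_fips(fips: str) -> BlockFips:
    """Split a 15-digit census block FIPS code into its components."""
    if len(fips) != 15 or not fips.isdigit():
        raise ValueError(f"expected a 15-digit block FIPS code, got {fips!r}")
    return BlockFips(fips[:2], fips[2:5], fips[5:11], fips[11:])


def zip_to_census_geography(
    zip_code: str,
    geocode: Callable[[str], Tuple[float, float]],
    block_lookup: Callable[[float, float], str],
) -> BlockFips:
    """ZIP -> lat/long -> block FIPS -> components usable in a Census query."""
    lat, lon = geocode(zip_code)
    return split_block_fips(block_lookup(lat, lon))


# Stand-in lookups so the sketch runs offline; both return hard-coded
# sample values, not real API responses.
def fake_geocode(zip_code: str) -> Tuple[float, float]:
    return (38.8977, -77.0365)


def fake_block_lookup(lat: float, lon: float) -> str:
    return "110010062021000"  # placeholder FIPS, not a looked-up value


print(zip_to_census_geography("20500", fake_geocode, fake_block_lookup))
```

Swapping the two stand-ins for real HTTP calls is the only change needed to run it against live services, and keeping them injectable makes it easy to put a cache in front of either step.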

Precomputing the lat/long for each ZIP would certainly speed up execution; same for FIPS. For demo purposes, maybe pre-grab the FIPS for popular ZIP codes. We'll need a DB for some of the other statistical analysis anyway, so it wouldn't be hard to extend the WTP bulk data tables to store some of this. I'm importing the data now (about 1.1 GB), but the thought process in SQL pseudo-code is:

select petition, zips, fips, count(*) from petition_table inner join sigs_table group by petition, zips, fips
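Here is a runnable version of that idea using an in-memory SQLite database. Everything about the schema is an assumption: the column names (`id`, `petition_id`, `zip`), the join key, and the extra `zip_fips` table for precomputed lookups are all made up for illustration, and the rows are toy data rather than anything from the WTP bulk dump.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical schema; the real WTP bulk tables may look different.
    CREATE TABLE petition_table (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE sigs_table (petition_id INTEGER, zip TEXT);
    CREATE TABLE zip_fips (zip TEXT PRIMARY KEY, fips TEXT);  -- precomputed lookups
""")
conn.executemany(
    "INSERT INTO petition_table VALUES (?, ?)",
    [(1, "Petition A"), (2, "Petition B")],
)
conn.executemany(
    "INSERT INTO sigs_table VALUES (?, ?)",
    [(1, "20500"), (1, "20500"), (1, "10001"), (2, "10001")],
)
# Only one ZIP has a precomputed FIPS; the value is a placeholder.
conn.execute("INSERT INTO zip_fips VALUES (?, ?)", ("20500", "110010062021000"))

# Signature counts per petition and ZIP, with the cached FIPS where we have one.
rows = conn.execute("""
    SELECT p.title, s.zip, z.fips, COUNT(*) AS signatures
    FROM petition_table AS p
    INNER JOIN sigs_table AS s ON s.petition_id = p.id
    LEFT JOIN zip_fips AS z ON z.zip = s.zip
    GROUP BY p.id, p.title, s.zip, z.fips
    ORDER BY p.id, signatures DESC, s.zip
""").fetchall()

for row in rows:
    print(row)
```

The `LEFT JOIN` keeps ZIP codes that have no cached FIPS yet, so the rows with a missing `fips` double as a to-do list for the geocoding step.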

If a hypothetical petition has half a dozen ZIP codes that make up statistically significant portions of the signature base, we can query the Census API to make guesses about income, education, and the rest of it.
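One rough way to pick those ZIP codes, sketched under a big simplifying assumption: "statistically significant" is approximated by a plain share-of-signatures cutoff, not an actual significance test. The function name `top_zips` and the 10% default threshold are made up for illustration.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def top_zips(zips: Iterable[str], min_share: float = 0.10) -> List[Tuple[str, float]]:
    """Return (zip, share) pairs whose share of all signatures is >= min_share.

    Results are ordered from the largest share to the smallest.
    """
    counts = Counter(zips)
    total = sum(counts.values())
    if total == 0:
        return []
    return [
        (zip_code, n / total)
        for zip_code, n in counts.most_common()
        if n / total >= min_share
    ]


# Toy signature list for one petition: one ZIP code per signature.
sample = ["20500"] * 6 + ["10001"] * 3 + ["94103"] * 1
for zip_code, share in top_zips(sample, min_share=0.2):
    print(zip_code, f"{share:.0%}")
```

Only the ZIP codes that clear the cutoff would then need a Census lookup, which keeps the number of API calls per petition small.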

All this boils down to answering, as best we can, who cares about these issues. I also have Congressional district data we can mash in to make this a tool for legislators looking to understand the issues their constituents care about.

jmandzik commented 11 years ago

Proof of concept repo: https://github.com/jmandzik/whitehouse-hackathon

jmandzik commented 11 years ago

Bare-bones proof of connectivity (WTP -> Google -> FCC -> Census) can be seen here: hackathon.mandzik.org. Click an issue, then click a ZIP code when it loads. It pulls down the male/female split from the census.
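For the last hop, a sketch of what a male/female query might look like. This is not the demo's actual code, and several things are assumptions to check against the Census API's own variable listing: the 2010 Decennial SF1 endpoint as it is published today, the variables `P012002` (male total) and `P012026` (female total), and querying at tract level. The helper name `sex_split_url` is hypothetical. It only builds the URL and makes no network call.

```python
from urllib.parse import urlencode

# Assumed endpoint for the 2010 Decennial Census Summary File 1.
CENSUS_SF1_URL = "https://api.census.gov/data/2010/dec/sf1"


def sex_split_url(state: str, county: str, tract: str, api_key: str = "") -> str:
    """Build a request URL for male and female totals in one census tract."""
    params = {
        "get": "P012002,P012026",           # assumed: male total, female total
        "for": f"tract:{tract}",
        "in": f"state:{state} county:{county}",
    }
    if api_key:
        params["key"] = api_key
    # Leave ':' and ',' unescaped so the URL stays readable; spaces become '+'.
    return f"{CENSUS_SF1_URL}?{urlencode(params, safe=':,')}"


print(sex_split_url("11", "001", "006202"))
```

The state, county, and tract arguments are the first three pieces of the 15-digit block FIPS code that comes back from the FCC step.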

ogglodyte commented 11 years ago

Another field to throw in there might be Federal spending in the location, so you'd have issue, demographics, and Federal $ sloshing around. There's an API (http://www.usaspending.gov/data). Query methods are in PHP.

jmandzik commented 11 years ago

@ogglodyte Awesome. Didn't know that existed! Will definitely grab that data.

nickcatal commented 11 years ago

Why are you avoiding ZCTA data? If you read about ZCTA data on the Census site it appears they are splitting census tracts that cross zip code boundaries by looking at individual addresses.

You'd have to imagine that's way more efficient than just finding what census tract Google happens to think is in the center. Plus you remove a bunch of unnecessary calls to unnecessary APIs.

jmandzik commented 11 years ago

There were a number of warnings on the census API about conflating ZIP with ZCTA, particularly this one:

"The relationship between ZIP Code and ZCTA can be determined fully only by comparing individual block-geocoded addresses to the ZCTAs. This process is quite involved. Some examples of why the process can become quite involved are as follows: ZCTAs follow census block boundaries. In contrast, USPS ZIP Codes serve addresses with no correlation to census block boundaries; therefore, the area covered by a ZCTA may include mailing addresses associated with ZIP Codes that are not the same as the ZCTA."

That said, I'm a little new to government data and if there's a relationship that can be relied on to infer accurate census data, I'm all for it. To be honest, congressional districts are probably more compelling (higher voter overlap), more meaningful to legislators ("who cares about what in MY district"), and you can still directly query the Census API (I think... I haven't tried yet). ZIP (and I think ZCTA) are too high resolution to effectively chunk groups of voters.

Just thinking out loud, really. I do appreciate the feedback!