joshsmith opened 8 years ago

We'll need the following:

- state_code: e.g. WV, MD, CA
- state_name: e.g. West Virginia, Maryland, California
- number: 1..n
- geography: the geography data column for the containing polygon for the CD

The SRID does appear to ship by default with PostGIS:
=> {"srid"=>"4269",
"auth_name"=>"EPSG",
"auth_srid"=>"4269",
"srtext"=>
"GEOGCS[\"NAD83\",DATUM[\"North_American_Datum_1983\",SPHEROID[\"GRS 1980\",6378137,298.257222101,AUTHORITY[\"EPSG\",\"7019\"]],TOWGS84[0,0,0,0,0,0,0],AUTHORITY[\"EPSG\",\"6269\"]],PRIMEM[\"Greenwich\",0,AUTHORITY[\"EPSG\",\"8901\"]],UNIT[\"degree\",0.0174532925199433,AUTHORITY[\"EPSG\",\"9122\"]],AUTHORITY[\"EPSG\",\"4269\"]]",
"proj4text"=>"+proj=longlat +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +no_defs "}
The SRID above corresponds to this spatial reference entry online: http://spatialreference.org/ref/epsg/4269/
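The hash above looks like a row from PostGIS's spatial_ref_sys table; for reference, a minimal way to pull it from a Rails console (assuming a stock PostGIS install):

ActiveRecord::Base.connection.execute(
  "SELECT srid, auth_name, auth_srid, srtext, proj4text FROM spatial_ref_sys WHERE srid = 4269"
).first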
Was able to successfully import the shapefile with the following command:

shp2pgsql -G -s 4269:4326 ~/Downloads/tl_2015_us_cd114/tl_2015_us_cd114.shp | psql -d movement_dev
Here's the database structure:
[5] pry(main)> ActiveRecord::Base.connection.columns("tl_2015_us_cd114")
=> [#<ActiveRecord::ConnectionAdapters::PostGIS::SpatialColumn:0x007fa03edc1578
@array=false,
@cast_type=#<ActiveRecord::ConnectionAdapters::PostgreSQL::OID::Integer:0x007fa03edc18c0 @limit=nil, @precision=nil, @range=-2147483648...2147483648, @scale=nil>,
@default=nil,
@default_function="nextval('tl_2015_us_cd114_gid_seq'::regclass)",
@geographic=false,
@name="gid",
@null=false,
@sql_type="integer",
@table_name="tl_2015_us_cd114">,
#<ActiveRecord::ConnectionAdapters::PostGIS::SpatialColumn:0x007fa03edc0ee8
@array=false,
@cast_type=#<ActiveRecord::Type::String:0x007fa03edc1078 @limit=2, @precision=nil, @scale=nil>,
@default=nil,
@default_function=nil,
@geographic=false,
@name="statefp",
@null=true,
@sql_type="character varying(2)",
@table_name="tl_2015_us_cd114">,
#<ActiveRecord::ConnectionAdapters::PostGIS::SpatialColumn:0x007fa03edc0d58
@array=false,
@cast_type=#<ActiveRecord::Type::String:0x007fa03edc1078 @limit=2, @precision=nil, @scale=nil>,
@default=nil,
@default_function=nil,
@geographic=false,
@name="cd114fp",
@null=true,
@sql_type="character varying(2)",
@table_name="tl_2015_us_cd114">,
#<ActiveRecord::ConnectionAdapters::PostGIS::SpatialColumn:0x007fa03edc0790
@array=false,
@cast_type=#<ActiveRecord::Type::String:0x007fa03edc0920 @limit=4, @precision=nil, @scale=nil>,
@default=nil,
@default_function=nil,
@geographic=false,
@name="geoid",
@null=true,
@sql_type="character varying(4)",
@table_name="tl_2015_us_cd114">,
#<ActiveRecord::ConnectionAdapters::PostGIS::SpatialColumn:0x007fa03edc01c8
@array=false,
@cast_type=#<ActiveRecord::Type::String:0x007fa03edc0358 @limit=41, @precision=nil, @scale=nil>,
@default=nil,
@default_function=nil,
@geographic=false,
@name="namelsad",
@null=true,
@sql_type="character varying(41)",
@table_name="tl_2015_us_cd114">,
#<ActiveRecord::ConnectionAdapters::PostGIS::SpatialColumn:0x007fa03edc0038
@array=false,
@cast_type=#<ActiveRecord::Type::String:0x007fa03edc1078 @limit=2, @precision=nil, @scale=nil>,
@default=nil,
@default_function=nil,
@geographic=false,
@name="lsad",
@null=true,
@sql_type="character varying(2)",
@table_name="tl_2015_us_cd114">,
#<ActiveRecord::ConnectionAdapters::PostGIS::SpatialColumn:0x007fa04100ab70
@array=false,
@cast_type=#<ActiveRecord::Type::String:0x007fa04100b2c8 @limit=3, @precision=nil, @scale=nil>,
@default=nil,
@default_function=nil,
@geographic=false,
@name="cdsessn",
@null=true,
@sql_type="character varying(3)",
@table_name="tl_2015_us_cd114">,
#<ActiveRecord::ConnectionAdapters::PostGIS::SpatialColumn:0x007fa03edcfe20
@array=false,
@cast_type=#<ActiveRecord::Type::String:0x007fa03edcffb0 @limit=5, @precision=nil, @scale=nil>,
@default=nil,
@default_function=nil,
@geographic=false,
@name="mtfcc",
@null=true,
@sql_type="character varying(5)",
@table_name="tl_2015_us_cd114">,
#<ActiveRecord::ConnectionAdapters::PostGIS::SpatialColumn:0x007fa03edcf858
@array=false,
@cast_type=#<ActiveRecord::Type::String:0x007fa03edcf9e8 @limit=1, @precision=nil, @scale=nil>,
@default=nil,
@default_function=nil,
@geographic=false,
@name="funcstat",
@null=true,
@sql_type="character varying(1)",
@table_name="tl_2015_us_cd114">,
#<ActiveRecord::ConnectionAdapters::PostGIS::SpatialColumn:0x007fa03edcf1c8
@array=false,
@cast_type=#<ActiveRecord::ConnectionAdapters::PostgreSQL::OID::Float:0x007fa03e386d10 @limit=nil, @precision=nil, @scale=nil>,
@default=nil,
@default_function=nil,
@geographic=false,
@name="aland",
@null=true,
@sql_type="double precision",
@table_name="tl_2015_us_cd114">,
#<ActiveRecord::ConnectionAdapters::PostGIS::SpatialColumn:0x007fa03edcf038
@array=false,
@cast_type=#<ActiveRecord::ConnectionAdapters::PostgreSQL::OID::Float:0x007fa03e386d10 @limit=nil, @precision=nil, @scale=nil>,
@default=nil,
@default_function=nil,
@geographic=false,
@name="awater",
@null=true,
@sql_type="double precision",
@table_name="tl_2015_us_cd114">,
#<ActiveRecord::ConnectionAdapters::PostGIS::SpatialColumn:0x007fa03edcea70
@array=false,
@cast_type=#<ActiveRecord::Type::String:0x007fa03edcec00 @limit=11, @precision=nil, @scale=nil>,
@default=nil,
@default_function=nil,
@geographic=false,
@name="intptlat",
@null=true,
@sql_type="character varying(11)",
@table_name="tl_2015_us_cd114">,
#<ActiveRecord::ConnectionAdapters::PostGIS::SpatialColumn:0x007fa03edce4a8
@array=false,
@cast_type=#<ActiveRecord::Type::String:0x007fa03edce638 @limit=12, @precision=nil, @scale=nil>,
@default=nil,
@default_function=nil,
@geographic=false,
@name="intptlon",
@null=true,
@sql_type="character varying(12)",
@table_name="tl_2015_us_cd114">,
#<ActiveRecord::ConnectionAdapters::PostGIS::SpatialColumn:0x007fa03edcd0f8
@array=false,
@cast_type=
#<ActiveRecord::ConnectionAdapters::PostGIS::OID::Spatial:0x007fa03edcd4e0
@geo_type="MultiPolygon",
@has_m=false,
@has_z=false,
@sql_type="geography(MultiPolygon,4326)",
@srid=4326>,
@default=nil,
@default_function=nil,
@geographic=true,
@geometric_type=RGeo::Feature::MultiPolygon,
@has_m=false,
@has_z=false,
@limit={:srid=>4326, :type=>"multi_polygon", :geographic=>true},
@name="geog",
@null=true,
@sql_type="geography(MultiPolygon,4326)",
@srid=4326,
@table_name="tl_2015_us_cd114">]
An example record:
{"gid"=>"1",
"statefp"=>"29",
"cd114fp"=>"08",
"geoid"=>"2908",
"namelsad"=>"Congressional District 8",
"lsad"=>"C2",
"cdsessn"=>"114",
"mtfcc"=>"G5200",
"funcstat"=>"N",
"aland"=>"51543589873",
"awater"=>"444586815",
"intptlat"=>"+37.1941281",
"intptlon"=>"-090.9345152",
"geog"=>
"0106000020E6100..."}
It appears that the columns of most interest to us are statefp (the FIPS code for a particular area, used to determine the state or administrative area), cd114fp (the congressional district number), cdsessn (the Congressional session), and geog (which appears to be a MultiPolygon).
I also had some concerns about whether these polygons might overlap, but the Census Bureau indicates that their data is well-tested and a particular point should always fall into only one polygon.
It seems like it would be a good idea to have a pipeline that takes this information, maps the FIPS codes to alpha codes and full state names, cleans up the congressional district numbers, and keeps the polygon data. Everything else (land and water areas, TIGER codes, lat/long of internal points) does not appear to have a good use case for us.
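A rough sketch of that mapping step, assuming a simple lookup keyed by FIPS code (only a few states shown here; the import code further down uses a UnitedStatesSubdivision model for the same purpose):

# FIPS state codes mapped to postal abbreviation and full name (partial list for illustration).
FIPS_STATES = {
  "06" => { code: "CA", name: "California" },
  "24" => { code: "MD", name: "Maryland" },
  "54" => { code: "WV", name: "West Virginia" },
}

def state_for_fips(statefp)
  FIPS_STATES.fetch(statefp) # raises on unknown FIPS codes, e.g. territories not in the list
end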
Playing around with this, I managed to get it to work for my own address using lat/long. A couple of gotchas: I needed to CAST the geog column to ::geometry using the PG shorthand, ST_MakePoint takes (long, lat), which was unexpected, and in ST_Contains the containing polygon goes first, then the point.
ActiveRecord::Base.connection.execute("SELECT * FROM tl_2015_us_cd114 WHERE ST_Contains(tl_2015_us_cd114.geog::geometry, ST_SetSRID(ST_MakePoint(-117.117029, 32.752756),4326))")
My address is a good test case, too, since I'm nearly right at the border of CD51 in California. http://ziplook.house.gov/htbin/findrep?ADDRLK85319111085319111
In other words, the spike works! We just need to go about cleaning things up and figuring out how we will query this data regularly.
Also, this query takes about 180ms on average.
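For repeated use, the same lookup could be wrapped in a small helper so coordinates are coerced before being interpolated into the SQL; a rough sketch (the helper name is made up):

# Hypothetical helper around the raw query above; Float() guards against
# anything other than plain numbers ending up in the interpolated SQL.
def district_rows_containing(latitude, longitude)
  lat, lon = Float(latitude), Float(longitude)
  ActiveRecord::Base.connection.execute(
    "SELECT * FROM tl_2015_us_cd114 " \
    "WHERE ST_Contains(geog::geometry, ST_SetSRID(ST_MakePoint(#{lon}, #{lat}), 4326))"
  )
end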
Note to self: doing "08".to_i works just fine. Don't need to do any string substitution or anything.
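Quick check from a console:

"08".to_i  # => 8 (String#to_i ignores the leading zero; no octal parsing)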
We need to use the -a flag as follows:

shp2pgsql -G -s 4269:4326 -a ~/Downloads/tl_2015_us_cd114/tl_2015_us_cd114.shp imported_congressional_districts | psql -d movement_dev

This appends data to the current table. One problem with this approach is that it will duplicate data. Not quite sure how to avoid this situation.
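Not certain this is the right call, but two options to avoid the duplicates: shp2pgsql's -d flag drops and recreates the table instead of appending, or the staging table could be emptied before each append, e.g. from a Rails console:

# Clear the staging table before re-running the shp2pgsql -a | psql pipeline,
# so repeated imports don't stack duplicate rows.
ActiveRecord::Base.connection.execute("TRUNCATE imported_congressional_districts")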
Managed to get importing working with the rgeo importer as follows:

RGeo::Shapefile::Reader.open('tl_2015_us_cd114/tl_2015_us_cd114.shp') do |file|
  puts "File contains #{file.num_records} records."
  file.each do |record|
    subdivision = UnitedStatesSubdivision.find_by_fips_code(record.attributes["STATEFP"])
    if subdivision.present? && record.geometry.present? && record.attributes["CD114FP"] != "ZZ"
      CongressionalDistrict.create(
        united_states_subdivision: subdivision,
        number: record.attributes["CD114FP"].to_i,
        congress_session: record.attributes["CDSESSN"].to_i,
        state_postal_abbreviation: subdivision.postal_abbreviation,
        state_name: subdivision.name,
        area: record.geometry,
      )
    end
  end
end
Had to add the Northern Mariana Islands, which I left out somehow. Also noticed some data weirdness: I was getting 444 rows, but there are only 435 congressional districts.
Of these, three (3) are ZZ, which means bodies of water; we need to skip that data. Another six (6) are numbered 98, which means non-voting. There are also seven (7) districts numbered 0, which means they are "at-large" and vote as an entire state for their congressperson. We should probably have an at_large column to represent this, and also a voting column to indicate whether they are truly represented in votes or simply caucus.
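A rough migration sketch pulling together the columns discussed so far, plus the at_large and voting flags proposed above; table and column names here are guesses, not the final schema:

class CreateCongressionalDistricts < ActiveRecord::Migration
  def change
    create_table :congressional_districts do |t|
      t.references :united_states_subdivision, index: true
      t.string  :state_postal_abbreviation   # e.g. WV, MD, CA
      t.string  :state_name                  # e.g. West Virginia
      t.integer :number                      # CD114FP as an integer, 0 for at-large
      t.integer :congress_session            # CDSESSN, e.g. 114
      t.boolean :at_large, default: false    # the districts imported with number 0
      t.boolean :voting, default: true       # false for the "98" non-voting districts
      t.geometry :area, srid: 4326           # geometry column, per the SRID discussion further down
      t.timestamps null: false
    end

    add_index :congressional_districts, :area, using: :gist
  end
end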
More notes:
I can get the SQL here working by defining the following method on my CongressionalDistrict model.

def self.containing(latitude, longitude)
  point = "SRID=4326;POINT(#{longitude} #{latitude})"
  where("#{self.table_name}.area && ?", point)
end
This is ~170x faster than the raw ST_Contains and ST_Intersects queries. I ran EXPLAIN to find out why. On the raw queries, we get:

Seq Scan on congressional_districts (cost=0.00..173.78 rows=41 width=154000)

But the query plan for the && operator in PostGIS yields the following:

Index Scan using index_congressional_districts_on_area on congressional_districts (cost=0.14..8.16 rows=1 width=4)
Index Cond: (area && '0101000020E61000007E1CCD91956253C0AED85F764FC24340'::geography)

So the index I created using add_index :congressional_districts, :area, using: :gist appears to be working with this query only.
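For reference, the same plans can also be pulled straight from a Rails console with ActiveRecord's relation explain:

CongressionalDistrict.containing(32.752756, -117.117029).explain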
Testing the second query (with &&) there appears to be a problem, though. For my own lat/long I get three different congressional districts as results, which is not what I'm looking for; && only compares bounding boxes, not the actual polygons, so neighboring districts whose boxes overlap the point all match.
Okay, so if I make the column a geometry with srid: 4326 and explicitly set the srid option on Shapefile::Reader.open, then we're golden. Without doing so it looks like you'd get an SRID of 0, which is not ideal if we want to reproject using Mercator or something.
RGeo::Shapefile::Reader.open('tl_2015_us_cd114/tl_2015_us_cd114.shp', srid: 4326) do |file|
  puts "File contains #{file.num_records} records."
  file.each do |record|
    subdivision = UnitedStatesSubdivision.find_by_fips_code(record.attributes["STATEFP"])
    if subdivision.present? && record.geometry.present? && record.attributes["CD114FP"] != "ZZ"
      CongressionalDistrict.create(
        united_states_subdivision: subdivision,
        number: record.attributes["CD114FP"].to_i,
        congress_session: record.attributes["CDSESSN"].to_i,
        state_postal_abbreviation: subdivision.postal_abbreviation,
        state_name: subdivision.name,
        polygon: record.geometry,
      )
    end
  end
end
Then I can simply do the following, noting that we have to explicitly set the SRID or the query will fail to run:
def self.for_location(latitude, longitude)
  point = "SRID=4326;POINT(#{longitude} #{latitude})"
  where("ST_Intersects(#{self.table_name}.polygon, ?)", point)
end
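Called with the same test point as before:

CongressionalDistrict.for_location(32.752756, -117.117029).first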
This gets our time down to ~4ms, which is good enough to move on at the congressional district level.
I do have an open question over on the GIS StackExchange: Are there any addresses that straddle two or more congressional districts? Note that if you use a point exactly on a polygon's boundary line you can get 2+ results, which I imagine would be a doozy if a real address sat right on one of those lines.
It does appear that congressional districts are drawn along census blocks, so the lines go down the middle of streets and highways where appropriate, which should mean an address won't end up in the wrong district.
Another thought that just came to mind is that we'll need to deal with lat/long for congressional districts and match them to voters when coming up with walking lists. This is going to be problematic, since I think the voter files either don't have lat/long information at all or, where they do, it may not be accurate.
@begedin: This is still something I need to learn about myself, but right now I have 2 questions:
1. In that case, my guess is we'd be querying voters for the current district AND within a certain distance from the volunteer's location?
2. Since you say voter files don't really have accurate lat/long information, I guess that data would start out inaccurate and then get better with time? Maybe we could initially use address information to get lat/long from another service? Since that tends to not change, we could then cache it in our database.
Also, do we want to cache districts as well, or would you say the performance of figuring out the district is good enough not to do it?
@begedin that's right, that's how the query would run. I think it's going to be optimized enough based on what I'm working with so far. 4ms is far from bad.
Regarding location data, the problem is that we don't have the data from the voter file, and getting it from another service is expensive. We have to plan for that now. The cost will be $X0,000 at least.
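A minimal sketch of the geocode-and-cache idea from the comments above, assuming something like the geocoder gem and hypothetical latitude/longitude/full_address attributes on a Voter model:

# Hypothetical: resolve a voter's coordinates once and persist them, so the
# (expensive) external geocoding service is only hit for addresses we have
# never looked up before.
def coordinates_for(voter)
  return [voter.latitude, voter.longitude] if voter.latitude && voter.longitude

  lat, lon = Geocoder.coordinates(voter.full_address) # external service call
  voter.update(latitude: lat, longitude: lon) if lat && lon
  [lat, lon]
end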
Making notes here for now. Using the TIGER/Line® 2015 data from the US Census Bureau for the 114th Congress.
Using this blog post for reference: https://devmynd.com/blog/2013-7-postgis-rails-on-heroku/
Extracted the EPSG code (4269, NAD83) from the .prj file.