getmovement / deprecated-movement-rails-api

DEPRECATED Rails API for getmovement.org

Import shapefiles for congressional districts #51

Open joshsmith opened 8 years ago

joshsmith commented 8 years ago

Making notes here for now. Using the TIGER/Line® 2015 data from the US Census Bureau for the 114th Congress.

Using this blog post for reference: https://devmynd.com/blog/2013-7-postgis-rails-on-heroku/

Extracted the projection definition from the .prj file (the WKT below doesn't carry an EPSG code itself, but it corresponds to NAD83, i.e. EPSG:4269):

GEOGCS["GCS_North_American_1983",DATUM["D_North_American_1983",SPHEROID["GRS_1980",6378137,298.257222101]],PRIMEM["Greenwich",0],UNIT["Degree",0.017453292519943295]]
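A quick way to inspect that WKT from Ruby, as a sketch using the rgeo gem's CoordSys parser (not something done in the thread; assumes RGeo::CoordSys::CS.create_from_wkt, which rgeo of this era provides):

require "rgeo"

# Parse the .prj WKT into a coordinate system object and read its name
wkt = File.read("tl_2015_us_cd114/tl_2015_us_cd114.prj")
cs = RGeo::CoordSys::CS.create_from_wkt(wkt)
puts cs.name # => "GCS_North_American_1983"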
joshsmith commented 8 years ago

We'll need the following:

joshsmith commented 8 years ago

The SRID does appear to ship by default with PostGIS:

=> {"srid"=>"4269",
 "auth_name"=>"EPSG",
 "auth_srid"=>"4269",
 "srtext"=>
  "GEOGCS[\"NAD83\",DATUM[\"North_American_Datum_1983\",SPHEROID[\"GRS 1980\",6378137,298.257222101,AUTHORITY[\"EPSG\",\"7019\"]],TOWGS84[0,0,0,0,0,0,0],AUTHORITY[\"EPSG\",\"6269\"]],PRIMEM[\"Greenwich\",0,AUTHORITY[\"EPSG\",\"8901\"]],UNIT[\"degree\",0.0174532925199433,AUTHORITY[\"EPSG\",\"9122\"]],AUTHORITY[\"EPSG\",\"4269\"]]",
 "proj4text"=>"+proj=longlat +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +no_defs "}
joshsmith commented 8 years ago

The SRID above corresponds to this spatial reference entry online: http://spatialreference.org/ref/epsg/4269/

joshsmith commented 8 years ago

Was able to successfully import the shapefile with the following command, where -G imports into a geography column and -s 4269:4326 reprojects from NAD83 to WGS 84: shp2pgsql -G -s 4269:4326 ~/Downloads/tl_2015_us_cd114/tl_2015_us_cd114.shp | psql -d movement_dev

Here's the database structure:

[5] pry(main)> ActiveRecord::Base.connection.columns("tl_2015_us_cd114")
=> [#<ActiveRecord::ConnectionAdapters::PostGIS::SpatialColumn:0x007fa03edc1578
  @array=false,
  @cast_type=#<ActiveRecord::ConnectionAdapters::PostgreSQL::OID::Integer:0x007fa03edc18c0 @limit=nil, @precision=nil, @range=-2147483648...2147483648, @scale=nil>,
  @default=nil,
  @default_function="nextval('tl_2015_us_cd114_gid_seq'::regclass)",
  @geographic=false,
  @name="gid",
  @null=false,
  @sql_type="integer",
  @table_name="tl_2015_us_cd114">,
 #<ActiveRecord::ConnectionAdapters::PostGIS::SpatialColumn:0x007fa03edc0ee8
  @array=false,
  @cast_type=#<ActiveRecord::Type::String:0x007fa03edc1078 @limit=2, @precision=nil, @scale=nil>,
  @default=nil,
  @default_function=nil,
  @geographic=false,
  @name="statefp",
  @null=true,
  @sql_type="character varying(2)",
  @table_name="tl_2015_us_cd114">,
 #<ActiveRecord::ConnectionAdapters::PostGIS::SpatialColumn:0x007fa03edc0d58
  @array=false,
  @cast_type=#<ActiveRecord::Type::String:0x007fa03edc1078 @limit=2, @precision=nil, @scale=nil>,
  @default=nil,
  @default_function=nil,
  @geographic=false,
  @name="cd114fp",
  @null=true,
  @sql_type="character varying(2)",
  @table_name="tl_2015_us_cd114">,
 #<ActiveRecord::ConnectionAdapters::PostGIS::SpatialColumn:0x007fa03edc0790
  @array=false,
  @cast_type=#<ActiveRecord::Type::String:0x007fa03edc0920 @limit=4, @precision=nil, @scale=nil>,
  @default=nil,
  @default_function=nil,
  @geographic=false,
  @name="geoid",
  @null=true,
  @sql_type="character varying(4)",
  @table_name="tl_2015_us_cd114">,
 #<ActiveRecord::ConnectionAdapters::PostGIS::SpatialColumn:0x007fa03edc01c8
  @array=false,
  @cast_type=#<ActiveRecord::Type::String:0x007fa03edc0358 @limit=41, @precision=nil, @scale=nil>,
  @default=nil,
  @default_function=nil,
  @geographic=false,
  @name="namelsad",
  @null=true,
  @sql_type="character varying(41)",
  @table_name="tl_2015_us_cd114">,
 #<ActiveRecord::ConnectionAdapters::PostGIS::SpatialColumn:0x007fa03edc0038
  @array=false,
  @cast_type=#<ActiveRecord::Type::String:0x007fa03edc1078 @limit=2, @precision=nil, @scale=nil>,
  @default=nil,
  @default_function=nil,
  @geographic=false,
  @name="lsad",
  @null=true,
  @sql_type="character varying(2)",
  @table_name="tl_2015_us_cd114">,
 #<ActiveRecord::ConnectionAdapters::PostGIS::SpatialColumn:0x007fa04100ab70
  @array=false,
  @cast_type=#<ActiveRecord::Type::String:0x007fa04100b2c8 @limit=3, @precision=nil, @scale=nil>,
  @default=nil,
  @default_function=nil,
  @geographic=false,
  @name="cdsessn",
  @null=true,
  @sql_type="character varying(3)",
  @table_name="tl_2015_us_cd114">,
 #<ActiveRecord::ConnectionAdapters::PostGIS::SpatialColumn:0x007fa03edcfe20
  @array=false,
  @cast_type=#<ActiveRecord::Type::String:0x007fa03edcffb0 @limit=5, @precision=nil, @scale=nil>,
  @default=nil,
  @default_function=nil,
  @geographic=false,
  @name="mtfcc",
  @null=true,
  @sql_type="character varying(5)",
  @table_name="tl_2015_us_cd114">,
 #<ActiveRecord::ConnectionAdapters::PostGIS::SpatialColumn:0x007fa03edcf858
  @array=false,
  @cast_type=#<ActiveRecord::Type::String:0x007fa03edcf9e8 @limit=1, @precision=nil, @scale=nil>,
  @default=nil,
  @default_function=nil,
  @geographic=false,
  @name="funcstat",
  @null=true,
  @sql_type="character varying(1)",
  @table_name="tl_2015_us_cd114">,
 #<ActiveRecord::ConnectionAdapters::PostGIS::SpatialColumn:0x007fa03edcf1c8
  @array=false,
  @cast_type=#<ActiveRecord::ConnectionAdapters::PostgreSQL::OID::Float:0x007fa03e386d10 @limit=nil, @precision=nil, @scale=nil>,
  @default=nil,
  @default_function=nil,
  @geographic=false,
  @name="aland",
  @null=true,
  @sql_type="double precision",
  @table_name="tl_2015_us_cd114">,
 #<ActiveRecord::ConnectionAdapters::PostGIS::SpatialColumn:0x007fa03edcf038
  @array=false,
  @cast_type=#<ActiveRecord::ConnectionAdapters::PostgreSQL::OID::Float:0x007fa03e386d10 @limit=nil, @precision=nil, @scale=nil>,
  @default=nil,
  @default_function=nil,
  @geographic=false,
  @name="awater",
  @null=true,
  @sql_type="double precision",
  @table_name="tl_2015_us_cd114">,
 #<ActiveRecord::ConnectionAdapters::PostGIS::SpatialColumn:0x007fa03edcea70
  @array=false,
  @cast_type=#<ActiveRecord::Type::String:0x007fa03edcec00 @limit=11, @precision=nil, @scale=nil>,
  @default=nil,
  @default_function=nil,
  @geographic=false,
  @name="intptlat",
  @null=true,
  @sql_type="character varying(11)",
  @table_name="tl_2015_us_cd114">,
 #<ActiveRecord::ConnectionAdapters::PostGIS::SpatialColumn:0x007fa03edce4a8
  @array=false,
  @cast_type=#<ActiveRecord::Type::String:0x007fa03edce638 @limit=12, @precision=nil, @scale=nil>,
  @default=nil,
  @default_function=nil,
  @geographic=false,
  @name="intptlon",
  @null=true,
  @sql_type="character varying(12)",
  @table_name="tl_2015_us_cd114">,
 #<ActiveRecord::ConnectionAdapters::PostGIS::SpatialColumn:0x007fa03edcd0f8
  @array=false,
  @cast_type=
   #<ActiveRecord::ConnectionAdapters::PostGIS::OID::Spatial:0x007fa03edcd4e0
    @geo_type="MultiPolygon",
    @has_m=false,
    @has_z=false,
    @sql_type="geography(MultiPolygon,4326)",
    @srid=4326>,
  @default=nil,
  @default_function=nil,
  @geographic=true,
  @geometric_type=RGeo::Feature::MultiPolygon,
  @has_m=false,
  @has_z=false,
  @limit={:srid=>4326, :type=>"multi_polygon", :geographic=>true},
  @name="geog",
  @null=true,
  @sql_type="geography(MultiPolygon,4326)",
  @srid=4326,
  @table_name="tl_2015_us_cd114">]
joshsmith commented 8 years ago

An example record:

{"gid"=>"1",
 "statefp"=>"29",
 "cd114fp"=>"08",
 "geoid"=>"2908",
 "namelsad"=>"Congressional District 8",
 "lsad"=>"C2",
 "cdsessn"=>"114",
 "mtfcc"=>"G5200",
 "funcstat"=>"N",
 "aland"=>"51543589873",
 "awater"=>"444586815",
 "intptlat"=>"+37.1941281",
 "intptlon"=>"-090.9345152",
 "geog"=>
  "0106000020E6100..."}
joshsmith commented 8 years ago

It appears that the columns of most interest to us are statefp (the FIPS code identifying the state or administrative area), cd114fp (the congressional district number), cdsessn (the Congressional session), and geog (which appears to be a MultiPolygon).

I also had some concerns about whether these polygons might overlap, but the Census Bureau indicates that its data is well-tested and that a given point should always fall within exactly one polygon.

joshsmith commented 8 years ago

It seems like it would be a good idea to have a pipeline for taking this information: referencing the FIPS codes to get alpha codes and full state names, cleaning up the congressional district numbers, and keeping the polygon data. Everything else appears less useful: land and water areas, TIGER codes, lat/long of internal points, and other data without a clear use case for us.

joshsmith commented 8 years ago

Playing around with this, I managed to get it to work for my own address using lat/long. A couple of gotchas: I needed to cast the geog column to ::geometry using the PG shorthand, and ST_MakePoint takes (long, lat), which was unexpected. Also good to note that in ST_Contains the containing polygon goes first, then the point.

ActiveRecord::Base.connection.execute("SELECT * FROM tl_2015_us_cd114 WHERE ST_Contains(tl_2015_us_cd114.geog::geometry, ST_SetSRID(ST_MakePoint(-117.117029, 32.752756),4326))")
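The same lookup sketched as a hypothetical model method backed by the imported table (the name containing_point is made up; binding the coordinates avoids interpolating them into the SQL):

def self.containing_point(latitude, longitude)
  # ST_MakePoint takes (long, lat); cast geog to geometry for ST_Contains
  where("ST_Contains(#{table_name}.geog::geometry, ST_SetSRID(ST_MakePoint(?, ?), 4326))",
        longitude, latitude)
end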

My address is a good test case, too, since I'm nearly right at the border of CD51 in California. http://ziplook.house.gov/htbin/findrep?ADDRLK85319111085319111

In other words, the spike works! We just need to go about cleaning things up and figuring out how we will query this data regularly.

Also, this query takes about 180ms on average.

joshsmith commented 8 years ago

Note to self: doing "08".to_i works just fine. Don't need to do any string substitution to strip the leading zero.
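For example (String#to_i parses as decimal, unlike Integer(), which would treat the leading zero as octal and raise):

"08".to_i     # => 8
"00".to_i     # => 0
Integer("08") # => ArgumentError: invalid value for Integer(): "08"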

joshsmith commented 8 years ago

We need to use the -a flag as follows: shp2pgsql -G -s 4269:4326 -a ~/Downloads/tl_2015_us_cd114/tl_2015_us_cd114.shp imported_congressional_districts | psql -d movement_dev

This appends data to the current table. One problem with this approach is that it will duplicate data if run more than once. Not quite sure how to avoid this situation.
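(One untested alternative, not from the thread: shp2pgsql's -d mode drops and recreates the table before loading, which would avoid duplicates on re-import: shp2pgsql -G -s 4269:4326 -d ~/Downloads/tl_2015_us_cd114/tl_2015_us_cd114.shp imported_congressional_districts | psql -d movement_dev)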

joshsmith commented 8 years ago

Managed to get importing working with the rgeo importer as follows:

RGeo::Shapefile::Reader.open('tl_2015_us_cd114/tl_2015_us_cd114.shp') do |file|
  puts "File contains #{file.num_records} records."
  file.each do |record|
    subdivision = UnitedStatesSubdivision.find_by_fips_code(record.attributes["STATEFP"])
    # Skip records with no matching state, no geometry, or the "ZZ" water areas
    if subdivision.present? && record.geometry.present? && record.attributes["CD114FP"] != "ZZ"
      CongressionalDistrict.create(
        united_states_subdivision: subdivision,
        number: record.attributes["CD114FP"].to_i,
        congress_session: record.attributes["CDSESSN"].to_i,
        state_postal_abbreviation: subdivision.postal_abbreviation,
        state_name: subdivision.name,
        area: record.geometry,
      )
    end
  end
end

Had to add the Northern Mariana Islands, which I had left out somehow. Also noticed some data weirdness: I was getting 444 rows, but there are only 435 congressional districts.

Of these, three (3) are coded ZZ, which means bodies of water; we need to skip that data. Another six (6) are numbered 98, which means non-voting, and 444 - 3 - 6 = 435, so the counts check out. There are also seven (7) districts numbered 0, which means they are "at-large" and vote as an entire state for their congressperson. We should probably have an at_large column to represent this, and also a voting column to indicate whether they are truly represented in votes or simply caucus.
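A minimal migration sketch for those two flags (class and column names assumed, not yet in the codebase):

class AddAtLargeAndVotingToCongressionalDistricts < ActiveRecord::Migration
  def change
    # At-large districts are numbered 0 in the shapefile; non-voting ones are numbered 98
    add_column :congressional_districts, :at_large, :boolean, null: false, default: false
    add_column :congressional_districts, :voting, :boolean, null: false, default: true
  end
end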

joshsmith commented 8 years ago

More notes:

I can get the SQL working via the following method on my CongressionalDistrict model:

  def self.containing(latitude, longitude)
    point = "SRID=4326;POINT(#{longitude} #{latitude})"
    where("#{self.table_name}.area && ?", point)
  end

This is ~170x faster than the raw ST_Contains and ST_Intersects queries. I ran EXPLAIN to find out why. On the raw queries, we get:

Seq Scan on congressional_districts (cost=0.00..173.78 rows=41 width=154000)

But the query plan for the && operator in PostGIS yields the following:

Index Scan using index_congressional_districts_on_area on congressional_districts  (cost=0.14..8.16 rows=1 width=4)
   Index Cond: (area && '0101000020E61000007E1CCD91956253C0AED85F764FC24340'::geography)

So the index I created using add_index :congressional_districts, :area, using: :gist appears to be used by the && query only.
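For reference, a sketch of that index in migration form (class name assumed):

class IndexCongressionalDistrictsOnArea < ActiveRecord::Migration
  def change
    # GiST index on the spatial column, enabling index-backed bounding-box scans
    add_index :congressional_districts, :area, using: :gist
  end
end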

joshsmith commented 8 years ago

Testing the second query (with &&), there appears to be a problem, though: for my own lat/long I get three different congressional districts as results. In hindsight that makes sense, since && only compares bounding boxes rather than the actual polygons, so any district whose bounding box covers the point will match. Not really what I'm looking for.

joshsmith commented 8 years ago

Okay, so if I make the column a geometry with srid: 4326 and explicitly set the srid option on RGeo::Shapefile::Reader.open, then we're golden. Without doing so, it looks like you'd get an SRID of 0, which is not ideal if we want to reproject using Mercator or something.

RGeo::Shapefile::Reader.open('tl_2015_us_cd114/tl_2015_us_cd114.shp', srid: 4326) do |file|
  puts "File contains #{file.num_records} records."
  file.each do |record|
    subdivision = UnitedStatesSubdivision.find_by_fips_code(record.attributes["STATEFP"])
    # Skip records with no matching state, no geometry, or the "ZZ" water areas
    if subdivision.present? && record.geometry.present? && record.attributes["CD114FP"] != "ZZ"
      CongressionalDistrict.create(
        united_states_subdivision: subdivision,
        number: record.attributes["CD114FP"].to_i,
        congress_session: record.attributes["CDSESSN"].to_i,
        state_postal_abbreviation: subdivision.postal_abbreviation,
        state_name: subdivision.name,
        polygon: record.geometry, # now a geometry column tagged with SRID 4326
      )
    end
  end
end
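A quick sanity check that the SRID took (RGeo geometries expose their SRID):

CongressionalDistrict.first.polygon.srid # => 4326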

Then I can simply do the following, noting that we have to explicitly set the SRID in the point literal or the query will fail to run:

  def self.for_location(latitude, longitude)
    point = "SRID=4326;POINT(#{longitude} #{latitude})"
    where("ST_Intersects(#{self.table_name}.polygon, ?)", point)
  end
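Usage, with the test coordinates from earlier (should return exactly one district):

CongressionalDistrict.for_location(32.752756, -117.117029)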

This gets our time down to ~4ms, which is good enough to move on at the congressional district level. (ST_Intersects can still use the GiST index for its bounding-box prefilter and then runs the exact containment test, so we get the speed of && without its false positives.)

I do have an open question over on the GIS StackExchange: "Are there any addresses that straddle two or more congressional districts?" Note that if you use a point along the bounding line of a polygon you can get 2+ results. I have to imagine this would be a doozy if there were a real address here:

[screenshot, 2016-04-10 11:22 AM]

It does appear that congressional districts are drawn along census blocks, so the lines go down the middle of streets and highways where appropriate. That should mean an address won't end up in the wrong district.

Another thought that just came to mind: we'll need to deal with lat/long for congressional districts and match them to voters when coming up with walking lists. This is going to be problematic, since I think the voter files either lack lat/long information entirely or, where they have it, it may not be accurate at all.

begedin commented 8 years ago

This is still something I need to learn about myself, but right now I have two questions:

  1. If I understand everything correctly, we will be able to assign a user to a district in a sufficiently performant way no matter what, and it's completely accurate provided they aren't standing on a district border?
  2. For walking lists, does the district matter? Are we constraining volunteers to just visit addresses in their own district, or within a certain radius of their location?
joshsmith commented 8 years ago

@begedin:

  1. Yes, provided their location accurately falls within the borders of the polygon.
  2. Yes, districts matter for walking lists when it comes to knowing which voters to talk to in a congressional or state legislative election, and even more so the more hyperlocal you get. You could otherwise end up talking to someone on the wrong side of the street entirely:

[screenshot, 2016-04-11 8:45 AM]

begedin commented 8 years ago

In that case, my guess is we'd be querying voters by current district AND within a certain distance from the volunteer's location?

Since you say voter files don't really have accurate lat/long information, I guess that data would start out inaccurate and then get better with time? Maybe we could initially use address information to get lat/long from another service? Since that tends not to change, we could then cache it in our database.
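A sketch of that geocode-once-and-cache idea, assuming a hypothetical Voter model with address, latitude, and longitude columns and the geocoder gem (none of which exist in this codebase yet):

# Geocode a voter's address once and persist the coordinates
def cache_coordinates!(voter)
  return if voter.latitude.present? && voter.longitude.present?
  result = Geocoder.search(voter.address).first
  voter.update(latitude: result.latitude, longitude: result.longitude) if result
end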

Also, do we want to cache districts as well, or would you say the performance of figuring out the district is good enough not to bother?

joshsmith commented 8 years ago

@begedin that's right, that's how the query would run. I think it's going to be optimized enough based on what I'm seeing so far; 4ms is far from bad.

Regarding location data, the problem is that we don't have it in the voter file, and getting it from another service is expensive. We have to plan for that now. The cost will be $X0,000 at least.