Load FIPS Ids & their corresponding Names & Types into API

jnmarcus commented 8 years ago

FIPS Ids are used nationwide at the city, county, and state level. Per @chellrocks's suggestion, and based on various onsite discussions, we had previously decided that using FIPS ids made the most sense, as it gives us a uniform identification nomenclature and doesn't force us to reinvent the wheel where we don't have to. Now we need to load this data into a consumable API :smile:

Based on data retrieved by @chellrocks from ??(source needed)?? this spreadsheet was used as a starting point to aggregate the state and county fips ids for California. Note that some numbers in the state tab are missing because they are US entities considered out of scope for this project. California's FIPS ID is 6, and the FIPS Ids of its counties can be found on the second tab of the spreadsheet.

As of right now, it looks like we have only logged fips data for California's counties, not cities, so there is more work to do on that end. That being said, @adborden and @tdooner it should be noted that the fips id you are currently loading in the front end for the 'city' of San Francisco (6075), is in fact the county fips id.

Additionally, it appears that there are multiple counties whose names are the same as a city's. We should probably take inventory of these and discuss how we are going to handle these cases.

jnmarcus commented 8 years ago

One idea I had for how the city and county data could look (or rather, the immediate data we would need for a city and/or county) was something like this:

{
  "countyName": "alameda",
  "type": "county",
  "fip_id": "6001",
  "hasCities": [
    {
      "cityName": "oakland",
      "type": "city",
      "fip_id": "6001_??",
      "ofCounty": {
        "countyName": "alameda",
        "type": "county",
        "fip_id": "6001"
      },
      "ofState": {
        "stateName": "california",
        "type": "state",
        "fip_id": "6"
      },
      "hasZipCodes": [
        {
          "zipcode": "94601",
          "type": "zipcode",
          "ofCity": {
            "cityName": "oakland",
            "type": "city",
            "fip_id": "6001_??"
          },
          "ofCounty": {
            "countyName": "alameda",
            "type": "county",
            "fip_id": "6001"
          }
        }
      ],
      "collectsCampaignFinanceData": true,
      "campaignFinanceDataSources": [
        { "name": "", "href": "" }
      ],
      "electionDataSummary": {
        "hasElectionData": true,
        "isOnline": true,
        "isPubliclyAccessible": true,
        "isMachineReadable": "",
        "pastElectionData": {
          "hasPastElectionData": true,
          "yearsPastElectionDataCollected": [
            {
              "year": "2014",
              "isFiledOnline": true,
              "isPubliclyAccessible": true,
              "isMachineReadable": true
            }
          ],
          "pastElectionDataSources": [
            { "name": "", "href": "" }
          ]
        },
        "upcomingElectionData": {
          "hasUpcomingElectionData": "",
          "isCollectingUpcomingElectionData": "",
          "dataCollectionStartDate": "DD/MM/YYYY",
          "dataCollectionEndDate": "DD/MM/YYYY",
          "dataFiledOnline": "",
          "dataPubliclyAccessible": "",
          "dataMachineReadable": "",
          "dataUpdateFrequency": ""
        }
      }
    }
  ]
}

mikeubell commented 8 years ago

Since San Francisco is both a county and a city, using the county FIPS is probably right.


bcipolli commented 8 years ago

Thanks @jnmarcus ! A few thoughts...

On FIPS codes:

FIPS codes should probably be added manually on an as-needed basis, since we only want to allow users to search for city/county/state data that we actually have in the database. So, as long as we have a proper Django model to represent the data, we can use the Django admin interface to add those codes (and state-county-city relationships).
Here's one place we can get city codes from.
Looks like zip codes should be added automatically, since it would be significant work for anybody to find and add all zip codes for a new city. Here's an example data source that could be scraped and saved, then queried later on an as-needed basis.

On your API JSON structure:

What's the use-case for the API endpoint? Is it possible that breaking it into two endpoints (one for representing hierarchy, the other for representing a full list of zip codes available) might make more sense?
Is there any reason to have an ofCounty value when the parent is the county? That data is redundant in the hierarchy; what is the need for representing it twice?
Same with the zip code information; is there any reason to keep the redundant information? (I really don't know :smile:)
Given the hierarchy idea, should we start with the state at the highest level?
If the redundancy is needed, then is there any reason not to have a flat JSON list returned, with each zip code at the highest level?
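To make the hierarchy question concrete, here is a minimal Python sketch of a state-first structure with no ofCounty/ofState back-references. It is only an illustration, not a proposal for the actual endpoint: the row layout, the `build_hierarchy` name, and the integer city ids are all assumptions (the thread has no city FIPS codes yet).

```python
import json

# Flat rows as they might come out of a database. Illustrative values only;
# the city ids are made-up internal ids, not real FIPS codes.
ROWS = [
    {"state": "california", "state_fips": "06",
     "county": "alameda", "county_fips": "06001",
     "city": "oakland", "city_id": 1},
    {"state": "california", "state_fips": "06",
     "county": "alameda", "county_fips": "06001",
     "city": "berkeley", "city_id": 2},
    {"state": "california", "state_fips": "06",
     "county": "san francisco", "county_fips": "06075",
     "city": "san francisco", "city_id": 3},
]


def build_hierarchy(rows):
    """Nest cities under counties under states, with no back-references."""
    states = {}
    for row in rows:
        state = states.setdefault(row["state_fips"], {
            "name": row["state"], "type": "state",
            "fips_id": row["state_fips"], "counties": {},
        })
        county = state["counties"].setdefault(row["county_fips"], {
            "name": row["county"], "type": "county",
            "fips_id": row["county_fips"], "cities": [],
        })
        county["cities"].append(
            {"name": row["city"], "type": "city", "id": row["city_id"]})
    # Turn the county lookup dicts into lists so the output is plain JSON arrays.
    return [
        {**{k: v for k, v in s.items() if k != "counties"},
         "counties": list(s["counties"].values())}
        for s in states.values()
    ]


hierarchy = build_hierarchy(ROWS)
print(json.dumps(hierarchy, indent=2))
```

Each city's county and state are implied by where it sits in the tree, so a consumer that needs them can read them from the parent rather than from a duplicated field.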

bcipolli commented 8 years ago

As far as Django models go, setting up the models seemed straightforward. Glad to hear if others have ideas for a different structure:

City model (fips_id PK, name, county_fips_id)
County model (fips_id PK, name, state_fips_id)
State model (fips_id PK, name)
Zip model (id PK, zip_code, city_fips_id) - where id is generated by Django.

To start, all could be added manually. We could add an issue to have a management command and/or form to assist with adding the data, including auto-add of all relevant zip codes.
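As a rough illustration of the four models above, here is the same shape written as a plain SQLite schema using Python's standard `sqlite3` module. This is a stand-in so the sketch runs without Django installed, not the project's actual models; the table names and the `PLACEHOLDER-OAKLAND` city key are assumptions (no real city code is known in this thread).

```python
import sqlite3

# One table per model listed above; the foreign keys mirror the *_fips_id fields.
SCHEMA = """
CREATE TABLE state  (fips_id TEXT PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE county (fips_id TEXT PRIMARY KEY, name TEXT NOT NULL,
                     state_fips_id TEXT NOT NULL REFERENCES state(fips_id));
CREATE TABLE city   (fips_id TEXT PRIMARY KEY, name TEXT NOT NULL,
                     county_fips_id TEXT NOT NULL REFERENCES county(fips_id));
CREATE TABLE zip    (id INTEGER PRIMARY KEY AUTOINCREMENT,
                     zip_code TEXT NOT NULL,
                     city_fips_id TEXT NOT NULL REFERENCES city(fips_id));
"""

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript(SCHEMA)

# Sample rows. The city key is a placeholder, not a real FIPS code.
conn.execute("INSERT INTO state VALUES ('06', 'california')")
conn.execute("INSERT INTO county VALUES ('06001', 'alameda', '06')")
conn.execute(
    "INSERT INTO city VALUES ('PLACEHOLDER-OAKLAND', 'oakland', '06001')")
conn.execute(
    "INSERT INTO zip (zip_code, city_fips_id) "
    "VALUES ('94601', 'PLACEHOLDER-OAKLAND')")

# Walk from a zip code up to its city, county, and state.
row = conn.execute("""
    SELECT city.name, county.name, state.name
    FROM zip
    JOIN city   ON city.fips_id = zip.city_fips_id
    JOIN county ON county.fips_id = city.county_fips_id
    JOIN state  ON state.fips_id = county.state_fips_id
    WHERE zip.zip_code = ?
""", ("94601",)).fetchone()
print(row)
```

Note that this layout ties each zip row to exactly one city, which is the simplest reading of the Zip model as described.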

adborden commented 8 years ago

@jnmarcus I only see state and county data in your spreadsheet, do we have fips for cities?

bcipolli commented 8 years ago

@adborden nope; I posted a link with city IDs, and some suggestions how to use them, in my follow-ups.

jnmarcus commented 8 years ago

@bcipolli a couple things :smile:

I'm not sure if we had fully decided how we were going to handle cities/counties (at least larger cities) that do not have data available. There was some talk about telling users that data was not available for the area they were looking for, with textual information on where or with whom they could follow up for more information, but I don't remember if a decision on that was reached (in general, or for v1).
your question about use-case for API endpoint...are you referring to what endpoint this data would be retrieved back from? (sorry if that's a noob question)
The ofCounty and ofState info was meant strictly for redundancy, because I don't know how much certainty we can have with this data. I.e., if it all gets disorganized and we need to piece it back together, do we need additional levels of redundancy in place in order to double- or triple-verify that we have the right info linked to the right place? (I really have no idea if this level of redundancy is necessary, I just know in Excel I would maybe have it, just in case...)

Also, I saw a lot of chatter about the zip codes. It's my personal opinion we don't need zip code info right away (in v1 at least), but that is just my opinion and we should probably collect that from the group. That being said, I'm not sure if I understand why the zipcode would be at the highest level, there's also more than one zip code generally associated with a city, so I'm not really sure if that would work...I'm also not sure if I understand correctly what you mean by highest level...can you elaborate? :stuck_out_tongue:

Let me know if I forgot to answer anything :smiley:

jnmarcus commented 8 years ago

Also, I found this resource, which may be helpful http://www.census.gov/geo/reference/ansi.html it also has the voting districts available...were we missing that?

bcipolli commented 8 years ago

your question about use-case for API endpoint...are you referring to what endpoint this data would be retrieved back from? (sorry if that's a noob question)

I just mean, what are the front-end actions (components?) that we're trying to support with the API you mocked up?

why the zipcode would be at the highest level

Ya, I think.... just forget that :) Info about the front-end components or actions that will consume this data can help figure out how to output it best.

tdooner commented 8 years ago

I'm not sure if I understand why the zipcode would be at the highest level, there's also more than one zip code generally associated with a city, so I'm not really sure if that would work

Wait, how are we thinking of mapping data from NetFile/Cal-Access into a jurisdiction without zip codes? My understanding was that we would assign a list of zip codes to a FIPS code, and then we could match up contributions based on those ZIPs.

Or if we associate committees to FIPS codes, then we wouldn't need to worry about ZIP codes as you mention, but we also wouldn't be able to know what to do with contributions that aren't to those associated committees.

bcipolli commented 8 years ago

If I understood well, each contribution is reported within a specific jurisdiction, and so will have the FIPS code of the jurisdiction whose data we pull. For independent expenditures, the same committee can spend money in multiple jurisdictions, so I think FIPS on the contribution itself is the right way to go.

@andrell81, any comments on how we might use (or not use) zip codes? Do you recall them being available on a per-transaction basis?

bcipolli commented 8 years ago

I will take this one.

bcipolli commented 8 years ago

@jnmarcus are FIPS unique across city vs. county vs. state, or only unique within the group? I.e. if I have a FIPS code, is that enough to identify exactly what it is, or do I also need to know if it's a county vs. city FIPS?

Just browsing here, seemed to suggest it's not fully unique, just under the sub-category. http://www.census.gov/geo/reference/codes/cousub.html

bcipolli commented 8 years ago

Ok, reviewing this county data: http://www2.census.gov/geo/docs/reference/codes/files/national_county.txt

FIPS codes aren't even unique within the category. There are multiple counties with FIPS 001, but only one per state. So, FIPS alone isn't enough as a unique identifier.

polkapolka commented 8 years ago

Ben, 5-digit FIPS codes are unique: 2-digit state code + 3-digit county code.
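A small Python sketch of that composition, in case it helps. The function names are made up for illustration. It also shows why the ids quoted earlier in this thread look like 6001 and 6075: the leading zero of California's state code 06 was dropped, so it has to be padded back before splitting.

```python
def county_fips(state_code, county_code):
    """Join a state code and a county code into a 5-digit county FIPS string."""
    return f"{int(state_code):02d}{int(county_code):03d}"


def split_county_fips(fips):
    """Split a county FIPS into (state, county), restoring a dropped leading zero."""
    padded = str(fips).zfill(5)
    if len(padded) != 5 or not padded.isdigit():
        raise ValueError(f"not a 5-digit county FIPS code: {fips!r}")
    return padded[:2], padded[2:]


print(county_fips(6, 1))          # 06001 (Alameda County, California)
print(split_county_fips("6075"))  # ('06', '075') (San Francisco County)
print(split_county_fips(6001))    # ('06', '001')
```

Storing the code as a zero-padded string rather than an integer avoids losing the leading zero in the first place.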


bcipolli commented 8 years ago

It'd be great to hear from others what problem we're trying to solve by using fips_id. After researching further this morning, it's not clear to me. Instead, it's very clear to me what challenges we'll face by using fips_id, and why we'd want to avoid making that a core part of this app.

The main benefit I see for a globally unique ID: Is there any front-end use-case where we don't know the locality type (e.g. city/county/state)? I don't see one.

So the main benefit would be so that we can store city/county/state voting and contribution data in the same tables, and avoid having different API endpoints or query functions (based on type of locality) for getting those info out to the front-end.

Why use FIPS ID? I see many challenges and literally zero benefits to doing so:

FIPS ID resolves to the county level. We certainly need city IDs. So... seems like a non-starter. For an external ID, ANSI place codes or Voting District codes seem more appropriate.
But why use a third party ID internally in the app at all? As far as I understand, there is nothing in our data that makes this easy, such as having a third-party ID that classifies data. Instead, this is a very hard problem; we either:
- Have to find a way to match text data to the external ID when pulling in code (which is a very hard problem), or
- Have to manually intervene to add these external codes when adding localities.

An alternative that sounds much more appealing to me is to use arbitrary internal IDs. This:

Allows us to create new places without trying to resolve any ID.
If our data require us to resolve places across rows (which I don't believe we ever have to do anyway), that matching should be easier (since a feed should be internally consistent about how it represents places) than trying to resolve it to an external ID.
Allows the core app to be agnostic about what type of voting regions are being represented (precinct/city/county/state/etc.)
Allows add-on apps to be used, instead, for resolving between internal IDs and external IDs. This is good for modularity, as well as good for iteratively building the app. If unnecessary to do this resolution for representing any of our data, we could simply offload this work to later versions of the app, rather than having to do this work at the beginning. Say, when we're trying to link our app to other apps, we simply add code that does the resolution of places, as defined in our data, to places defined externally.
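Here is a minimal Python sketch of what I mean, under stated assumptions: `Locality`, `LocalityRegistry`, and the `link`/`find` methods are hypothetical names, not existing code in this repo. Localities get an arbitrary internal id, and any external scheme (FIPS, ANSI place codes, etc.) lives in a separate mapping that can be filled in later.

```python
import itertools
from dataclasses import dataclass, field


@dataclass
class Locality:
    id: int    # arbitrary internal id, assigned by the registry
    name: str
    kind: str  # "city", "county", "state", "precinct", ...


@dataclass
class LocalityRegistry:
    localities: dict = field(default_factory=dict)
    # (scheme, code) -> internal id; optional add-on data, filled in later.
    external_ids: dict = field(default_factory=dict)
    _next_id: itertools.count = field(
        default_factory=lambda: itertools.count(1))

    def add(self, name, kind):
        """Create a locality without resolving any external id."""
        loc = Locality(next(self._next_id), name, kind)
        self.localities[loc.id] = loc
        return loc

    def link(self, locality, scheme, code):
        """Attach an external code (e.g. a FIPS code) to an existing locality."""
        self.external_ids[(scheme, code)] = locality.id

    def find(self, scheme, code):
        """Look up a locality by external code; None if it was never linked."""
        internal_id = self.external_ids.get((scheme, code))
        return self.localities.get(internal_id)


registry = LocalityRegistry()
sf_city = registry.add("san francisco", "city")
sf_county = registry.add("san francisco", "county")
registry.link(sf_county, "fips", "06075")

print(registry.find("fips", "06075"))
print(registry.find("fips", "99999"))
```

The city and the county of San Francisco get different internal ids even though they share a name, and only the county needs a FIPS link for this to work.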

So to summarize, I don't see the benefit of using fips_id in our code, but I see plenty of challenges. To push forward on the back-end design and development, it'd be really helpful to understand why that direction was chosen. If I missed it in a doc somewhere, or am forgetting something obvious, really, I apologize...

:tada: Happy New Year! :)

adborden commented 8 years ago

I don't really understand what this issue represents. What is this API endpoint going to be used for? The way I see it, there are two use cases involving locality (a geographic area):

Identify what Ballot, and the corresponding finance data to show to the user. Your Ballot (what you vote on) is determined by your voting district, and hence your address. In v1, we're not going more granular than city, so zipcode or city name would be acceptable keys. It would be nice if we didn't pick an Id that limited us to a city being the smallest granularity, but hey, it's v1.
Identify what Contributions are related to which Locality. This is so that we know if the money is related to your Ballot, or related to your Locality, or your metro area. Again, since we're not going deeper than city, zipcode which exists in the netfile data should be a good key.

@jnmarcus is there a different use case that you're looking for?

bcipolli commented 8 years ago

Zip codes to cities are a many-to-many relationship; you can't always pick a city from a zip code, nor vice versa. Zip codes are unaffiliated with any political zoning.

Regardless, we'll have to map some value onto a locality ID. My only point here is that fips_id is strictly worse than a simple arbitrary internal ID. As far as I see, zip code doesn't solve the problem either.

I think zip code is a great key for sending to an API (since users know it), but bad for back-end storage (for the reasons listed above).
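To illustrate the many-to-many point, a short Python sketch with made-up data (the zip codes and city names are placeholders, not real assignments). A zip lookup returns a list of candidate cities rather than a single answer, which is fine for an API query but awkward as a storage key.

```python
from collections import defaultdict

# Made-up pairs for illustration only; these are not real zip assignments.
ZIP_CITY_PAIRS = [
    ("00001", "city-a"),
    ("00002", "city-a"),
    ("00002", "city-b"),  # one zip code spanning two cities
    ("00003", "city-b"),
]

cities_by_zip = defaultdict(set)
zips_by_city = defaultdict(set)
for zip_code, city in ZIP_CITY_PAIRS:
    cities_by_zip[zip_code].add(city)
    zips_by_city[city].add(zip_code)


def localities_for_zip(zip_code):
    """Return every candidate city for a zip code, sorted for stable output."""
    return sorted(cities_by_zip.get(zip_code, ()))


print(localities_for_zip("00001"))     # ['city-a']
print(localities_for_zip("00002"))     # ['city-a', 'city-b']
print(sorted(zips_by_city["city-a"]))  # ['00001', '00002']
```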

adborden commented 8 years ago

@bcipolli agreed, I'm just describing a use case based on mapping zipcodes to localities. I'm not proposing anything about what to use as the primary identifier. BTW, the zipcode issue has been discussed on several occasions, I'll open an issue so we can track it better.

In fact, my point is more that there are multiple use cases, so going with an arbitrary Id would be preferred, but we'll still want additional fields to be able to map for different use cases.

What is useful about the fips, is that we have a standard and complete list of all city/county/state in the country (right?). As @jnmarcus mentioned above, I think we do want to have all the localities in the DB so we can display better messaging along the lines of "Sorry, the data for Fremont is incomplete, here's who you can call in your local government to change that."

jnmarcus commented 8 years ago

@bcipolli FIPS ids were suggested as an alternative for unique ids, when we were trying to come up with a way to deal with cross-referencing that didn't involve matching strings, as this a) could be memory/process intensive, and b) could result in false representations of the data.

Basically, a couple of the problems we were trying to solve were:

when you have a county with the same name as its main city, how can you identify which election data (city or county) you're talking about?
when you have multiple candidates (for example, on the county level), you run the risk of a repeat name. Even with candidate ids, how can you be certain you're talking about the right 'John Doe', for the right county?
what is the common denominator between ballot measures, candidates, and locale data?
what zip codes belong to what county/city?

I don't believe that fips ids will solve all of our problems, but I'm not sure creating our own is the way to go either. We would need some type of methodology to identify between state, county, and city, which is how we got on the topic of fips ids to begin with - essentially the idea was 'why recreate the wheel when numbers like this are already used in similar fashion, in a publicly recognized standard?' Additionally, I think it's definitely possible we may need more than one type of unique identifier, but from what I understand, those types of identifiers will ultimately go back to either a county or the state level...(I believe)

On the other hand, I read that fips ids are being retired...but they also still seem to be commonly used...so there's that :)

Here are a couple references that I found helpful:

Definitions of various Geo Codes: http://www.census.gov/geo/reference/geocodes.html - note the Legal/Statistical Area Description Codes...
Hierarchy Diagram of Geographic Entities: http://www2.census.gov/geo/pdfs/reference/geodiagram.pdf
ANSI Codes: http://www.census.gov/geo/reference/ansi.html

@adborden we're not going deeper than city data? I thought I'd heard the opposite, especially from @bcipolli and the San Diego team, who expressed interest in getting their county data up...

bcipolli commented 8 years ago

Since we have an internal id for localities now, mapping to FIPS can be pushed off a bit. It's great to provide front-end links with meaningful Ids (like FIPS), and to allow searches... but this no longer blocks general API development.

bcipolli commented 8 years ago

I don't think we need this for our demo.

caciviclab / disclosure-backend

Load FIPS Ids & their corresponding Names & Types into API #94