codeforamerica / address-normalizer

A simple tool that takes a CSV file with addresses and normalizes them.
BSD 3-Clause "New" or "Revised" License

Implement new approach using external API for verification/gazetteering #1

Open daguar opened 11 years ago

daguar commented 11 years ago

Planned approach

  1. Tokenize all addresses
    • [If not already tokenized]
    • [May be out of scope for v1: the key data sets we’re looking at now are tokenized]
  2. Pre-process the street types (e.g., Ave. => Ave) to one canonical form
  3. Create list of unique street name + type combinations (with one real street # from data source per combination)
  4. Call external API (EasyPost?) on each entry in the unique combo list and save the tokenized output in new columns
  5. Outputs:
    • Original data set with appended, normalized addresses (per EasyPost filter)
    • Gazette of street name + type combos
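
The steps above could be sketched roughly as follows. This is a minimal illustration, not the project's implementation: the column names (`street_number`, `street_name`, `street_type`), the `STREET_TYPES` table, and the `verify` callable are all hypothetical. `verify` stands in for whichever external API ends up being used (EasyPost is only a candidate in the plan), so no real API call is shown.

```python
# Hypothetical variant -> canonical table; a real one would be far longer.
STREET_TYPES = {
    "av": "Ave", "ave": "Ave", "avenue": "Ave",
    "st": "St", "street": "St",
    "blvd": "Blvd", "boulevard": "Blvd",
}


def canonical_type(raw):
    """Step 2: map a street type such as 'Ave.' to one canonical form."""
    cleaned = raw.strip().rstrip(".")
    return STREET_TYPES.get(cleaned.lower(), cleaned)


def combo_key(row):
    """Key a tokenized row by (street name, canonical street type)."""
    return (row["street_name"].strip().upper(), canonical_type(row["street_type"]))


def unique_combos(rows):
    """Step 3: keep one real street number per unique name + type combination."""
    combos = {}
    for row in rows:
        combos.setdefault(combo_key(row), row["street_number"])
    return combos


def normalize(rows, verify):
    """Steps 4-5: verify each unique combo once, then append results to every row.

    `verify` is a placeholder for the external API call. It is assumed to
    take (number, name, type) and return a normalized (name, type) pair.
    """
    gazette = {
        key: verify(number, key[0], key[1])
        for key, number in unique_combos(rows).items()
    }
    appended = []
    for row in rows:
        name, street_type = gazette[combo_key(row)]
        appended.append(
            {**row, "norm_street_name": name, "norm_street_type": street_type}
        )
    return appended, gazette


# Made-up sample rows; the stub below only title-cases the name.
sample_rows = [
    {"street_number": "123", "street_name": "Main", "street_type": "St."},
    {"street_number": "456", "street_name": "main", "street_type": "Street"},
    {"street_number": "9", "street_name": "Oak", "street_type": "Ave."},
]


def stub_verify(number, name, street_type):
    return (name.title(), street_type)


appended_rows, gazette = normalize(sample_rows, stub_verify)
```

Because the API is called once per unique combination rather than once per row, the three sample rows here trigger only two `verify` calls, which is the point of step 3.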
daguar commented 11 years ago

Pinging @dthompson and @spara

Let's implement this new approach here, since some of the logic's already baked in.

daguar commented 11 years ago

Also pinging @bensheldon (who is interested) for great justice.

mholubowski commented 11 years ago

I'd be happy to take a shot at this as my first project of the summer (GSoC) - I'll be flying up to the office Wednesday morning. Quick questions for @daguar:

daguar commented 11 years ago

Great!

For data sets, I have a few good test cases:

  A. City restaurant inspection data sets (part of the LIVES data standard push; see: http://foodinspectiondata.us/ ). I'm tagging @dthompson @danavery @migurski as an FYI, since it could support some efforts we're doing to help make this data more integration-friendly.
  B. Cross-referencing two city data sets from an open data portal where we have a crosswalk (e.g., property ID) but where the address fields differ between the two data sets. This would allow us to benchmark this approach.
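
One way the benchmark in test case B might work, sketched under assumptions: each data set is reduced to a mapping from crosswalk ID to a single address string, and the score is the share of shared IDs whose addresses agree. The `match_rate` and `toy_normalize` names and the sample records are made up for illustration. `toy_normalize` is a stand-in for the real normalizer.

```python
def match_rate(dataset_a, dataset_b, normalize=lambda address: address):
    """Share of crosswalked IDs whose addresses agree after `normalize`."""
    shared_ids = dataset_a.keys() & dataset_b.keys()
    if not shared_ids:
        return 0.0
    hits = sum(
        1
        for record_id in shared_ids
        if normalize(dataset_a[record_id]) == normalize(dataset_b[record_id])
    )
    return hits / len(shared_ids)


def toy_normalize(address):
    """Stand-in normalizer: lowercase, drop periods, shorten 'avenue'."""
    words = address.lower().replace(".", "").split()
    return " ".join("ave" if word == "avenue" else word for word in words)


# Made-up records keyed by a hypothetical property ID.
permits = {"p1": "123 Main St.", "p2": "9 Oak Avenue", "p3": "5 Elm St"}
inspections = {"p1": "123 MAIN ST", "p2": "9 Oak Ave", "p3": "7 Elm St"}

raw_rate = match_rate(permits, inspections)
normalized_rate = match_rate(permits, inspections, toy_normalize)
```

Comparing the rate on raw addresses with the rate after normalization gives a single number for how much the approach helps. IDs whose addresses truly differ (such as `p3` here) should stay unmatched either way.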

In my ideal world, this would be available as both a web app (for less technical users) and a CLI tool for more heavy-duty use (in particular where the API calls might become an issue).