CityOfNewYork / CROL-Overview

City Record Online parsing libraries and supporting files

Finish an end-to-end MVP parsing library #36

Open mikaelmh opened 9 years ago

mikaelmh commented 9 years ago

The purpose of this is to give DCAS and DOITT an example of the parsing structure, in order to prepare for later development.

mattalhonte commented 9 years ago

So, here's a notebook where I wrote a pretty barebones (but modular!) parser to rip out sections of an ad that contain an address. http://nbviewer.ipython.org/gist/anonymous/4185a30818e315fdd720

Addresses are definitely not entered in a standard way (or if they are, it's agency-by-agency). Sometimes they're formatted as proper mailing addresses (these are the best! They always end with "NY <5-digit number>"), sometimes it'll be "Borough of Brooklyn" instead of "Brooklyn, NY 46992" which is significantly tougher.

First, I messed around with some simple regex that I figured would approximate addresses. Then I found that was a little too inflexible, given how heterogeneous our formats are. The only feature common to all of them was that, at some point, they have a number followed by a space followed by a letter - which was slightly too permissive a standard.
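For reference, a pattern along the lines described above (the exact regex from the notebook may differ) shows why it is too permissive:

```python
import re

# A number, a space, then a letter - the one feature shared by every
# address format in the data set, but also by plenty of non-address text.
NUMBER_THEN_WORD = re.compile(r"\d+\s+[A-Za-z]")

# Matches a real address...
print(bool(NUMBER_THEN_WORD.search("Bids open at 1 Centre Street, New York, NY 10007")))  # True
# ...but also a false positive with no address in it.
print(bool(NUMBER_THEN_WORD.search("a contract worth 5 million dollars")))  # True
```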

Then, I did a pretty standard NLP move - I found what seemed like the common elements to the addresses in our data set, then encoded them as a bunch of regex features, then put them in a list (or a "feature vector", as the kids call it). The code then takes a string, checks how many of the address features are present, and gives that string a score.

Right now, given an ad, it checks each sentence and puts it in a special list if it has address features. Interestingly, a sentence with just one address feature contains an address pretty consistently - though I wrote the function with a tweakable threshold for the future.
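A minimal sketch of that workflow - the feature patterns here are illustrative stand-ins, not the actual ones from the notebook, and the sentence splitter is a naive regex:

```python
import re

# Hypothetical address features, encoded as regexes. The real feature
# vector lives in the linked notebook; these just mirror the formats
# described above ("NY <5-digit>", "Borough of ...", street/room lines).
ADDRESS_FEATURES = [
    re.compile(r"\bNY\s+\d{5}\b"),
    re.compile(r"\bBorough of (Brooklyn|Queens|Manhattan|Staten Island|the Bronx)\b"),
    re.compile(r"\b\d+\s+\w+\s+(Street|St\.?|Avenue|Ave\.?|Boulevard|Blvd\.?|Road|Rd\.?)\b", re.I),
    re.compile(r"\b\d+(st|nd|rd|th)?\s+(Floor|Fl\.?)\b", re.I),
]

def address_score(sentence):
    """Count how many address features fire on a sentence."""
    return sum(1 for feat in ADDRESS_FEATURES if feat.search(sentence))

def sentences_with_addresses(ad_text, threshold=1):
    """Split an ad into sentences, keep those at or above the threshold."""
    sentences = re.split(r"(?<=[.!?])\s+", ad_text)
    return [s for s in sentences if address_score(s) >= threshold]

ad = ("Sealed bids will be opened publicly. "
      "Submit proposals to 253 Broadway, 9th Floor, New York, NY 10007.")
print(sentences_with_addresses(ad))
# Only the second sentence scores >= 1, so only it is kept.
```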

No proper ML at the moment. Hidden Step 0 to any ML task is to make sure you can't do it in a more reliable, deterministic way. But, making feature vectors would definitely be step 1 to making a parser that learns. At the moment, the workflow is "play around with data of interest, eyeball it, then hand-craft your classifier" - but this is the foundation of making something that can learn.

cds-amal commented 9 years ago

@mattalhonte - Good stuff! I like your approach; it will fit nicely into our pipeline ( parser1 | parser2 | parser3 ). Gonna definitely play with this. Can you check if http://regexlib.com/Search.aspx?k=street has any regex patterns we could use? There may be some things we can "borrow" from their effort.

mikaelmh commented 9 years ago

@mattalhonte Yes, great stuff! The python notebook is very clear. Not sure, but there might be some interesting structure we can get from these links as well for addresses: http://cliff.mediameter.org/ & https://github.com/Berico-Technologies

cds-amal commented 9 years ago

@mattalhonte - usaddress is a python library for parsing unstructured address strings into address components, using advanced NLP methods.