cfpb / grasshopper

CFPB's streaming batch geocoder
Creative Commons Zero v1.0 Universal

Parse unusual addresses #81

Open debseidner opened 9 years ago

debseidner commented 9 years ago

As a geocoding system, I need to be able to parse unusual addresses in the geocoder so I can find the (x,y) for addresses like 32 1/2 Main Street, 18-D Main Street, etc.

hkeeler commented 9 years ago

Starting to do a bit of deeper testing on the results of the address parser. usaddress has two parsing methods, "parse" (the default) and "tag" (which generally gives better results). The following addresses parse as follows:

  1. 32 1/2 Main Street (method: "parse" and "tag")

    "parts": {
       "AddressNumber": "32",
       "AddressNumberSuffix": "1/2",
       "StreetName": "Main",
       "StreetNamePostType": "Street"
    }

    Notes: These are the results we'd expect.

  2. 18-D Main Street (method: "parse" and "tag")

    "parts": {
       "AddressNumber": "18-D",
       "StreetName": "Main",
       "StreetNamePostType": "Street"
    }

    Notes: Not as good. It would be more consistent if "AddressNumber": "18" and "AddressNumberSuffix": "D". Not sure if this poses a problem for us if "18-D" is our address number.

  3. 18 D Main Street (method: "parse")

    "parts": {
       "AddressNumber": "18",
       "StreetName": "Main", 
       "StreetNamePostType": "Street"
    }

    Notes: D is dropped altogether, not making it into any of the address parts. This is a bit of a surprise.

  4. 18 D Main Street (method: "tag")

    "parts": {
       "AddressNumber": "18", 
       "StreetName": "D Main", 
       "StreetNamePostType": "Street"
    }

    Notes: D Main as StreetName? That seems like a real problem. This is especially surprising considering "tag" generally provides better results than "parse".

Curious to get everyone's take on this. Is it time to start learning more about how to "train" the parser?
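Since "18-D" comes back as a single AddressNumber while "32 1/2" splits cleanly, one workaround short of retraining might be a post-processing step. This is purely a hypothetical sketch (the function name and dict shape are assumptions, not part of grasshopper or usaddress):

```python
import re

# Hypothetical post-processing step: split values like "18-D" into a
# numeric AddressNumber and an alphabetic AddressNumberSuffix, so the
# parts line up with how "32 1/2" already parses.
HYPHEN_SUFFIX = re.compile(r"^(\d+)-([A-Za-z]\w*)$")

def split_address_number(parts):
    """Return a copy of the parsed parts with hyphenated suffixes split out."""
    parts = dict(parts)
    match = HYPHEN_SUFFIX.match(parts.get("AddressNumber", ""))
    if match:
        parts["AddressNumber"] = match.group(1)
        parts["AddressNumberSuffix"] = match.group(2)
    return parts
```

With this, `split_address_number({"AddressNumber": "18-D", "StreetName": "Main"})` would yield `AddressNumber: "18"` plus `AddressNumberSuffix: "D"`, matching case 1 above.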

hkeeler commented 9 years ago

Rural routes were brought up at today's standup. Below are a few addresses given as examples in the Postal Service's Pub 28 section on Rural Route Addresses:

| Original | Type | USPSBoxGroupType | USPSBoxGroupID | USPSBoxType | USPSBoxID |
|----------|------|------------------|----------------|-------------|-----------|
| RR+2+BOX+152 | PO Box | RR | 2 | BOX | 152 |
| RR+9+BOX+23A | PO Box | RR | 9 | BOX | 23A |
| RR03+BOX+98D | PO Box | RR03 | | BOX | 98D |
| RR+4+BOX+19-1A | PO Box | RR | 4 | BOX | 19-1A |

Overall, this looks pretty good. The parser clearly doesn't split up USPSBoxID if there are sub-parts, but maybe that's fine. The leading-zero address (RR03), which doesn't parse correctly, is considered "Acceptable", but not "Preferred".

There are other "Incorrect" formats such as Designations RFD and RD and Additional Designations. These formats are not parsed well, but I suppose that is to be expected considering the USPS doesn't consider them to be valid in the first place.
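Since Pub 28's preferred form drops the leading zero from the route number, one option might be normalizing before parsing. A hedged sketch (the function name and regex are my own assumptions, not an existing API):

```python
import re

# Hypothetical pre-normalization: rewrite leading-zero rural route and
# highway contract designators (e.g. "RR03") into the Pub 28 preferred
# form ("RR 3") before handing the address to the parser.
LEADING_ZERO_ROUTE = re.compile(r"^(RR|HC)\s*0*(\d+)\b", re.IGNORECASE)

def normalize_route(address):
    return LEADING_ZERO_ROUTE.sub(
        lambda m: f"{m.group(1).upper()} {int(m.group(2))}", address
    )
```

Addresses already in the preferred form (`RR 2 BOX 152`) would pass through unchanged, while `RR03 BOX 98D` would become `RR 3 BOX 98D`.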

hkeeler commented 9 years ago

Also, if anyone else would like to test the parser's default behavior, usaddress has a site for single and bulk address parsing.

debseidner commented 9 years ago

@hkeeler In addition to RR, I also found that there are highway contract routes or star routes (HC 68 BOX 23A or HC 68 BOX 19-2B). We should also take into account suites for business addresses (12 E MAIN AVE STE 209) and addresses like those in DC with the directional after the street (1275 First Street NE).

hkeeler commented 9 years ago

Similar to RR, the HC addresses parse as follows:

| Original | Type | USPSBoxGroupType | USPSBoxGroupID | USPSBoxType | USPSBoxID |
|----------|------|------------------|----------------|-------------|-----------|
| HC+68+BOX+23A | PO Box | HC | 68 | BOX | 23A |
| HC+68+BOX+19-2B | PO Box | HC | 68 | BOX | 19-2B |

Suites and DC address in their standard form also parse as expected:

| Original | Type | AddressNumber | StreetNamePreDirectional | StreetName | StreetNamePostType | OccupancyType | OccupancyIdentifier |
|----------|------|---------------|--------------------------|------------|--------------------|---------------|---------------------|
| 12+E+MAIN+AVE+STE+209 | Street Address | 12 | E | MAIN | AVE | STE | 209 |

| Original | Type | AddressNumber | StreetName | StreetNamePostType | StreetNamePostDirectional |
|----------|------|---------------|------------|--------------------|---------------------------|
| 1275+First+Street+NE | Street Address | 1275 | First | Street | NE |

It's interesting that the E in the suite address is considered a StreetNamePreDirectional, while the DC address's NE is a StreetNamePostDirectional.

hkeeler commented 9 years ago

I've been playing with training the parser a bit. It's pretty easy. The instructions are laid out pretty well here:

I've been able to add to the training data (defaults to training/labeled.xml) and see the parse results change. It's also interesting that they have other training files available in the [training]() directory. It's not yet clear if the packaged version of the library uses just labeled.xml, all of the files, or some combination. I suspect it's not all of them, since one of the files fails to load when I try to train with it.

Another interesting discovery is that it doesn't seem possible to train the parser to parse addresses with concatenated parts, such as:

RR03 BOX 98D

This is due to the parsers hard-coded set of tokens used to split the address parts. To get around this, we (or they) would need to change the tokenize function to include other tokens. I'm not quite ready to go down this road just yet.
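To illustrate the workaround, here's a hypothetical pre-tokenizer (name and pattern are my own, not usaddress's) that inserts a space between a glued-together designator and its digits before the address ever reaches the parser, leaving the library's hard-coded tokenize function untouched:

```python
import re

# Hypothetical workaround for concatenated parts like "RR03": insert a
# space between an alphabetic designator (two or more letters) and the
# digits glued onto it. Runs *before* the parser, so usaddress's own
# tokenize() does not need to change.
CONCATENATED = re.compile(r"\b([A-Za-z]{2,})(\d+)\b")

def pre_tokenize(address):
    return CONCATENATED.sub(r"\1 \2", address)
```

`pre_tokenize("RR03 BOX 98D")` would produce `"RR 03 BOX 98D"`, while already-separated addresses and trailing-letter box IDs like `98D` pass through unchanged.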

And finally, I did a bit of testing for Salt Lake City with their unique addressing scheme.

>>> import usaddress
>>> address = '5268 S 2200 E # 12'
>>> usaddress.tag(address)
(OrderedDict([
 ('AddressNumber', u'5268'), 
 ('StreetNamePreDirectional', u'S'), 
 ('StreetName', u'2200'), 
 ('StreetNamePostDirectional', u'E'), 
 ('OccupancyIdentifier', u'# 12')]),
 'Street Address')

According to the SLC's example from their Standardization page, this address should parse to:

These don't exactly line up, and I'm not exactly sure how these address parts could be made to align with usaddress's address parts. Perhaps this is good enough? Thoughts?
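One possible reconciliation, purely as a sketch (the function and heuristic are assumptions, not anything usaddress provides): in the SLC grid scheme the "street name" is itself a coordinate, so a numeric StreetName flanked by directionals is a reasonable telltale for grid-style addresses.

```python
# Hypothetical check for Salt Lake City-style grid addresses: a numeric
# StreetName plus pre- and post-directionals suggests the "street" is
# really a grid coordinate (e.g. "2200" in "5268 S 2200 E").
def is_grid_address(parts):
    return (
        parts.get("StreetName", "").isdigit()
        and "StreetNamePreDirectional" in parts
        and "StreetNamePostDirectional" in parts
    )

# The tag() result from above, as a plain dict.
slc = {
    "AddressNumber": "5268",
    "StreetNamePreDirectional": "S",
    "StreetName": "2200",
    "StreetNamePostDirectional": "E",
    "OccupancyIdentifier": "# 12",
}
```

Downstream code could then branch on `is_grid_address(...)` and interpret the parts as grid coordinates rather than a conventional street name.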

hkeeler commented 9 years ago

Since the question came up at today's standup, usaddress's address parts are based on the United States Thoroughfare, Landmark, and Postal Address Data Standard. This standard does reference USPS Pub 28 throughout, so the two are not mutually exclusive. One of their objectives is:

> Build on USPS Publication 28, the Census Bureau TIGER files, the FGDC Content Standard for Digital Geospatial Metadata, the FGDC's National Spatial Data Infrastructure (NSDI) Framework Data Content Standard, and previous FGDC address standard efforts.

hkeeler commented 9 years ago

There were recently questions about how the parser handles Puerto Rico and overseas military addresses. By default, usaddress behaves as follows:

Military

Below are the sample addresses taken from Pub 28 Military Addresses. The results are...not so good.

| Original | Type | OccupancyType | OccupancyIdentifier | USPSBoxType | USPSBoxID | PlaceName | StateName | ZipCode | Unable to parse | LandmarkName |
|----------|------|---------------|---------------------|-------------|-----------|-----------|-----------|---------|-----------------|--------------|
| unit+2050+box+4190+apo+ap+96278-2050 | PO Box | unit | 2050 | box | 4190 | apo | ap | 96278-2050 | | |
| psc+802+box+74+apo+ae+09499-2050 | Unparsed | | | | | | | | psc 802 box 74 apo ae 09499-2050 | |
| uscgc+hamilton+fpo+ap+96667-3931 | Ambiguous | | | | | fpo | ap | 96667-3931 | | uscgc hamilton |

Puerto Rico

Below are sample addresses taken from Pub 28 Puerto Rico Addresses. They clearly parse much better than the military addresses. The one at the end that does not parse is considered an "exception".

> Certain condominiums are not located on a named street or have an assigned number to the building. The name of the condominium is substituted for the street name.

| Original | Type | AddressNumber | StreetName | OccupancyType | OccupancyIdentifier | PlaceName | StateName | ZipCode | Recipient | StreetNamePostType | Unable to parse |
|----------|------|---------------|------------|---------------|---------------------|-----------|-----------|---------|-----------|--------------------|-----------------|
| 1234+ave+ashford+apt+1a+san+juan+pr+00907-1021 | Street Address | 1234 | ave ashford | apt | 1a | san juan | pr | 00907-1021 | | | |
| 1230+calle+amapolas+apt+103+carolina+pr+00979-1126 | Street Address | 1230 | calle amapolas | apt | 103 | carolina | pr | 00979-1126 | | | |
| 1234+urb+los+olmos+ponce+pr+00731-1235 | Street Address | 1234 | urb los olmos | | | ponce | pr | 00731-1235 | | | |
| urb+las+gladiolas+150+calle+a+san+juan+pr+00926-0221 | Street Address | 150 | calle a | | | san juan | pr | 00926-0221 | urb las gladiolas | | |
| 1234+calle+aurora+mayagues+pr+00680-1234 | Street Address | 1234 | calle | | | mayagues | pr | 00680-1234 | | aurora | |
| res+las+margaritas+edif+1+apt+104+caguas+pr+00725-1103 | Unparsed | | | | | | | | | | res las margaritas edif 1 apt 104 caguas pr 00725-1103 |

...and the good news is that these types of addresses seem very "trainable", so if we do feel like we need to improve these results, we have options.
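If we do go the training route, it would help to measure whether retraining actually improves things. A hedged sketch of an evaluation harness (the helper and stub are hypothetical; the sample parses are taken from earlier in this thread):

```python
# Hypothetical evaluation helper: compare a parser's output against
# hand-labeled expected parts and report the fraction of addresses
# parsed exactly right, so we can tell if retraining moved the needle.
def score(parse_fn, labeled_examples):
    hits = sum(1 for address, expected in labeled_examples if parse_fn(address) == expected)
    return hits / len(labeled_examples)

# Stub standing in for the real parser, hard-coded with the "tag"
# results observed earlier in this thread.
def stub_parse(address):
    known = {
        "32 1/2 Main Street": {
            "AddressNumber": "32",
            "AddressNumberSuffix": "1/2",
            "StreetName": "Main",
            "StreetNamePostType": "Street",
        },
        "18 D Main Street": {
            "AddressNumber": "18",
            "StreetName": "D Main",
            "StreetNamePostType": "Street",
        },
    }
    return known.get(address, {})

# Hand-labeled expectations: the second entry encodes the parse we
# *want* ("D" as AddressNumberSuffix), which the stub gets wrong.
labeled = [
    ("32 1/2 Main Street", {
        "AddressNumber": "32",
        "AddressNumberSuffix": "1/2",
        "StreetName": "Main",
        "StreetNamePostType": "Street",
    }),
    ("18 D Main Street", {
        "AddressNumber": "18",
        "AddressNumberSuffix": "D",
        "StreetName": "Main",
        "StreetNamePostType": "Street",
    }),
]
```

Running `score(stub_parse, labeled)` over the real parser before and after retraining would give us a simple baseline-versus-trained comparison.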