datamade / usaddress

:us: a python library for parsing unstructured United States address strings into address components
https://parserator.datamade.us/usaddress
MIT License

Repeated Label Error for valid addresses #117

Closed vamsiemani closed 7 years ago

vamsiemani commented 8 years ago

ORIGINAL STRING: 5875 Castle Creek Parkway North Dr Ste 285
PARSED TOKENS: [(u'5875', 'AddressNumber'), (u'Castle', 'StreetName'), (u'Creek', 'StreetName'), (u'Parkway', 'StreetNamePostType'), (u'North', 'StreetNamePostDirectional'), (u'Dr', 'StreetNamePostType'), (u'Ste', 'OccupancyType'), (u'285', 'OccupancyIdentifier')]
UNCERTAIN LABEL: StreetNamePostType

http://whitefinder.com/indianapolis-in/maps-financial-services-llc-3175775180.html

Another example, this time with a repeated OccupancyIdentifier:

ORIGINAL STRING: 1329 N Illinois Route 3, Ste 3, Waterloo, IL 62298
PARSED TOKENS: [(u'1329', 'AddressNumber'), (u'N', 'StreetNamePreDirectional'), (u'Illinois', 'StreetName'), (u'Route', 'StreetNamePostType'), (u'3,', 'OccupancyIdentifier'), (u'Ste', 'OccupancyType'), (u'3,', 'OccupancyIdentifier'), (u'Waterloo,', 'PlaceName'), (u'IL', 'StateName'), (u'62298', 'ZipCode')]
UNCERTAIN LABEL: OccupancyIdentifier

https://local.yahoo.com/info-85444392-sidebarr-technologies-waterloo?csz=Frohna%2C+MO&stx=Computer+Repair

A person's name confused with a directional keyword:

ORIGINAL STRING: 300 Frank W. Burr Boulevard Teaneck, New Jersey 07666
PARSED TOKENS: [(u'300', 'AddressNumber'), (u'Frank', 'StreetName'), (u'W.', 'StreetNamePostDirectional'), (u'Burr', 'StreetName'), (u'Boulevard', 'StreetNamePostType'), (u'Teaneck,', 'PlaceName'), (u'New', 'StateName'), (u'Jersey', 'StateName'), (u'07666', 'ZipCode')]
UNCERTAIN LABEL: StreetName

http://www.vision-institute.com/new-jersey/patient-information/directions.htm

Confusion caused by multiple SubaddressTypes:

ORIGINAL STRING: 3565 Piedmont Rd Driveway A Bldg 3 Ste 415
PARSED TOKENS: [(u'3565', 'AddressNumber'), (u'Piedmont', 'StreetName'), (u'Rd', 'StreetNamePostType'), (u'Driveway', 'SubaddressType'), (u'A', 'SubaddressIdentifier'), (u'Bldg', 'SubaddressType'), (u'3', 'SubaddressIdentifier'), (u'Ste', 'OccupancyType'), (u'415', 'OccupancyIdentifier')]
UNCERTAIN LABEL: SubaddressType

REWDevinMcBeth commented 8 years ago

ORIGINAL STRING: 5875 Castle Creek Parkway North Dr Ste 285
ORIGINAL STRING: 3565 Piedmont Rd Driveway A Bldg 3 Ste 415

These look like they are parsing correctly to me. The only one that may be wrong is this one:

ORIGINAL STRING: 300 Frank W. Burr Boulevard Teaneck, New Jersey 07666

Here the W. should be part of the street name, since the street is named after a person.

This one also looks wrong:

ORIGINAL STRING: 1329 N Illinois Route 3, Ste 3, Waterloo, IL 62298

The 3 in Route 3 shouldn't be labeled OccupancyIdentifier.

jeancochrane commented 7 years ago

Happy birthday to this issue! I agree with @REWDevinMcBeth: in the first two cases listed in that comment (the Castle Creek and Piedmont Rd addresses), parse succeeds where tag fails. This is because tag automatically tries to concatenate tokens that share a label into a single component, but those two addresses are complex enough to contain two separate groups with the same label (e.g. Driveway and Bldg are both SubaddressTypes, but they belong to distinct subaddresses that cannot be concatenated). In cases like these, you should either use parse or add some logic to your code to handle the exception (see the docs for more). Hope that makes sense!
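To make the concatenation problem concrete, here is a minimal sketch in plain Python. It is not usaddress's actual implementation, and the helper names `merge_runs` and `repeated_labels` are hypothetical. It only assumes input in the shape that parse returns, a list of (token, label) pairs, and shows how merging adjacent tokens with the same label still leaves two separate SubaddressType groups for the Piedmont Rd address.

```python
from itertools import groupby


def merge_runs(parsed):
    """Collapse adjacent tokens that share a label into one (label, text) segment.

    Hypothetical helper for illustration; not part of usaddress.
    """
    return [
        (label, " ".join(token for token, _ in run))
        for label, run in groupby(parsed, key=lambda pair: pair[1])
    ]


def repeated_labels(parsed):
    """Return labels that appear in more than one non-adjacent run, in order."""
    seen, repeated = set(), []
    for label, _ in merge_runs(parsed):
        if label in seen and label not in repeated:
            repeated.append(label)
        seen.add(label)
    return repeated


# Token/label pairs copied from the parse output reported in this issue.
parsed = [
    ("3565", "AddressNumber"),
    ("Piedmont", "StreetName"),
    ("Rd", "StreetNamePostType"),
    ("Driveway", "SubaddressType"),
    ("A", "SubaddressIdentifier"),
    ("Bldg", "SubaddressType"),
    ("3", "SubaddressIdentifier"),
    ("Ste", "OccupancyType"),
    ("415", "OccupancyIdentifier"),
]

segments = merge_runs(parsed)
conflicts = repeated_labels(parsed)

if conflicts:
    # A label-to-value mapping cannot hold both groups under one key,
    # so keep the ordered segments instead of collapsing them.
    print("Repeated labels:", conflicts)
    for label, text in segments:
        print(label, "->", text)
```

When no label repeats, the segments can safely be turned into a dictionary keyed by label; when one does, keeping the ordered list avoids silently merging two distinct subaddresses.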

Adding the failed addresses to our training data and closing this.