datamade / usaddress

:us: a python library for parsing unstructured United States address strings into address components
https://parserator.datamade.us/usaddress
MIT License
1.53k stars, 303 forks

StateName wrong detection #217

MD5AI closed this issue 6 years ago

MD5AI commented 6 years ago

Hello. I tested usaddress locally and found some interesting test cases. For example, here is one I generated:

Front Street North 695 Gilbert Z1 35904

Output: ('Front', 'StreetName'), ('Street', 'StreetNamePostType'), ('North', 'StreetNamePostDirectional'), ('695', 'OccupancyIdentifier'), ('Gilbert', 'PlaceName'), ('Z1', 'StateName'), ('35904', 'ZipCode')

Expected output: ('Front', 'StreetName'), ('Street', 'StreetNamePostType'), ('North', 'StreetNamePostDirectional'), ('695', 'OccupancyIdentifier'), ('Gilbert', 'PlaceName'), ('Z1', 'anything but StateName'), ('35904', 'ZipCode or anything else')
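One possible way to catch this case after parsing is to check every `StateName` token against a list of known state codes. The sketch below is only an illustration: `USPS_STATE_CODES` and `suspicious_states` are hypothetical names, not part of usaddress, and it assumes states are written as two-letter abbreviations (full names such as "Alabama" would need extra handling).

```python
# Hypothetical post-processing step; not part of usaddress itself.
# Assumption: StateName tokens are two-letter abbreviations
# (the 50 states plus DC; territories are left out for brevity).
USPS_STATE_CODES = frozenset("""
AL AK AZ AR CA CO CT DE DC FL GA HI ID IL IN IA KS KY LA ME MD MA MI MN MS MO
MT NE NV NH NJ NM NY NC ND OH OK OR PA RI SC SD TN TX UT VT VA WA WV WI WY
""".split())


def suspicious_states(tagged):
    """Return StateName tokens that are not known state codes."""
    return [
        token
        for token, label in tagged
        if label == "StateName"
        and token.strip(",.").upper() not in USPS_STATE_CODES
    ]


# Tagged output copied from the report above.
tagged = [
    ("Front", "StreetName"),
    ("Street", "StreetNamePostType"),
    ("North", "StreetNamePostDirectional"),
    ("695", "OccupancyIdentifier"),
    ("Gilbert", "PlaceName"),
    ("Z1", "StateName"),
    ("35904", "ZipCode"),
]

print(suspicious_states(tagged))  # ['Z1']
```

A check like this does not change what the model predicts; it only flags results that a caller may want to reject or review.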

Example with real-world data, an address in Italy (Sardinia, to be exact):

CMR 467 Box 7000 APO, AE 09096

Output: ('CMR', 'USPSBoxType'), ('467', 'USPSBoxID'), ('Box', 'USPSBoxType'), ('7000', 'USPSBoxID'), ('APO,', 'PlaceName'), ('AE', 'StateName'), ('09096', 'ZipCode')

By its format this address may be valid, but it is not a US address.
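The format here looks like US military mail, which puts APO, FPO, or DPO in the city position and AA, AE, or AP in the state position. If such addresses should be handled separately, a rough filter over the tagged output could look like the sketch below. `is_military_address` and the two constant sets are hypothetical helpers, not part of usaddress.

```python
# Hypothetical helper; not part of usaddress itself.
# Assumption: military mail has APO, FPO, or DPO tagged as PlaceName
# and AA, AE, or AP tagged as StateName, as in the example above.
MILITARY_CITIES = {"APO", "FPO", "DPO"}
MILITARY_STATES = {"AA", "AE", "AP"}


def is_military_address(tagged):
    """Return True if the tagged tokens look like a military mail address."""
    places = {
        token.strip(",.").upper()
        for token, label in tagged
        if label == "PlaceName"
    }
    states = {
        token.strip(",.").upper()
        for token, label in tagged
        if label == "StateName"
    }
    return bool(places & MILITARY_CITIES) and bool(states & MILITARY_STATES)


# Tagged output copied from the report above.
cmr_tagged = [
    ("CMR", "USPSBoxType"),
    ("467", "USPSBoxID"),
    ("Box", "USPSBoxType"),
    ("7000", "USPSBoxID"),
    ("APO,", "PlaceName"),
    ("AE", "StateName"),
    ("09096", "ZipCode"),
]

print(is_military_address(cmr_tagged))  # True
```

The caller can then decide whether to keep, drop, or route these addresses, since the physical location is outside the United States.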

To solve this problem, I generated many test cases where the parser fails. Should I train the model further on them?

jeancochrane commented 6 years ago

These are some interesting formats @DaniyarDS! Unfortunately, usaddress is limited in scope to addresses in the United States. If you've got a lot of international addresses to parse, you might have more success with pypostal, the Python bindings for libpostal.