shahin closed this issue 9 years ago
The training data in the repo is newer than what's behind the API. We just added some pretty complicated address examples, and it looks like we have some regressions. We'll smooth it out today.
hey @shahin!
I've updated the training data. The updated version is in the master branch, as well as on PyPI (version 0.2.8). I checked some addresses on my end, and it looks like it's performing better. Do you want to check whether it's performing better on your addresses, relative to the API? The API is still using version 0.2.4.
If you're still getting better performance with the API, can you send over some examples of addresses that the current usaddress parser is failing on?
Thanks @cathydeng. I'm getting great results now, but for an interesting reason:
It looks like most issues in my sample are due to upper/lower case handling in 0.2.8/7 vs 0.2.4. I've pasted some examples below on 0.2.8 where cased strings do better than uncased strings. In 0.2.4, uncased strings get the same (good) results as cased strings.
In 0.2.8:
In [2]: usaddress.parse(u'140 Metro park suite a rochester ny 14623')
Out[2]:
[(u'140', 'AddressNumber'),
(u'Metro', 'StreetName'),
(u'park', 'StreetName'),
(u'suite', 'OccupancyType'),
(u'a', 'OccupancyIdentifier'),
(u'rochester', 'PlaceName'),
(u'ny', 'StateName'),
(u'14623', 'ZipCode')]
In [3]: usaddress.parse(u'140 Metro park suite a rochester ny 14623'.lower())
Out[3]:
[(u'140', 'OccupancyIdentifier'),
(u'metro', 'BuildingName'),
(u'park', 'BuildingName'),
(u'suite', 'OccupancyType'),
(u'a', 'OccupancyIdentifier'),
(u'rochester', 'PlaceName'),
(u'ny', 'StateName'),
(u'14623', 'ZipCode')]
In [4]: usaddress.parse(u'1101 Teller Rd 1 Cripple Creek CO 80813')
Out[4]:
[(u'1101', 'AddressNumber'),
(u'Teller', 'StreetName'),
(u'Rd', 'StreetNamePostType'),
(u'1', 'OccupancyIdentifier'),
(u'Cripple', 'PlaceName'),
(u'Creek', 'PlaceName'),
(u'CO', 'StateName'),
(u'80813', 'ZipCode')]
In [5]: usaddress.parse(u'1101 Teller Rd 1 Cripple Creek CO 80813'.lower())
Out[5]:
[(u'1101', 'AddressNumber'),
(u'teller', 'StreetName'),
(u'rd', 'StreetNamePostType'),
(u'1', 'BuildingName'),
(u'cripple', 'BuildingName'),
(u'creek', 'PlaceName'),
(u'co', 'StateName'),
(u'80813', 'ZipCode')]
In [6]: usaddress.parse(u'17030 Lakeside Hills Plz suite 110 Omaha NE 68130')
Out[6]:
[(u'17030', 'AddressNumber'),
(u'Lakeside', 'StreetName'),
(u'Hills', 'StreetName'),
(u'Plz', 'StreetNamePostType'),
(u'suite', 'OccupancyType'),
(u'110', 'OccupancyIdentifier'),
(u'Omaha', 'PlaceName'),
(u'NE', 'StateName'),
(u'68130', 'ZipCode')]
In [7]: usaddress.parse(u'17030 Lakeside Hills Plz suite 110 Omaha NE 68130'.lower())
Out[7]:
[(u'17030', 'BuildingName'),
(u'lakeside', 'BuildingName'),
(u'hills', 'BuildingName'),
(u'plz', 'BuildingName'),
(u'suite', 'OccupancyType'),
(u'110', 'OccupancyIdentifier'),
(u'omaha', 'PlaceName'),
(u'ne', 'StateName'),
(u'68130', 'ZipCode')]
In [8]: usaddress.parse(u'12800 Bothell Everett Hwy suite 120 Everett WA 98208')
Out[8]:
[(u'12800', 'AddressNumber'),
(u'Bothell', 'StreetName'),
(u'Everett', 'StreetName'),
(u'Hwy', 'StreetNamePostType'),
(u'suite', 'OccupancyType'),
(u'120', 'OccupancyIdentifier'),
(u'Everett', 'PlaceName'),
(u'WA', 'StateName'),
(u'98208', 'ZipCode')]
In [9]: usaddress.parse(u'12800 Bothell Everett Hwy suite 120 Everett WA 98208'.lower())
Out[9]:
[(u'12800', 'BuildingName'),
(u'bothell', 'BuildingName'),
(u'everett', 'BuildingName'),
(u'hwy', 'BuildingName'),
(u'suite', 'OccupancyType'),
(u'120', 'OccupancyIdentifier'),
(u'everett', 'PlaceName'),
(u'wa', 'StateName'),
(u'98208', 'ZipCode')]
In [10]: usaddress.parse(u'223 Chief Justice Cushing Hwy Cohasset MA 02025')
Out[10]:
[(u'223', 'AddressNumber'),
(u'Chief', 'StreetName'),
(u'Justice', 'StreetName'),
(u'Cushing', 'StreetName'),
(u'Hwy', 'StreetNamePostType'),
(u'Cohasset', 'PlaceName'),
(u'MA', 'StateName'),
(u'02025', 'ZipCode')]
In [11]: usaddress.parse(u'223 Chief Justice Cushing Hwy Cohasset MA 02025'.lower())
Out[11]:
[(u'223', 'BuildingName'),
(u'chief', 'BuildingName'),
(u'justice', 'BuildingName'),
(u'cushing', 'BuildingName'),
(u'hwy', 'BuildingName'),
(u'cohasset', 'PlaceName'),
(u'ma', 'StateName'),
(u'02025', 'ZipCode')]
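For anyone who wants to check their own sample for the same case sensitivity, here is a minimal sketch of the comparison done by hand above. The helper name `case_sensitive_labels` is hypothetical, not part of usaddress. It takes any parse function that returns a list of (token, label) pairs, and it assumes that lowercasing the input does not change how the string is split into tokens.

```python
def case_sensitive_labels(parse, address):
    """Return (token, cased_label, lowercased_label) for every token
    whose label changes when the whole address is lowercased.

    `parse` is any callable returning a list of (token, label) pairs.
    Assumes both inputs tokenize into the same number of tokens.
    """
    cased = parse(address)
    lowered = parse(address.lower())
    diffs = []
    # Walk the two label sequences in parallel and keep the mismatches.
    for (token, label), (_, lowered_label) in zip(cased, lowered):
        if label != lowered_label:
            diffs.append((token, label, lowered_label))
    return diffs
```

With usaddress installed, passing `usaddress.parse` as the first argument and one of the addresses above as the second should list the tokens whose labels flip, for example the leading street number and street name tokens in the 0.2.8 outputs shown here. An empty list means the cased and lowercased strings were labeled identically.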
Thanks for a great tool. I'm puzzled, though -- results from the API look different from local results using the PyPI package or c2a0fe. Local results:
API results:
From source, I trained locally using:
One test fails:
Results from source were the same as from PyPI. Results from the API look generally much better and I'd love to replicate them locally.
Is the package or training data behind the API newer than in the repo, or am I doing something wrong? Thanks!