datamade / usaddress

A Python library for parsing unstructured United States address strings into address components
https://parserator.datamade.us/usaddress
MIT License

API results different from local #72

Closed shahin closed 9 years ago

shahin commented 9 years ago

Thanks for a great tool. I'm puzzled, though -- results from the API look different from local results using the PyPI package or c2a0fe. Local results:

In [1]: import usaddress

In [2]: usaddress.parse('140 metro park suite a rochester ny 14623')
Out[2]: 
[('140', 'OccupancyIdentifier'),
 ('metro', 'BuildingName'),
 ('park', 'BuildingName'),
 ('suite', 'OccupancyType'),
 ('a', 'OccupancyIdentifier'),
 ('rochester', 'PlaceName'),
 ('ny', 'StateName'),
 ('14623', 'ZipCode')]

API results:

{
    "input-address": "140 metro park suite a rochester ny 14623",
    "result": [
        {
            "value": "140",
            "tag": "AddressNumber"
        },
        {
            "value": "metro",
            "tag": "StreetName"
        },
        {
            "value": "park",
            "tag": "StreetName"
        },
        {
            "value": "suite",
            "tag": "OccupancyType"
        },
        {
            "value": "a",
            "tag": "OccupancyIdentifier"
        },
        {
            "value": "rochester",
            "tag": "PlaceName"
        },
        {
            "value": "ny",
            "tag": "StateName"
        },
        {
            "value": "14623",
            "tag": "ZipCode"
        }
    ]
}
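To make the disagreement between the two outputs easier to see, here is a minimal sketch that lines up the local `usaddress.parse` tuples with the API's JSON `result` list and prints only the tokens whose tags differ. The helper name `diff_tags` is hypothetical (it is not part of usaddress or the API); the data is copied from the two outputs above.

```python
import json


def diff_tags(local_result, api_response):
    """Return (token, local_tag, api_tag) for each token the two sources tag differently."""
    api_pairs = [(item["value"], item["tag"]) for item in api_response["result"]]
    if len(local_result) != len(api_pairs):
        raise ValueError("token counts differ; results cannot be aligned")
    diffs = []
    for (token, local_tag), (api_token, api_tag) in zip(local_result, api_pairs):
        if token != api_token:
            raise ValueError("token mismatch: %r vs %r" % (token, api_token))
        if local_tag != api_tag:
            diffs.append((token, local_tag, api_tag))
    return diffs


# Local output, copied from the IPython session above.
local = [
    ("140", "OccupancyIdentifier"), ("metro", "BuildingName"),
    ("park", "BuildingName"), ("suite", "OccupancyType"),
    ("a", "OccupancyIdentifier"), ("rochester", "PlaceName"),
    ("ny", "StateName"), ("14623", "ZipCode"),
]

# API response, copied from above and compacted.
api = json.loads("""
{"input-address": "140 metro park suite a rochester ny 14623",
 "result": [
  {"value": "140", "tag": "AddressNumber"},
  {"value": "metro", "tag": "StreetName"},
  {"value": "park", "tag": "StreetName"},
  {"value": "suite", "tag": "OccupancyType"},
  {"value": "a", "tag": "OccupancyIdentifier"},
  {"value": "rochester", "tag": "PlaceName"},
  {"value": "ny", "tag": "StateName"},
  {"value": "14623", "tag": "ZipCode"}]}
""")

for token, local_tag, api_tag in diff_tags(local, api):
    print("%-10s local=%-20s api=%s" % (token, local_tag, api_tag))
```

For this address, only the first three tokens (`140`, `metro`, `park`) are tagged differently; everything from `suite` onward matches.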

From source, I trained locally using:

$ parserator train training/labeled.xml usaddress
training model on 953 training examples
...

One test fails:

FAIL: test_labeling.TestSyntheticAddresses.test_synthetic_addresses('550 West Van Buren Street, 60661', ('AddressNumber', 'StreetName', 'StreetName', 'StreetName', 'StreetNamePostType', 'ZipCode'), ('AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetName', 'StreetNamePostType', 'ZipCode'))
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/ssaneinejad/.virtualenvs/usaddress/lib/python2.7/site-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/Users/ssaneinejad/Projects/usaddress/tests/test_labeling.py", line 44, in equals
    assert labels_pred == labels_true
AssertionError: 
-------------------- >> begin captured stdout << ---------------------
ADDRESS:     550 West Van Buren Street, 60661
fuzzy pred:  ('AddressNumber', 'StreetName', 'StreetName', 'StreetName', 'StreetNamePostType', 'ZipCode')
true:        ('AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetName', 'StreetNamePostType', 'ZipCode')
--------------------- >> end captured stdout << ----------------------

Results from source were the same as from PyPI. Results from the API look generally much better and I'd love to replicate them locally.

Is the package or training data behind the API newer than in the repo, or am I doing something wrong? Thanks!

fgregg commented 9 years ago

The training data in the repo is newer than what's behind the API. We just added some pretty complicated address examples, and it looks like we have some regressions. We'll smooth it out today.

cathydeng commented 9 years ago

hey @shahin!

I've updated the training data - the updated version is in the master branch, as well as on PyPI (version 0.2.8). I checked some addresses on my end, and it looks like it's performing better. Do you want to check if it's performing better on your addresses, relative to the API? The API is still using version 0.2.4.

If you're still getting better performance with the API, can you send over some examples of addresses that the current usaddress parser is failing on?
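Since the comparison here depends on which release is installed locally (0.2.8 on PyPI versus 0.2.4 behind the API), a quick version check can rule out a stale install. This is a minimal sketch, assuming Python 3.8+ for `importlib.metadata`; the helper names `installed_version` and `version_tuple` are hypothetical, and `version_tuple` only handles plain dotted numeric versions such as the ones mentioned in this thread.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def version_tuple(version_string):
    """Turn a plain dotted version like '0.2.8' into (0, 2, 8) for comparison."""
    return tuple(int(part) for part in version_string.split("."))


local_version = installed_version("usaddress")
if local_version is None:
    print("usaddress is not installed in this environment")
else:
    print("installed usaddress version:", local_version)

# The two releases mentioned in this thread compare as expected.
print(version_tuple("0.2.8") > version_tuple("0.2.4"))
```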

shahin commented 9 years ago

Thanks @cathydeng. I'm getting great results now, but for an interesting reason:

It looks like most issues in my sample are due to upper/lower case handling in 0.2.7 and 0.2.8 vs 0.2.4. I've pasted some examples below on 0.2.8 where cased strings do better than uncased strings. In 0.2.4, uncased strings get the same (good) results as cased strings.

In 0.2.8:

In [2]: usaddress.parse(u'140 Metro park suite a rochester ny 14623')
Out[2]: 
[(u'140', 'AddressNumber'),
 (u'Metro', 'StreetName'),
 (u'park', 'StreetName'),
 (u'suite', 'OccupancyType'),
 (u'a', 'OccupancyIdentifier'),
 (u'rochester', 'PlaceName'),
 (u'ny', 'StateName'),
 (u'14623', 'ZipCode')]

In [3]: usaddress.parse(u'140 Metro park suite a rochester ny 14623'.lower())
Out[3]: 
[(u'140', 'OccupancyIdentifier'),
 (u'metro', 'BuildingName'),
 (u'park', 'BuildingName'),
 (u'suite', 'OccupancyType'),
 (u'a', 'OccupancyIdentifier'),
 (u'rochester', 'PlaceName'),
 (u'ny', 'StateName'),
 (u'14623', 'ZipCode')]

In [4]: usaddress.parse(u'1101 Teller Rd 1 Cripple Creek CO 80813')
Out[4]: 
[(u'1101', 'AddressNumber'),
 (u'Teller', 'StreetName'),
 (u'Rd', 'StreetNamePostType'),
 (u'1', 'OccupancyIdentifier'),
 (u'Cripple', 'PlaceName'),
 (u'Creek', 'PlaceName'),
 (u'CO', 'StateName'),
 (u'80813', 'ZipCode')]

In [5]: usaddress.parse(u'1101 Teller Rd 1 Cripple Creek CO 80813'.lower())
Out[5]: 
[(u'1101', 'AddressNumber'),
 (u'teller', 'StreetName'),
 (u'rd', 'StreetNamePostType'),
 (u'1', 'BuildingName'),
 (u'cripple', 'BuildingName'),
 (u'creek', 'PlaceName'),
 (u'co', 'StateName'),
 (u'80813', 'ZipCode')]

In [6]: usaddress.parse(u'17030 Lakeside Hills Plz suite 110 Omaha NE 68130')
Out[6]: 
[(u'17030', 'AddressNumber'),
 (u'Lakeside', 'StreetName'),
 (u'Hills', 'StreetName'),
 (u'Plz', 'StreetNamePostType'),
 (u'suite', 'OccupancyType'),
 (u'110', 'OccupancyIdentifier'),
 (u'Omaha', 'PlaceName'),
 (u'NE', 'StateName'),
 (u'68130', 'ZipCode')]

In [7]: usaddress.parse(u'17030 Lakeside Hills Plz suite 110 Omaha NE 68130'.lower())
Out[7]: 
[(u'17030', 'BuildingName'),
 (u'lakeside', 'BuildingName'),
 (u'hills', 'BuildingName'),
 (u'plz', 'BuildingName'),
 (u'suite', 'OccupancyType'),
 (u'110', 'OccupancyIdentifier'),
 (u'omaha', 'PlaceName'),
 (u'ne', 'StateName'),
 (u'68130', 'ZipCode')]

In [8]: usaddress.parse(u'12800 Bothell Everett Hwy suite 120 Everett WA 98208')
Out[8]: 
[(u'12800', 'AddressNumber'),
 (u'Bothell', 'StreetName'),
 (u'Everett', 'StreetName'),
 (u'Hwy', 'StreetNamePostType'),
 (u'suite', 'OccupancyType'),
 (u'120', 'OccupancyIdentifier'),
 (u'Everett', 'PlaceName'),
 (u'WA', 'StateName'),
 (u'98208', 'ZipCode')]

In [9]: usaddress.parse(u'12800 Bothell Everett Hwy suite 120 Everett WA 98208'.lower())
Out[9]: 
[(u'12800', 'BuildingName'),
 (u'bothell', 'BuildingName'),
 (u'everett', 'BuildingName'),
 (u'hwy', 'BuildingName'),
 (u'suite', 'OccupancyType'),
 (u'120', 'OccupancyIdentifier'),
 (u'everett', 'PlaceName'),
 (u'wa', 'StateName'),
 (u'98208', 'ZipCode')]

In [10]: usaddress.parse(u'223 Chief Justice Cushing Hwy Cohasset MA 02025')
Out[10]: 
[(u'223', 'AddressNumber'),
 (u'Chief', 'StreetName'),
 (u'Justice', 'StreetName'),
 (u'Cushing', 'StreetName'),
 (u'Hwy', 'StreetNamePostType'),
 (u'Cohasset', 'PlaceName'),
 (u'MA', 'StateName'),
 (u'02025', 'ZipCode')]

In [11]: usaddress.parse(u'223 Chief Justice Cushing Hwy Cohasset MA 02025'.lower())
Out[11]: 
[(u'223', 'BuildingName'),
 (u'chief', 'BuildingName'),
 (u'justice', 'BuildingName'),
 (u'cushing', 'BuildingName'),
 (u'hwy', 'BuildingName'),
 (u'cohasset', 'PlaceName'),
 (u'ma', 'StateName'),
 (u'02025', 'ZipCode')]
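The cased/uncased comparisons above can be automated. Below is a minimal sketch of a helper that parses an address as given and again lowercased, then reports the tokens whose tags change. The helper name `case_sensitive_tokens` is hypothetical, and it assumes lowercasing does not change how the parser splits the string into tokens. So that the block runs without usaddress installed, the demo uses `toy_parse`, a made-up stand-in that is deliberately case-sensitive; it is not how usaddress tags tokens.

```python
def case_sensitive_tokens(parse, address):
    """Return (token, tag, tag_when_lowercased) for tokens whose tag depends on case.

    `parse` is any callable returning a list of (token, tag) pairs,
    for example usaddress.parse.
    """
    original = parse(address)
    lowered = parse(address.lower())
    if len(original) != len(lowered):
        raise ValueError("lowercasing changed the token count; cannot align")
    return [
        (token, tag, lowered_tag)
        for (token, tag), (_, lowered_tag) in zip(original, lowered)
        if tag != lowered_tag
    ]


def toy_parse(address):
    """Hypothetical stand-in parser: tags a token by whether it has an uppercase letter."""
    return [
        (token, "Cased" if token != token.lower() else "Uncased")
        for token in address.split()
    ]


for row in case_sensitive_tokens(toy_parse, "140 Metro park"):
    print(row)
```

With usaddress installed, the same check would be `case_sensitive_tokens(usaddress.parse, u'140 Metro park suite a rochester ny 14623')`; an empty list means the tags did not depend on case for that address.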