datamade / usaddress

:us: a python library for parsing unstructured United States address strings into address components
https://parserator.datamade.us/usaddress
MIT License

us-ia-linn labeling errors #46

Closed cathydeng closed 9 years ago

cathydeng commented 10 years ago

tested the parser on OpenAddresses data for Linn County, Iowa: 1836 failures out of 95164 addresses (1.9%).

since these are all the addresses within a county, many of the failures are essentially the same errors, repeated for various address/unit numbers. I scrolled through all the failures, and here's a representative sample (address string, predicted labels, true labels):

1.

FAIL: test_labeling.TestOpenaddress.test_us_ia_linn('180 S 19th Street Ct Marion IA 52302\n ', ('AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType', 'PlaceName', 'PlaceName', 'StateName', 'ZipCode'), ('AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetName', 'StreetNamePostType', 'PlaceName', 'StateName', 'ZipCode')) ———————————————————————————————————

2.

FAIL: test_labeling.TestOpenaddress.test_us_ia_linn('1115 Indian Creek Cir Marion IA 52302\n ', ('AddressNumber', 'StreetName', 'StreetNamePostType', 'PlaceName', 'PlaceName', 'StateName', 'ZipCode'), ('AddressNumber', 'StreetName', 'StreetName', 'StreetNamePostType', 'PlaceName', 'StateName', 'ZipCode')) ———————————————————————————————————

3.

FAIL: test_labeling.TestOpenaddress.test_us_ia_linn('3942 21st Avenue Pl SW Unit 4 Cedar Rapids IA 52404\n ', ('AddressNumber', 'StreetName', 'StreetNamePostType', 'StreetNamePostType', 'StreetNamePostDirectional', 'OccupancyType', 'OccupancyIdentifier', 'PlaceName', 'PlaceName', 'StateName', 'ZipCode'), ('AddressNumber', 'StreetName', 'StreetName', 'StreetNamePostType', 'StreetNamePostDirectional', 'OccupancyType', 'OccupancyIdentifier', 'PlaceName', 'PlaceName', 'StateName', 'ZipCode')) ———————————————————————————————————

4.

FAIL: test_labeling.TestOpenaddress.test_us_ia_linn('4101 16th Ave SW Trlr 26 Cedar Rapids IA 52404\n ', ('AddressNumber', 'StreetName', 'StreetNamePostType', 'StreetNamePostDirectional', 'StreetNamePreType', 'StreetName', 'StreetName', 'PlaceName', 'StateName', 'ZipCode'), ('AddressNumber', 'StreetName', 'StreetNamePostType', 'StreetNamePostDirectional', 'OccupancyType', 'OccupancyIdentifier', 'PlaceName', 'PlaceName', 'StateName', 'ZipCode')) ———————————————————————————————————

5.

FAIL: test_labeling.TestOpenaddress.test_us_ia_linn('1341 39th Street Pl Marion IA 52302\n ', ('AddressNumber', 'StreetName', 'StreetNamePostType', 'PlaceName', 'PlaceName', 'StateName', 'ZipCode'), ('AddressNumber', 'StreetName', 'StreetName', 'StreetNamePostType', 'PlaceName', 'StateName', 'ZipCode')) ———————————————————————————————————

6.

FAIL: test_labeling.TestOpenaddress.test_us_ia_linn('2510 Heather View Cir Marion IA 52302\n ', ('AddressNumber', 'StreetName', 'StreetNamePostType', 'PlaceName', 'PlaceName', 'StateName', 'ZipCode'), ('AddressNumber', 'StreetName', 'StreetName', 'StreetNamePostType', 'PlaceName', 'StateName', 'ZipCode')) ———————————————————————————————————

7.

FAIL: test_labeling.TestOpenaddress.test_us_ia_linn('916 E Ave NW Cedar Rapids IA 52405\n ', ('AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostDirectional', 'PlaceName', 'PlaceName', 'StateName', 'ZipCode'), ('AddressNumber', 'StreetName', 'StreetNamePostType', 'StreetNamePostDirectional', 'PlaceName', 'PlaceName', 'StateName', 'ZipCode')) ———————————————————————————————————

8.

FAIL: test_labeling.TestOpenaddress.test_us_ia_linn('1113 6th St SE D Cedar Rapids IA 52401\n ', ('AddressNumber', 'StreetName', 'StreetNamePostType', 'StreetNamePostDirectional', 'StreetName', 'StreetNamePostType', 'PlaceName', 'StateName', 'ZipCode'), ('AddressNumber', 'StreetName', 'StreetNamePostType', 'StreetNamePostDirectional', 'OccupancyIdentifier', 'PlaceName', 'PlaceName', 'StateName', 'ZipCode')) ———————————————————————————————————

9.

FAIL: test_labeling.TestOpenaddress.test_us_ia_linn('1238 O Avenue Pl NE Cedar Rapids IA 52402\n ', ('AddressNumber', 'StreetName', 'StreetNamePostType', 'StreetNamePostType', 'StreetNamePostDirectional', 'PlaceName', 'PlaceName', 'StateName', 'ZipCode'), ('AddressNumber', 'StreetName', 'StreetName', 'StreetNamePostType', 'StreetNamePostDirectional', 'PlaceName', 'PlaceName', 'StateName', 'ZipCode')) ———————————————————————————————————

10.

FAIL: test_labeling.TestOpenaddress.test_us_ia_linn('1300 Oakland Rd NE Bldg 5 Cedar Rapids IA 52402\n ', ('AddressNumber', 'StreetName', 'StreetNamePostType', 'StreetNamePostDirectional', 'StreetNamePreType', 'StreetName', 'StreetName', 'PlaceName', 'StateName', 'ZipCode'), ('AddressNumber', 'StreetName', 'StreetNamePostType', 'StreetNamePostDirectional', 'OccupancyType', 'OccupancyIdentifier', 'PlaceName', 'PlaceName', 'StateName', 'ZipCode')) ———————————————————————————————————

11.

FAIL: test_labeling.TestOpenaddress.test_us_ia_linn('3043 1/2 Leonard St NE Cedar Rapids IA 52402\n ', ('AddressNumber', 'AddressNumberSuffix', 'StreetName', 'StreetNamePostType', 'StreetNamePostDirectional', 'PlaceName', 'PlaceName', 'StateName', 'ZipCode'), ('AddressNumber', 'AddressNumber', 'StreetName', 'StreetNamePostType', 'StreetNamePostDirectional', 'PlaceName', 'PlaceName', 'StateName', 'ZipCode')) ———————————————————————————————————

12.

FAIL: test_labeling.TestOpenaddress.test_us_ia_linn('1702 Hunters Creek Way Marion IA 52302\n ', ('AddressNumber', 'StreetName', 'StreetNamePostType', 'PlaceName', 'PlaceName', 'StateName', 'ZipCode'), ('AddressNumber', 'StreetName', 'StreetName', 'StreetNamePostType', 'PlaceName', 'StateName', 'ZipCode')) ———————————————————————————————————

13.

FAIL: test_labeling.TestOpenaddress.test_us_ia_linn('9 Chapelridge Cir Apt E Marion IA 52302\n ', ('AddressNumber', 'StreetName', 'StreetName', 'StreetNamePostType', 'StreetNamePostDirectional', 'PlaceName', 'StateName', 'ZipCode'), ('AddressNumber', 'StreetName', 'StreetNamePostType', 'OccupancyType', 'OccupancyIdentifier', 'PlaceName', 'StateName', 'ZipCode')) ———————————————————————————————————

14.

FAIL: test_labeling.TestOpenaddress.test_us_ia_linn('2351 Blairs Ferry Rd NE Bldg S3 Cedar Rapids IA 52402\n ', ('AddressNumber', 'StreetName', 'StreetName', 'StreetNamePostType', 'StreetNamePostDirectional', 'StreetName', 'StreetNamePostType', 'PlaceName', 'PlaceName', 'StateName', 'ZipCode'), ('AddressNumber', 'StreetName', 'StreetName', 'StreetNamePostType', 'StreetNamePostDirectional', 'OccupancyType', 'OccupancyIdentifier', 'PlaceName', 'PlaceName', 'StateName', 'ZipCode')) ———————————————————————————————————

15.

FAIL: test_labeling.TestOpenaddress.test_us_ia_linn('2415 Grey Wolf Hiawatha IA 52233\n ', ('AddressNumber', 'StreetName', 'StreetNamePostType', 'PlaceName', 'StateName', 'ZipCode'), ('AddressNumber', 'StreetName', 'StreetName', 'PlaceName', 'StateName', 'ZipCode')) ———————————————————————————————————

16.

FAIL: test_labeling.TestOpenaddress.test_us_ia_linn('550 West Side Pl SW Cedar Rapids IA 52404\n ', ('AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType', 'StreetNamePostDirectional', 'PlaceName', 'PlaceName', 'StateName', 'ZipCode'), ('AddressNumber', 'StreetName', 'StreetName', 'StreetNamePostType', 'StreetNamePostDirectional', 'PlaceName', 'PlaceName', 'StateName', 'ZipCode')) ———————————————————————————————————

17.

FAIL: test_labeling.TestOpenaddress.test_us_ia_linn('3397 C Avenue Ext Marion IA 52302\n ', ('AddressNumber', 'StreetName', 'StreetNamePostType', 'PlaceName', 'PlaceName', 'StateName', 'ZipCode'), ('AddressNumber', 'StreetName', 'StreetName', 'StreetNamePostType', 'PlaceName', 'StateName', 'ZipCode')) ———————————————————————————————————

18.

FAIL: test_labeling.TestOpenaddress.test_us_ia_linn('4997 Hwy 13 Central City IA 52214\n ', ('AddressNumber', 'StreetNamePreType', 'StreetName', 'StreetName', 'PlaceName', 'StateName', 'ZipCode'), ('AddressNumber', 'StreetNamePreType', 'StreetName', 'PlaceName', 'PlaceName', 'StateName', 'ZipCode'))
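The failure tuples above are easier to read when the two label sequences are lined up token by token. Below is a minimal sketch of such a comparison. It is not part of the usaddress test suite: the helper name `label_mismatches` is hypothetical, and it assumes the address splits on whitespace into exactly one token per label, which holds for the samples listed here but is not necessarily how usaddress tokenizes every address.

```python
def label_mismatches(address, predicted, expected):
    """Return (token, predicted_label, expected_label) for each disagreement.

    Assumption: whitespace tokenization yields one token per label.
    """
    tokens = address.split()
    if not (len(tokens) == len(predicted) == len(expected)):
        raise ValueError("token and label counts differ")
    return [
        (tok, pred, exp)
        for tok, pred, exp in zip(tokens, predicted, expected)
        if pred != exp
    ]


# Failure 1 from the list above.
address = "180 S 19th Street Ct Marion IA 52302\n "
predicted = ("AddressNumber", "StreetNamePreDirectional", "StreetName",
             "StreetNamePostType", "PlaceName", "PlaceName",
             "StateName", "ZipCode")
expected = ("AddressNumber", "StreetNamePreDirectional", "StreetName",
            "StreetName", "StreetNamePostType", "PlaceName",
            "StateName", "ZipCode")

for token, pred, exp in label_mismatches(address, predicted, expected):
    print(f"{token}: predicted {pred}, expected {exp}")
```

For failure 1 this isolates the two tokens in dispute, "Street" and "Ct", which makes it clearer that the parser treated "Street" as the street type and pushed "Ct" into the place name.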

fgregg commented 10 years ago

I'd like us to use these failures to find real addresses on the internet and use those real addresses to train usaddress.

For example, usaddress is confused by "West Side Pl SW", so we can search https://www.google.com/?gws_rd=ssl#q=%22west+side+pl+sw%22 and find "551 WEST SIDE PL SW".

Or "Street Ct": https://www.google.com/?gws_rd=ssl#q=19th+Street+Ct gets us to 175 S 19th Street Ct, Marion IA, 52302

You are finding examples where the lack of punctuation between address parts confuses usaddress. Find real examples that have that pattern.

Make sense?

I'm also not sure whether to keep all these Iowa addresses as tests. Maybe we should use none of them, maybe only the hard ones? What do you think?

fgregg commented 10 years ago

In the Iowa data file, address number suffixes are mislabeled as address numbers (AddressNumber instead of AddressNumberSuffix), for example:

809 1/2 16th St SE Cedar Rapids IA 52403

fgregg commented 10 years ago

'Bldg A' should be labeled <SubAddressType>Bldg</SubAddressType> <SubAddressIdentifier>A</SubAddressIdentifier>
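As a rough illustration of that tagged form, here is a sketch that turns (token, label) pairs into label-tagged XML using only the standard library. The function name `to_labeled_xml` and the `AddressString` wrapper element are assumptions made for this example, not something confirmed in this thread; only the `SubAddressType` and `SubAddressIdentifier` labels come from the comment above.

```python
import xml.etree.ElementTree as ET


def to_labeled_xml(pairs, wrapper="AddressString"):
    """Wrap each token in an element named after its label.

    `wrapper` is a hypothetical container element name for this sketch.
    """
    root = ET.Element(wrapper)
    for i, (token, label) in enumerate(pairs):
        child = ET.SubElement(root, label)
        child.text = token
        # Separate consecutive tokens with a single space.
        if i < len(pairs) - 1:
            child.tail = " "
    return ET.tostring(root, encoding="unicode")


print(to_labeled_xml([("Bldg", "SubAddressType"),
                      ("A", "SubAddressIdentifier")]))
```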