datamade / usaddress

:us: a python library for parsing unstructured United States address strings into address components
https://parserator.datamade.us/usaddress
MIT License

OccupancyType not recognized #152

Open wangzhixuan opened 7 years ago

wangzhixuan commented 7 years ago

It seems to me that some valid occupancy types are never recognized correctly, for example LOT, TRLR, and SPC.

>>> import usaddress
>>> usaddress.tag('451 County Route 11 LOT 56, West Monroe, NY 13167')
(OrderedDict([('AddressNumber', u'451'), ('StreetNamePreType', u'County Route'), ('StreetName', u'11 LOT 56'), ('PlaceName', u'West Monroe'), ('StateName', u'NY'), ('ZipCode', u'13167')]), 'Street Address')

jeancochrane commented 7 years ago

Hey @wangzhixuan,

Thanks for filing this! Seems like a similar issue to the one Ben identified in #132. We've been meaning to provide better support for lots and trailers. Any idea what SPC stands for?

If you want to add your own training data and submit a PR, we have a new guide up for that now. Otherwise, can you paste in a few examples for each occupancy type? 4-6 examples for each pattern should be enough.

wangzhixuan commented 7 years ago

There are actually more cases, but the three I mentioned above are the most common occupancy types that usaddress cannot recognize. I remember seeing others as well, such as SIDE, FRNT, BAY, and PH.

You can find the meaning of those abbreviations here: http://www.expertmarket.com/USPS-street-suffix

fgregg commented 7 years ago

In addition to more training data, we could do something for occupancy types like we did for street types and directionals: https://github.com/datamade/usaddress/blob/master/usaddress/__init__.py#L262
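
For illustration only, a lookup-based token feature of the kind suggested above might look roughly like the sketch below. The names `OCCUPANCY_TYPES` and `occupancy_feature` are hypothetical and are not part of usaddress; the set holds only the abbreviations mentioned in this thread, and how such a flag would be wired into the model's feature dictionary is an assumption.

```python
import re

# Hypothetical lookup set, NOT part of usaddress. It contains only the
# occupancy-type abbreviations mentioned in this thread.
OCCUPANCY_TYPES = {'lot', 'trlr', 'spc', 'side', 'frnt', 'bay', 'ph'}


def occupancy_feature(token):
    """Return True if the token looks like a known occupancy type.

    The token is lowercased and stripped of non-word characters before
    the lookup, so 'LOT', 'Trlr' and 'Spc.' all match.
    """
    token_abbrev = re.sub(r'[^\w]', '', token.lower())
    return token_abbrev in OCCUPANCY_TYPES
```

A boolean like this would presumably be added alongside the existing street type and directional flags, giving the model a direct hint even when the training data has few examples of a given abbreviation.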

wangzhixuan commented 7 years ago

@fgregg I agree with you.

jeancochrane commented 7 years ago

@fgregg @wangzhixuan Agreed!

wangzhixuan commented 7 years ago

@jeancochrane I tried retraining the model myself, but nosetests then reported too many failing cases. I don't have time to look into them, so I decided to post my additional training examples here.

"451 County Route 11 LOT 56, West Monroe, NY 13167"
"192 State Highway 1959 Lot 26, Grayson, KY 41143"
"12475 State Highway 180 LOT 26, Gulf Shores, AL 36542"
"W7772 Wisconsin Pkwy Lot 8B, Delavan, Wisconsin, 53115"
"13809 Bandera St Trlr 3, Houston, TX, 77015"
"TRLR 153-168, 8028 Wichita St, Fort Worth, TX 76140"
"6485 Us Highway 10 W Trlr 52 , Missoula, Montana, 59808"
"900 Broken Feather Trl TRLR 324, Pflugerville, TX 78660"
"641 N SCRAPER ST TRLR 3, VINITA OK 74301"
"176 SE COUNTY ROAD Y LOT 25, WARRENSBURG MO 64093"
"3933 E AZ Highway 260 Spc 155. Payson, AZ 85541"
"1624 N Coast Highway 101 Spc 53, Encinitas, CA 92024"
"601 Pacheco Rd SPC 116, Bakersfield, CA, 93307"
"9020 W Avenue J Spc 25, Lancaster, CA 93536"
"351 H Avenue, Building 442, San Francisco, CA 94130"

These are all addresses from the internet.
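
One way to check which examples a given model still misses is a small helper like the sketch below. `missing_occupancy` is a hypothetical name, not a usaddress function; it assumes only that the tagger returns a `(components, address_type)` tuple, as `usaddress.tag` does in the output at the top of this issue.

```python
def missing_occupancy(addresses, tag):
    """Return the addresses whose tagged output has no OccupancyType.

    `tag` is any callable with the same return shape as usaddress.tag:
    a (components_mapping, address_type) tuple.
    """
    missed = []
    for address in addresses:
        components, _address_type = tag(address)
        if 'OccupancyType' not in components:
            missed.append(address)
    return missed
```

With usaddress installed, calling `missing_occupancy(examples, usaddress.tag)` on the list above would show which addresses remain unrecognized before and after retraining. Note that `usaddress.tag` can raise an exception for strings it cannot tag consistently, which this sketch does not handle.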

jeancochrane commented 7 years ago

I'm sorry to hear that! If you have time to sort through the testing failures I'd be happy to provide some help troubleshooting. Thanks for pasting the examples here – I can take a stab at a fix when I'm back from vacation in two weeks.