datamade / usaddress

:us: a python library for parsing unstructured United States address strings into address components
https://parserator.datamade.us/usaddress
MIT License

Bad Parsing of Address #339

Open bschollnick opened 1 year ago

bschollnick commented 1 year ago

I'm using usaddress 0.5.10 under Python 3.10.1, calling `usaddress.tag`.

Case 1:

`usaddress.RepeatedLabelError: ERROR: Unable to tag this string because more than one area of the string has the same label`

`ORIGINAL STRING: 9999 Walker LK Ontario Road,Hilton, NY 14468,US`

`PARSED TOKENS: [('9999', 'AddressNumber'), ('Walker', 'StreetName'), ('LK', 'StreetNamePostType'), ('Ontario', 'StreetName'), ('Road,', 'StreetNamePostType'), ('Hilton,', 'PlaceName'), ('NY', 'StateName'), ('14468,', 'ZipCode'), ('US', 'CountryName')]`

`UNCERTAIN LABEL: StreetName`

It appears that LK, as an abbreviation for LAKE, isn't being processed correctly.
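Judging only from the PARSED TOKENS output quoted in this thread, the error seems to fire when the same label shows up in two separate areas of the string. Below is a minimal sketch of that kind of check, assuming that reading is right. `repeated_labels` is a hypothetical helper written for illustration, not part of usaddress, and the token lists are copied from the error messages in this issue.

```python
from itertools import groupby


def repeated_labels(tokens):
    """Return labels that occur in more than one non-adjacent run of tokens.

    `tokens` is a list of (text, label) pairs, in the same shape as the
    PARSED TOKENS shown in the error messages. Hypothetical helper, not
    usaddress code.
    """
    seen = set()
    repeated = []
    # groupby collapses consecutive tokens that share a label into one run
    for label, _run in groupby(tokens, key=lambda pair: pair[1]):
        if label in seen and label not in repeated:
            repeated.append(label)
        seen.add(label)
    return repeated


# Token lists copied from Case 1 and Case 3 of this issue
case_1 = [('9999', 'AddressNumber'), ('Walker', 'StreetName'),
          ('LK', 'StreetNamePostType'), ('Ontario', 'StreetName'),
          ('Road,', 'StreetNamePostType'), ('Hilton,', 'PlaceName'),
          ('NY', 'StateName'), ('14468,', 'ZipCode'), ('US', 'CountryName')]
case_3 = [('99999', 'AddressNumber'), ('Bristol', 'StreetName'),
          ('Blue', 'StreetName'), ('St,', 'StreetNamePostType'),
          ('Apex,', 'PlaceName'), ('NC', 'StateName'), ('27502', 'ZipCode'),
          ('4115,', 'ZipPlus4'), ('US', 'StateName')]

print(repeated_labels(case_1))  # ['StreetName', 'StreetNamePostType']
print(repeated_labels(case_3))  # ['StateName']
```

In Case 3 the two adjacent `StreetName` tokens (`Bristol`, `Blue`) are not flagged because they form a single run; only `StateName`, which appears on both `NC` and `US`, is reported.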

Case 2:

`usaddress.RepeatedLabelError: ERROR: Unable to tag this string because more than one area of the string has the same label`

`ORIGINAL STRING: Beech Street Corp PO Box 999999,Richardson, TX 75085-3925,US`

`PARSED TOKENS: [('Beech', 'StreetName'), ('Street', 'StreetNamePostType'), ('Corp', 'PlaceName'), ('PO', 'USPSBoxType'), ('Box', 'USPSBoxType'), ('999999,', 'USPSBoxID'), ('Richardson,', 'PlaceName'), ('TX', 'StateName'), ('75085-3925,', 'ZipCode'), ('US', 'CountryName')]`

`UNCERTAIN LABEL: PlaceName`

Case 3:

`ERROR: Unable to tag this string because more than one area of the string has the same label`

`ORIGINAL STRING: 99999 Bristol Blue St,Apex, NC 27502 4115,US`

`PARSED TOKENS: [('99999', 'AddressNumber'), ('Bristol', 'StreetName'), ('Blue', 'StreetName'), ('St,', 'StreetNamePostType'), ('Apex,', 'PlaceName'), ('NC', 'StateName'), ('27502', 'ZipCode'), ('4115,', 'ZipPlus4'), ('US', 'StateName')]`

`UNCERTAIN LABEL: StateName`

Changing case 2 to `Beech Street Corp, PO Box 999999,Richardson, TX 75085-3925,US` does parse, but I'm having issues with devising logic to handle this properly.

I have some situations where there are two address lines, and the parse failed, but succeeded when I removed the comma between the address lines.

Case 3 seems to be unaware of North Carolina?

Can you elaborate on the proper formatting of the input string? (e.g. include commas? Don't include line delimiters?)

The reason I ask is that I am seeing trailing commas on the tokens labeled StreetNamePostType, PlaceName, ZipCode, and so forth.
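For reference, here is a rough sketch of the kind of comma handling I have been experimenting with around the parser. Both `normalize_commas` and `strip_trailing_commas` are hypothetical helpers of my own, not usaddress functions, and I have not confirmed that either one changes how usaddress labels these strings.

```python
import re


def normalize_commas(address):
    """Put exactly one space after each comma and none before it.

    Hypothetical pre-processing step; whether it helps parsing is untested.
    """
    return re.sub(r"\s*,\s*", ", ", address).strip()


def strip_trailing_commas(tokens):
    """Drop trailing commas from the text of each (text, label) pair."""
    return [(text.rstrip(","), label) for text, label in tokens]


# Input string from Case 1 of this issue
raw = "9999 Walker LK Ontario Road,Hilton, NY 14468,US"
print(normalize_commas(raw))
# 9999 Walker LK Ontario Road, Hilton, NY 14468, US

# Two of the tokens from Case 1 that carry a trailing comma
print(strip_trailing_commas([('Road,', 'StreetNamePostType'),
                             ('Hilton,', 'PlaceName')]))
# [('Road', 'StreetNamePostType'), ('Hilton', 'PlaceName')]
```

The first helper only makes the spacing consistent before parsing; the second cleans up token text afterwards. Neither addresses the repeated-label errors themselves.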