GreenBuildingRegistry / usaddress-scourgify

Clean US addresses following USPS pub 28 and RESO guidelines
MIT License
201 stars 47 forks

`normalize_address_record()` raises unparseable address error when using full street directionals #31

Closed: philiporlando closed this issue 8 months ago

philiporlando commented 1 year ago

The example below raises an unparseable address error:

from scourgify import normalize_address_record

address = "38350 40TH ST EAST 100 PALMDALE CA 93552"

normalize_address_record(address)
# scourgify.exceptions.UnParseableAddressError: UNPARSEABLE ADDRESS: Unable to break this address into its component parts, OrderedDict([('address_line_1', '38350 40TH ST EAST 100 PALMDALE CA 93552'), ('address_line_2', None), ('city', None), ('state', None), ('postal_code', None)])

Abbreviating the street directional value (changing EAST to E) avoids this error and produces the expected results:

from scourgify import normalize_address_record

address = "38350 40TH ST E 100 PALMDALE CA 93552"

normalize_address_record(address)
# OrderedDict([('address_line_1', '38350 40TH ST E'), ('address_line_2', 'UNIT 100'), ('city', 'PALMDALE'), ('state', 'CA'), ('postal_code', '93552')])

Is it possible to look into this and ensure that full directional names do not raise unparseable address errors? The USPS prefers abbreviated directionals, but still considers full names acceptable.
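
In the meantime, one stopgap is to abbreviate the directional before calling `normalize_address_record()`. Below is a minimal sketch of that idea using only the standard library. The function name `abbreviate_post_directional` and both lookup tables are hypothetical; they are not part of scourgify or usaddress. The rule is deliberately narrow: it only rewrites a spelled-out directional that directly follows a street suffix. A directional that precedes the suffix is left alone, and a city name that starts with a directional right after the suffix (for example "MAIN ST EAST PALO ALTO") would be rewritten incorrectly.

```python
# Hypothetical lookup tables for this sketch only (not from scourgify/usaddress).
DIRECTIONALS = {
    "NORTH": "N", "SOUTH": "S", "EAST": "E", "WEST": "W",
    "NORTHEAST": "NE", "NORTHWEST": "NW",
    "SOUTHEAST": "SE", "SOUTHWEST": "SW",
}
STREET_SUFFIXES = {
    "ST", "STREET", "AVE", "AVENUE", "DR", "DRIVE",
    "RD", "ROAD", "BLVD", "LN", "LANE",
}


def abbreviate_post_directional(address: str) -> str:
    """Abbreviate a full directional that immediately follows a street suffix."""
    tokens = address.split()
    result = []
    for i, token in enumerate(tokens):
        core = token.rstrip(",.")        # token without trailing punctuation
        trailing = token[len(core):]     # the punctuation we stripped, if any
        previous = tokens[i - 1].rstrip(",.").upper() if i > 0 else ""
        if core.upper() in DIRECTIONALS and previous in STREET_SUFFIXES:
            result.append(DIRECTIONALS[core.upper()] + trailing)
        else:
            result.append(token)
    return " ".join(result)


cleaned = abbreviate_post_directional("38350 40TH ST EAST 100 PALMDALE CA 93552")
print(cleaned)  # 38350 40TH ST E 100 PALMDALE CA 93552
```

The cleaned string matches the abbreviated form shown above, so it could then be passed to `normalize_address_record()`.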

Please let me know if you have any questions about this. Thank you in advance for your help troubleshooting this!

zak-flex commented 1 year ago

I have a similar issue with this address: 1345 Towne Lake Hills South Drive, Woodstock, GA, 30189

This variation is parseable: 1345 Towne Lake Hills S Dr, Woodstock, GA, 30189

fablet commented 9 months ago

Unfortunately, this is an issue with the usaddress package. You can check its tagging behavior in their UI: https://parserator.datamade.us/usaddress/ The usaddress.tag result for the original address is:

PARSED TOKENS:    [('38350', 'AddressNumber'), ('40TH', 'StreetName'), ('ST', 'StreetNamePostType'), ('EAST', 'StreetNamePreDirectional'), ('100', 'StreetName'), ('PALMDALE', 'PlaceName'), ('CA', 'StateName'), ('93552', 'ZipCode')]
UNCERTAIN LABEL:  StreetName

You can see usaddress is incorrectly identifying the post-directional as a pre-directional, which is causing it to identify the street name a second time.

Compare that with the result for `38350 40TH ST E 100 PALMDALE CA 93552`:

(OrderedDict([('AddressNumber', '38350'), ('StreetName', '40TH'), ('StreetNamePostType', 'ST'), ('StreetNamePostDirectional', 'E'), ('OccupancyIdentifier', '100'), ('PlaceName', 'PALMDALE'), ('StateName', 'CA'), ('ZipCode', '93552')]), 'Street Address')
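
To make the difference between the two results concrete: `tag()` needs each label to cover a single contiguous run of tokens, so a label that reappears after a different label triggers the "more than one area of the string has the same label" error. The sketch below is a simplified stand-in for that check, not usaddress's actual implementation. `find_repeated_labels` is a hypothetical helper, and the token lists are copied from the two results above.

```python
def find_repeated_labels(parsed_tokens):
    """Return labels that show up in more than one non-adjacent run of tokens."""
    seen = set()
    repeated = []
    previous = None
    for _token, label in parsed_tokens:
        if label != previous:            # a new run of this label starts here
            if label in seen and label not in repeated:
                repeated.append(label)   # label already used by an earlier run
            seen.add(label)
        previous = label
    return repeated


# Token/label pairs for "38350 40TH ST EAST 100 PALMDALE CA 93552"
full_directional = [
    ("38350", "AddressNumber"), ("40TH", "StreetName"),
    ("ST", "StreetNamePostType"), ("EAST", "StreetNamePreDirectional"),
    ("100", "StreetName"), ("PALMDALE", "PlaceName"),
    ("CA", "StateName"), ("93552", "ZipCode"),
]

# Token/label pairs for "38350 40TH ST E 100 PALMDALE CA 93552"
abbreviated_directional = [
    ("38350", "AddressNumber"), ("40TH", "StreetName"),
    ("ST", "StreetNamePostType"), ("E", "StreetNamePostDirectional"),
    ("100", "OccupancyIdentifier"), ("PALMDALE", "PlaceName"),
    ("CA", "StateName"), ("93552", "ZipCode"),
]

print(find_repeated_labels(full_directional))         # ['StreetName']
print(find_repeated_labels(abbreviated_directional))  # []
```

In the first list, '100' is labeled StreetName after the pre-directional has already interrupted the earlier StreetName run, which is why only that address fails.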


This issue needs to be resubmitted to that package: https://github.com/datamade/usaddress/issues

philiporlando commented 8 months ago

@fablet, I appreciate your input. Like you, I also encountered the parsing error using the Parserator API at https://parserator.datamade.us/usaddress.

However, I've successfully used usaddress.parse() with the address "38350 40TH ST EAST 100 PALMDALE CA 93552" with usaddress version 0.5.10:

import usaddress

address = "38350 40TH ST EAST 100 PALMDALE CA 93552"

print(usaddress.parse(address))

# [('38350', 'AddressNumber'), ('40TH', 'StreetName'), ('ST', 'StreetNamePostType'), ('EAST', 'StreetNamePreDirectional'), ('100', 'StreetName'), ('PALMDALE', 'PlaceName'), ('CA', 'StateName'), ('93552', 'ZipCode')]

It seems the latest version of usaddress might have resolved this pre- vs post-directional issue. However, I'm uncertain which usaddress version the Parserator API uses.

Unfortunately, usaddress.tag() now raises a duplicate street name error when using the latest version:

import usaddress

address = "38350 40TH ST EAST 100 PALMDALE CA 93552"

usaddress.tag(address)

# Traceback (most recent call last):
#   File "/home/user/usaddress_parse_error/usaddress_parse_error.py", line 5, in <module>
#     usaddress.tag(address)
#   File "/home/user/.cache/pypoetry/virtualenvs/usaddress-parse-error-aadNbsKj-py3.10/lib/python3.10/site-packages/usaddress/__init__.py", line 177, in tag
#     raise RepeatedLabelError(address_string, parse(address_string),
# usaddress.RepeatedLabelError: 
# ERROR: Unable to tag this string because more than one area of the string has the same label

# ORIGINAL STRING:  38350 40TH ST EAST 100 PALMDALE CA 93552
# PARSED TOKENS:    [('38350', 'AddressNumber'), ('40TH', 'StreetName'), ('ST', 'StreetNamePostType'), ('EAST', 'StreetNamePreDirectional'), ('100', 'StreetName'), ('PALMDALE', 'PlaceName'), ('CA', 'StateName'), ('93552', 'ZipCode')]
# UNCERTAIN LABEL:  StreetName

# When this error is raised, it's likely that either (1) the string is not a valid person/corporation name or (2) some tokens were labeled incorrectly

# To report an error in labeling a valid name, open an issue at https://github.com/datamade/usaddress/issues/new - it'll help us continue to improve probablepeople!

# For more information, see the documentation at https://usaddress.readthedocs.io/

So it seems that we are trading one parsing error for another. That being said, the newest version of usaddress.parse() is working for me, which is the function that I need for my business case.

Do you know if there are plans to update usaddress-scourgify's dependency on usaddress from 0.5.9 to 0.5.10 in the near future? I'm hoping that this would avoid the error I'm seeing with normalize_address_record().

Thank you again for assisting with this issue.

philiporlando commented 8 months ago

Ok, I just tried forking this repo and updating its usaddress dependency to 0.5.10.

Unfortunately, this did not resolve my issues:

from scourgify import normalize_address_record

address = "38350 40TH ST EAST 100 PALMDALE CA 93552"

normalize_address_record(address)

# Traceback (most recent call last):
#   File "/home/user/usaddress_parse_error/usaddress_parse_error.py", line 5, in <module>
#     normalize_address_record(address)
#   File "/home/user/.cache/pypoetry/virtualenvs/usaddress-parse-error-aadNbsKj-py3.10/lib/python3.10/site-packages/scourgify/normalize.py", line 159, in normalize_address_record
#     return normalize_addr_str(
#   File "/home/user/.cache/pypoetry/virtualenvs/usaddress-parse-error-aadNbsKj-py3.10/lib/python3.10/site-packages/scourgify/normalize.py", line 267, in normalize_addr_str
#     raise UnParseableAddressError(None, None, addr_rec)
# scourgify.exceptions.UnParseableAddressError: UNPARSEABLE ADDRESS: Unable to break this address into its component parts, OrderedDict([('address_line_1', '38350 40TH ST EAST 100 PALMDALE CA 93552'), ('address_line_2', None), ('city', None), ('state', None), ('postal_code', None)])

It probably makes the most sense to open a new issue within the usaddress repo and try to address the error with usaddress.tag().

philiporlando commented 8 months ago

@fablet, I've opened this issue to address the root of the problem. Thanks again for the support.