Closed philiporlando closed 8 months ago
I have a similar issue with this address: `1345 Towne Lake Hills South Drive, Woodstock, GA, 30189`. This variation is parseable: `1345 Towne Lake Hills S Dr, Woodstock, GA, 30189`
Unfortunately, this is an issue with the usaddress package. You can check tagging behaviors in their UI: https://parserator.datamade.us/usaddress/
The usaddress.tag results for `38350 40TH ST EAST 100 PALMDALE CA 93552` are:
PARSED TOKENS: [('38350', 'AddressNumber'), ('40TH', 'StreetName'), ('ST', 'StreetNamePostType'), ('EAST', 'StreetNamePreDirectional'), ('100', 'StreetName'), ('PALMDALE', 'PlaceName'), ('CA', 'StateName'), ('93552', 'ZipCode')]
UNCERTAIN LABEL: StreetName
You can see usaddress is incorrectly identifying the post-directional as a pre-directional, which is causing it to identify the street name a second time.
Compare with `38350 40TH ST E 100 PALMDALE CA 93552`, which tags without error:
(OrderedDict([('AddressNumber', '38350'), ('StreetName', '40TH'), ('StreetNamePostType', 'ST'), ('StreetNamePostDirectional', 'E'), ('OccupancyIdentifier', '100'), ('PlaceName', 'PALMDALE'), ('StateName', 'CA'), ('ZipCode', '93552')]), 'Street Address')
This issue needs to be resubmitted to that package: https://github.com/datamade/usaddress/issues
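To make the difference between the two results above easier to see, here is a minimal sketch that scans a list of (token, label) pairs for a label that reappears after a different label. It only approximates the condition that appears to trigger the repeated-label error and is not taken from the usaddress source. `find_repeated_labels` is a hypothetical helper, and the second list is reconstructed from the tag result quoted above.

```python
def find_repeated_labels(parsed):
    """Return labels that occur in more than one non-adjacent run of tokens."""
    seen = set()
    repeated = []
    previous = None
    for _token, label in parsed:
        if label != previous:
            # A label that starts a new run after already being seen is a repeat.
            if label in seen and label not in repeated:
                repeated.append(label)
            seen.add(label)
        previous = label
    return repeated


# Pairs copied from the outputs quoted in this thread.
full_directional = [
    ("38350", "AddressNumber"), ("40TH", "StreetName"),
    ("ST", "StreetNamePostType"), ("EAST", "StreetNamePreDirectional"),
    ("100", "StreetName"), ("PALMDALE", "PlaceName"),
    ("CA", "StateName"), ("93552", "ZipCode"),
]
abbreviated_directional = [
    ("38350", "AddressNumber"), ("40TH", "StreetName"),
    ("ST", "StreetNamePostType"), ("E", "StreetNamePostDirectional"),
    ("100", "OccupancyIdentifier"), ("PALMDALE", "PlaceName"),
    ("CA", "StateName"), ("93552", "ZipCode"),
]

print(find_repeated_labels(full_directional))         # ['StreetName']
print(find_repeated_labels(abbreviated_directional))  # []
```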
@fablet, I appreciate your input. Like you, I also encountered the parsing error using the Parserator API at https://parserator.datamade.us/usaddress.
However, I've successfully used usaddress.parse() on the address "38350 40TH ST EAST 100 PALMDALE CA 93552" with usaddress version 0.5.10:
import usaddress
address = "38350 40TH ST EAST 100 PALMDALE CA 93552"
print(usaddress.parse(address))
# [('38350', 'AddressNumber'), ('40TH', 'StreetName'), ('ST', 'StreetNamePostType'), ('EAST', 'StreetNamePreDirectional'), ('100', 'StreetName'), ('PALMDALE', 'PlaceName'), ('CA', 'StateName'), ('93552', 'ZipCode')]
It seems the latest version of usaddress might have resolved this pre- vs post-directional issue. However, I'm uncertain which usaddress version the Parserator API uses.
Unfortunately, usaddress.tag() now raises a duplicate street name error when using the latest version:
import usaddress
address = "38350 40TH ST EAST 100 PALMDALE CA 93552"
usaddress.tag(address)
# Traceback (most recent call last):
# File "/home/user/usaddress_parse_error/usaddress_parse_error.py", line 5, in <module>
# usaddress.tag(address)
# File "/home/user/.cache/pypoetry/virtualenvs/usaddress-parse-error-aadNbsKj-py3.10/lib/python3.10/site-packages/usaddress/__init__.py", line 177, in tag
# raise RepeatedLabelError(address_string, parse(address_string),
# usaddress.RepeatedLabelError:
# ERROR: Unable to tag this string because more than one area of the string has the same label
# ORIGINAL STRING: 38350 40TH ST EAST 100 PALMDALE CA 93552
# PARSED TOKENS: [('38350', 'AddressNumber'), ('40TH', 'StreetName'), ('ST', 'StreetNamePostType'), ('EAST', 'StreetNamePreDirectional'), ('100', 'StreetName'), ('PALMDALE', 'PlaceName'), ('CA', 'StateName'), ('93552', 'ZipCode')]
# UNCERTAIN LABEL: StreetName
# When this error is raised, it's likely that either (1) the string is not a valid person/corporation name or (2) some tokens were labeled incorrectly
# To report an error in labeling a valid name, open an issue at https://github.com/datamade/usaddress/issues/new - it'll help us continue to improve probablepeople!
# For more information, see the documentation at https://usaddress.readthedocs.io/
So it seems that we are trading one parsing error for another. That being said, the newest version of usaddress.parse() is working for me, which is the function that I need for my business case.
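For anyone who wants tag() when it works and parse() otherwise, a rough sketch of that fallback follows. It is only an illustration under assumptions: `tag_or_parse` and the returned dictionary keys are hypothetical, and the tagger, parser, and exception class are passed in as arguments so the snippet runs without usaddress installed.

```python
def tag_or_parse(address, tag, parse, repeated_label_error):
    """Try the stricter tag() first; fall back to parse() on a repeated label.

    tag, parse, and repeated_label_error are injected so this sketch does not
    import usaddress itself.
    """
    try:
        components, address_type = tag(address)
    except repeated_label_error:
        # tag() refused the string; return the raw (token, label) pairs instead.
        return {"components": None, "address_type": None, "tokens": parse(address)}
    return {"components": dict(components), "address_type": address_type, "tokens": None}
```

With usaddress installed, the call would presumably look like `tag_or_parse(address, usaddress.tag, usaddress.parse, usaddress.RepeatedLabelError)`. Note that the fallback tokens can still carry the mislabeled directional shown above, so this avoids the exception but does not fix the labels.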
Do you know if there are plans to update usaddress-scourgify's dependency on usaddress from 0.5.9 to 0.5.10 in the near future? I'm hoping that this would avoid the error I'm seeing with normalize_address_record().
Thank you again for assisting with this issue.
Ok, I just tried forking this repo and updating its usaddress dependency to 0.5.10.
Unfortunately, this did not resolve my issues:
from scourgify import normalize_address_record
address = "38350 40TH ST EAST 100 PALMDALE CA 93552"
normalize_address_record(address)
# Traceback (most recent call last):
# File "/home/user/usaddress_parse_error/usaddress_parse_error.py", line 5, in <module>
# normalize_address_record(address)
# File "/home/user/.cache/pypoetry/virtualenvs/usaddress-parse-error-aadNbsKj-py3.10/lib/python3.10/site-packages/scourgify/normalize.py", line 159, in normalize_address_record
# return normalize_addr_str(
# File "/home/user/.cache/pypoetry/virtualenvs/usaddress-parse-error-aadNbsKj-py3.10/lib/python3.10/site-packages/scourgify/normalize.py", line 267, in normalize_addr_str
# raise UnParseableAddressError(None, None, addr_rec)
# scourgify.exceptions.UnParseableAddressError: UNPARSEABLE ADDRESS: Unable to break this address into its component parts, OrderedDict([('address_line_1', '38350 40TH ST EAST 100 PALMDALE CA 93552'), ('address_line_2', None), ('city', None), ('state', None), ('postal_code', None)])
It probably makes the most sense to open a new issue within the usaddress repo and try to address the error with usaddress.tag().
@fablet, I've opened this issue to address the root of the problem. Thanks again for the support.
The below example raises an unparseable address error:
Abbreviating the street directional value (changing `EAST` to `E`) avoids this error and produces the expected results:
Is it possible to look into this and ensure that full directional names do not raise unparseable address errors? The USPS prefers abbreviated directionals, but still considers full names acceptable.
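Until this is handled upstream, one possible workaround is to abbreviate full directionals before the address reaches the parser. The sketch below is a naive illustration, not part of scourgify or usaddress: `abbreviate_post_directionals` is a hypothetical helper, and `STREET_TYPES` is a small illustrative set rather than the full USPS suffix list. It only rewrites a directional that directly follows a street type, so a case like `Towne Lake Hills South Drive` is left untouched.

```python
# USPS abbreviations for full directional words.
DIRECTIONAL_ABBREVIATIONS = {
    "NORTH": "N", "SOUTH": "S", "EAST": "E", "WEST": "W",
    "NORTHEAST": "NE", "NORTHWEST": "NW",
    "SOUTHEAST": "SE", "SOUTHWEST": "SW",
}

# Illustrative subset only; an assumption, not the complete USPS suffix list.
STREET_TYPES = {
    "ST", "STREET", "AVE", "AVENUE", "RD", "ROAD",
    "DR", "DRIVE", "BLVD", "LN", "LANE",
}


def abbreviate_post_directionals(address: str) -> str:
    """Abbreviate a full directional word that directly follows a street type."""
    tokens = address.split()
    result = []
    for i, token in enumerate(tokens):
        stripped = token.rstrip(",.")
        trailing = token[len(stripped):]  # keep any trailing comma or period
        previous = tokens[i - 1].rstrip(",.").upper() if i > 0 else ""
        if stripped.upper() in DIRECTIONAL_ABBREVIATIONS and previous in STREET_TYPES:
            token = DIRECTIONAL_ABBREVIATIONS[stripped.upper()] + trailing
        result.append(token)
    # Joining on single spaces collapses any repeated whitespace in the input.
    return " ".join(result)


print(abbreviate_post_directionals("38350 40TH ST EAST 100 PALMDALE CA 93552"))
# 38350 40TH ST E 100 PALMDALE CA 93552
```

The result could then be passed to normalize_address_record(). Because the rule is positional, a street that is literally named after a directional and follows a street type would also be rewritten, so this should be checked against real data before relying on it.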
Please let me know if you have any questions about this. Thank you in advance for your help troubleshooting this!