datamade / usaddress

:us: a python library for parsing unstructured United States address strings into address components
https://parserator.datamade.us/usaddress
MIT License

Address Parsing Errors #278

Open MrFuguDataScience opened 4 years ago

MrFuguDataScience commented 4 years ago

When I parse a large address book generated with the Python Faker library, usaddress raises an error on some of the addresses. Sometimes PO Box addresses fail, and sometimes Faker produces unusual addresses. Small datasets parse fine, but once I get to around 6,000 addresses I start running into problems.

Here is what I have as output:

-------------------------------------------

RepeatedLabelError: ERROR: Unable to tag this string because more than one area of the string has the same label

ORIGINAL STRING: PSC 7500, Box 2471 APO AE 53806
PARSED TOKENS: [('PSC', 'USPSBoxType'), ('7500,', 'USPSBoxID'), ('Box', 'USPSBoxType'), ('2471\n', 'USPSBoxID'), ('APO', 'PlaceName'), ('AE', 'StateName'), ('53806', 'ZipCode')]
UNCERTAIN LABEL: USPSBoxType

When this error is raised, it's likely that either (1) the string is not a valid person/corporation name or (2) some tokens were labeled incorrectly

To report an error in labeling a valid name, open an issue at https://github.com/datamade/usaddress/issues/new - it'll help us continue to improve probablepeople!
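
For anyone trying to see why this particular string fails, here is a minimal sketch of the kind of check that seems to be behind the message: a label that shows up again after a different label has intervened counts as "more than one area of the string". The helper name `first_repeated_label` is hypothetical and this is not the library's actual code; it only uses the parsed tokens copied from the error output above.

```python
def first_repeated_label(parsed_tokens):
    """Return the first label that starts a second, separate run of tokens.

    Hypothetical helper for illustration only; it mimics the behaviour
    described by the error message, not usaddress internals.
    """
    seen = set()
    last_label = None
    for _token, label in parsed_tokens:
        if label == last_label:
            # Still inside the same contiguous run, so nothing to check.
            continue
        if label in seen:
            # The label was used earlier, then interrupted by another label.
            return label
        seen.add(label)
        last_label = label
    return None


# Tokens copied verbatim from the PARSED TOKENS line in the error output.
tokens = [
    ("PSC", "USPSBoxType"),
    ("7500,", "USPSBoxID"),
    ("Box", "USPSBoxType"),
    ("2471\n", "USPSBoxID"),
    ("APO", "PlaceName"),
    ("AE", "StateName"),
    ("53806", "ZipCode"),
]

print(first_repeated_label(tokens))  # prints: USPSBoxType
```

In this military-style address both "PSC" and "Box" are labeled USPSBoxType with a USPSBoxID between them, which matches the UNCERTAIN LABEL reported. If the token-level labels are enough, `usaddress.parse` returns the list of (token, label) pairs without merging them, so it may be a workable fallback for strings where `usaddress.tag` raises `RepeatedLabelError`.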