datamade / usaddress

A Python library for parsing unstructured United States address strings into address components
https://parserator.datamade.us/usaddress
MIT License

Uncertain Label Error: Potentially Valid Address Format #313

Open Baugus opened 2 years ago

Baugus commented 2 years ago

I ran into the "UNCERTAIN LABEL" error on a case where I do not understand why it is choking, and per the guidance, I am submitting an issue. (I have never submitted an issue before, so if I am doing this wrong, please let me know!)

Two addresses:

1) RT 2 BX 565 CRYSTAL SPRINGS MS 390590000 (I know the trailing zeros are not great, still just experimenting.)

2) RT 4 BX 9 WESSON MS 391910000

The first address goes through .parse and .tag just fine:

parse = "[('RT', 'USPSBoxGroupType'), ('2', 'USPSBoxGroupID'), ('BX', 'USPSBoxType'), ('565,', 'USPSBoxID'), ('CRYSTAL', 'PlaceName'), ('SPRINGS,', 'PlaceName'), ('MS,', 'StateName'), ('390590000', 'ZipCode')]"

tag = "(OrderedDict([('USPSBoxGroupType', 'RT'), ('USPSBoxGroupID', '2'), ('USPSBoxType', 'BX'), ('USPSBoxID', '565'), ('PlaceName', 'CRYSTAL SPRINGS'), ('StateName', 'MS'), ('ZipCode', '390590000')]), 'PO Box')"

The second address appears to parse just fine, but throws the UNCERTAIN LABEL error on .tag:

parse = "[('RT', 'USPSBoxGroupType'), ('4', 'USPSBoxGroupID'), ('BX', 'USPSBoxGroupType'), ('9,', 'USPSBoxID'), ('WESSON,', 'PlaceName'), ('MS,', 'StateName'), ('391910000', 'ZipCode')]"

tag = "ERROR: Unable to tag this string because more than one area of the string has the same label

ORIGINAL STRING: RT 4 BX 9, WESSON, MS, 391910000

PARSED TOKENS: [('RT', 'USPSBoxGroupType'), ('4', 'USPSBoxGroupID'), ('BX', 'USPSBoxGroupType'), ('9,', 'USPSBoxID'), ('WESSON,', 'PlaceName'), ('MS,', 'StateName'), ('391910000', 'ZipCode')]

UNCERTAIN LABEL: USPSBoxGroupType

When this error is raised, it's likely that either (1) the string is not a valid person/corporation name or (2) some tokens were labeled incorrectly

To report an error in labeling a valid name, open an issue at https://github.com/datamade/usaddress/issues/new - it'll help us continue to improve probablepeople!"

I can supply several other instances of pairs like this in the same "RT . . . BX" format.

This really is not an issue for my use case, but I saw the request to report errors in labeling (and wasn't entirely sure why one was tagging and the other was not) so I thought I would submit it, just in case it helps you guys out.
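For what it is worth, the visible difference between the two parses is the label on 'BX': 'USPSBoxType' in the first address, 'USPSBoxGroupType' in the second, where 'RT' already carries that label. The sketch below is a simplified stand-in for what .tag appears to do, not the library's actual code. It assumes that consecutive tokens sharing a label are merged and that a label reappearing after a different label is treated as an error. The function name `merge_labels` and the use of `ValueError` are hypothetical. The token lists are copied from the output above.

```python
from collections import OrderedDict


def merge_labels(parsed):
    """Merge consecutive tokens that share a label.

    Hypothetical helper: raises ValueError when a label shows up again
    after a different label has intervened (assumed .tag behavior).
    """
    tagged = OrderedDict()
    last_label = None
    for token, label in parsed:
        token = token.strip(",")  # drop the trailing commas seen in the tokens
        if label == last_label:
            tagged[label].append(token)   # same run, keep merging
        elif label not in tagged:
            tagged[label] = [token]       # first time this label appears
        else:
            raise ValueError("repeated label: " + label)
        last_label = label
    return OrderedDict((lbl, " ".join(toks)) for lbl, toks in tagged.items())


# Token lists copied from the .parse output in this issue.
first = [('RT', 'USPSBoxGroupType'), ('2', 'USPSBoxGroupID'),
         ('BX', 'USPSBoxType'), ('565,', 'USPSBoxID'),
         ('CRYSTAL', 'PlaceName'), ('SPRINGS,', 'PlaceName'),
         ('MS,', 'StateName'), ('390590000', 'ZipCode')]

second = [('RT', 'USPSBoxGroupType'), ('4', 'USPSBoxGroupID'),
          ('BX', 'USPSBoxGroupType'), ('9,', 'USPSBoxID'),
          ('WESSON,', 'PlaceName'), ('MS,', 'StateName'),
          ('391910000', 'ZipCode')]

print(merge_labels(first))
try:
    merge_labels(second)
except ValueError as err:
    print(err)
```

Under these assumptions the first list merges cleanly because 'CRYSTAL' and 'SPRINGS,' are adjacent, while the second fails on 'BX' because 'USPSBoxGroupType' was already used for 'RT' with 'USPSBoxGroupID' in between. If that reading is right, the underlying problem is the model's label for 'BX' in the second address, not the tagging step itself.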

Let me know if you need more info/data.