datamade / usaddress

:us: a python library for parsing unstructured United States address strings into address components
https://parserator.datamade.us/usaddress
MIT License

Depreciating performance for company recipients #153

Open jeancochrane opened 7 years ago

jeancochrane commented 7 years ago

Erica Kim notified us over email that the web app (v0.5.5) was performing better for parsing certain company names than the current version (v0.5.8). She's going to keep an eye out for more failures, but so far here's what has failed:

In [4]: usaddress.parse("'GALLY' International Biomedical Research Consulting LLC, 7733 Louis Pasteur Drive, #330. San Antonio")
Out[4]:
[("GALLY'", 'Recipient'),
 ('International', 'Recipient'),
 ('Biomedical', 'Recipient'),
 ('Research', 'Recipient'),
 ('Consulting', 'Recipient'),
 ('LLC,', 'Recipient'),
 ('7733', 'AddressNumber'),
 ('Louis', 'StreetName'),
 ('Pasteur', 'StreetName'),
 ('Drive,', 'StreetNamePostType'),
 ('#', 'Recipient'),
 ('330.', 'Recipient'),
 ('San', 'Recipient'),
 ('Antonio', 'Recipient')]
In [5]: usaddress.parse('New Mexico Department of Game and Fish, One Wildlife Way')
Out[5]:
[('New', 'Recipient'),
 ('Mexico', 'Recipient'),
 ('Department', 'Recipient'),
 ('of', 'Recipient'),
 ('Game', 'Recipient'),
 ('and', 'Recipient'),
 ('Fish,', 'Recipient'),
 ('One', 'Recipient'),
 ('Wildlife', 'Recipient'),
 ('Way', 'Recipient')]

Clearly, it's getting confused about who exactly is the Recipient in these addresses.

I'm curious whether our addition of a bunch of new PO Box data in v0.5.8 using "departments" as SubAddressTypes is confusing the parser here. But then again, when I complete the address with PlaceName, StateName, and ZipCode, the performance improves:

In [7]: usaddress.parse("'GALLY' International Biomedical Research Consulting LLC, 7733 Louis Pasteur Drive, #330. San Antonio TX 78229")
Out[7]:
[("GALLY'", 'Recipient'),
 ('International', 'Recipient'),
 ('Biomedical', 'Recipient'),
 ('Research', 'Recipient'),
 ('Consulting', 'Recipient'),
 ('LLC,', 'Recipient'),
 ('7733', 'AddressNumber'),
 ('Louis', 'StreetName'),
 ('Pasteur', 'StreetName'),
 ('Drive,', 'StreetNamePostType'),
 ('#', 'OccupancyIdentifier'),
 ('330.', 'OccupancyIdentifier'),
 ('San', 'PlaceName'),
 ('Antonio', 'PlaceName'),
 ('TX', 'StateName'),
 ('78229', 'ZipCode')]
In [6]: usaddress.parse('New Mexico Department of Game and Fish, One Wildlife Way, Santa Fe NM 87507')
Out[6]:
[('New', 'Recipient'),
 ('Mexico', 'Recipient'),
 ('Department', 'Recipient'),
 ('of', 'Recipient'),
 ('Game', 'Recipient'),
 ('and', 'Recipient'),
 ('Fish,', 'Recipient'),
 ('One', 'AddressNumber'),
 ('Wildlife', 'StreetName'),
 ('Way,', 'StreetNamePostType'),
 ('Santa', 'PlaceName'),
 ('Fe', 'PlaceName'),
 ('NM', 'StateName'),
 ('87507', 'ZipCode')] 
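
To make the difference between the partial and complete 'GALLY' parses easier to see, here is a rough sketch that lines the two outputs up position by position and reports the tokens whose label changed. The helper name `label_changes` is made up for illustration; the tagged lists are copied from the outputs above rather than produced by calling usaddress.

```python
# Tagged output copied from Out[4] (address without city/state/zip).
partial = [
    ("GALLY'", 'Recipient'), ('International', 'Recipient'),
    ('Biomedical', 'Recipient'), ('Research', 'Recipient'),
    ('Consulting', 'Recipient'), ('LLC,', 'Recipient'),
    ('7733', 'AddressNumber'), ('Louis', 'StreetName'),
    ('Pasteur', 'StreetName'), ('Drive,', 'StreetNamePostType'),
    ('#', 'Recipient'), ('330.', 'Recipient'),
    ('San', 'Recipient'), ('Antonio', 'Recipient'),
]

# Tagged output copied from Out[7] (same address plus "TX 78229").
complete = [
    ("GALLY'", 'Recipient'), ('International', 'Recipient'),
    ('Biomedical', 'Recipient'), ('Research', 'Recipient'),
    ('Consulting', 'Recipient'), ('LLC,', 'Recipient'),
    ('7733', 'AddressNumber'), ('Louis', 'StreetName'),
    ('Pasteur', 'StreetName'), ('Drive,', 'StreetNamePostType'),
    ('#', 'OccupancyIdentifier'), ('330.', 'OccupancyIdentifier'),
    ('San', 'PlaceName'), ('Antonio', 'PlaceName'),
    ('TX', 'StateName'), ('78229', 'ZipCode'),
]


def label_changes(before, after):
    """Hypothetical helper: list (token, old_label, new_label) for each
    position where both parses have the same token but a different label.
    Trailing tokens present in only one parse are ignored by zip()."""
    changes = []
    for (tok_a, lab_a), (tok_b, lab_b) in zip(before, after):
        if tok_a == tok_b and lab_a != lab_b:
            changes.append((tok_a, lab_a, lab_b))
    return changes


for token, old, new in label_changes(partial, complete):
    print(f"{token!r}: {old} -> {new}")
```

Only the four tokens after the street (`#`, `330.`, `San`, `Antonio`) change label, which suggests the street portion is parsed the same way in both cases and the trouble is confined to the tail of the string.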

Something to look into!

firefly454 commented 7 years ago

I found another example:

"Department of Physics and Astronomy, West Virginia University, White Hall, Box 6315"

in v0.5.8: [(u'Department', 'Recipient'), (u'of', 'Recipient'), (u'Physics', 'Recipient'), (u'and', 'Recipient'), (u'Astronomy,', 'Recipient'), (u'West', 'Recipient'), (u'Virginia', 'Recipient'), (u'University,', 'Recipient'), (u'White', 'Recipient'), (u'Hall,', 'USPSBoxType'), (u'Box', 'USPSBoxType'), (u'6315', 'USPSBoxID')]

in v0.5.5: 'White' and 'Hall,' both get tagged correctly as Recipient (or maybe this should be BuildingName), but it's interesting that in v0.5.8 "White Hall" gets split apart.

jeancochrane commented 7 years ago

Thanks @firefly454! Out of curiosity, is this the full address as it shows up in your data? I'm wondering how usaddress would perform if it had placename/statename tokens in the address.

firefly454 commented 7 years ago

Hi @jeancochrane,

Yes, that's the full string in my data. I'm actually using usaddress to clean up academic paper author affiliations that previously got improperly parsed. Usually they look something like "Department of Chemistry, Northwestern University", but other times I'll get "Department of Civil and Environmental Engineering, Northwestern University, 2145 Sheridan Road". I use usaddress to tag the substrings and then remove the ones that are not tagged Recipient, PlaceName, or BuildingName.
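
A minimal sketch of that filtering step, not the actual code in use: the function name `keep_affiliation` and the `KEEP` set are made up for illustration, and the tagged list is the v0.5.8 output posted earlier in this thread rather than a fresh call to usaddress.

```python
# Labels to keep when cleaning an affiliation string (assumed from the
# description above; adjust as needed).
KEEP = {"Recipient", "PlaceName", "BuildingName"}


def keep_affiliation(tagged, keep=KEEP):
    """Hypothetical helper: join the tokens whose label is in `keep`,
    dropping everything else (street numbers, box IDs, and so on)."""
    return " ".join(token for token, label in tagged if label in keep)


# In real use this would be: tagged = usaddress.parse(text)
# Here it is the v0.5.8 output for the West Virginia University example.
tagged = [
    ("Department", "Recipient"), ("of", "Recipient"),
    ("Physics", "Recipient"), ("and", "Recipient"),
    ("Astronomy,", "Recipient"), ("West", "Recipient"),
    ("Virginia", "Recipient"), ("University,", "Recipient"),
    ("White", "Recipient"), ("Hall,", "USPSBoxType"),
    ("Box", "USPSBoxType"), ("6315", "USPSBoxID"),
]

print(keep_affiliation(tagged))
```

With the v0.5.8 labels this keeps "White" but drops "Hall,", so the cleaned string ends in a dangling "White". That is why the split matters for this use case.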

jeancochrane commented 7 years ago

How have things been performing recently @firefly454?