Open jeancochrane opened 7 years ago
I found another example:
"Department of Physics and Astronomy, West Virginia University, White Hall, Box 6315"
in v0.5.8: [(u'Department', 'Recipient'), (u'of', 'Recipient'), (u'Physics', 'Recipient'), (u'and', 'Recipient'), (u'Astronomy,', 'Recipient'), (u'West', 'Recipient'), (u'Virginia', 'Recipient'), (u'University,', 'Recipient'), (u'White', 'Recipient'), (u'Hall,', 'USPSBoxType'), (u'Box', 'USPSBoxType'), (u'6315', 'USPSBoxID')]
in v0.5.5: both tokens of "White Hall" get tagged correctly as Recipient (or maybe this should be BuildingName), but it's interesting that in v0.5.8 "White Hall" gets split apart.
Thanks @firefly454! Out of curiosity, is this the full address as it shows up in your data? I'm wondering how usaddress would perform if it had placename/statename tokens in the address.
Hi @jeancochrane ,
Yes, that's the full string in my data. I'm actually using usaddress to clean up academic paper author affiliations that previously got improperly parsed. Usually they are something like "Department of Chemistry, Northwestern University", but other times I'll get "Department of Civil and Environmental Engineering, Northwestern University, 2145 Sheridan Road". I use usaddress to tag the substrings and remove the ones that are not "recipients", "placename", or "buildingname".
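For anyone following along, here is a rough sketch of the filtering step described above. It is not the actual cleanup script: the helper name `keep_affiliation_tokens` and the `KEEP_LABELS` set are made up for illustration, and the input is the v0.5.8 output quoted earlier in this thread rather than a live call to `usaddress.parse`.

```python
# Labels to keep. The names follow the usaddress tags mentioned in this thread;
# the set itself is an illustrative assumption, not part of usaddress.
KEEP_LABELS = {"Recipient", "PlaceName", "BuildingName"}


def keep_affiliation_tokens(tagged, keep_labels=KEEP_LABELS):
    """Join the tokens whose label is in keep_labels and drop the rest.

    `tagged` is a list of (token, label) pairs, the shape returned by
    usaddress.parse(address_string).
    """
    kept = [token for token, label in tagged if label in keep_labels]
    # Strip a trailing comma left over from the last kept token, if any.
    return " ".join(kept).rstrip(",")


# v0.5.8 output quoted earlier in this thread.
tagged_v058 = [
    ("Department", "Recipient"), ("of", "Recipient"), ("Physics", "Recipient"),
    ("and", "Recipient"), ("Astronomy,", "Recipient"), ("West", "Recipient"),
    ("Virginia", "Recipient"), ("University,", "Recipient"), ("White", "Recipient"),
    ("Hall,", "USPSBoxType"), ("Box", "USPSBoxType"), ("6315", "USPSBoxID"),
]

cleaned = keep_affiliation_tokens(tagged_v058)
print(cleaned)
```

With the v0.5.8 tags, "Hall," is dropped along with the box tokens, so the cleaned affiliation ends in a dangling "White". That is why the split matters for this use case.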
How have things been performing recently @firefly454?
Erica Kim notified us over email that the web app (v0.5.5) was performing better for parsing certain company names than the current version (v0.5.8). She's going to keep an eye out for more failures, but so far here's what has failed:
Clearly, it's getting confused about who exactly is the Recipient in these addresses. I'm curious whether our addition of a bunch of new PO Box data in v0.5.8 using "departments" as SubAddressTypes is confusing the parser here. But then again, when I complete the address with PlaceName, StateName, and ZipCode, the performance improves:
Something to look into!
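One way to check the "completed address" effect systematically is to parse both forms and list the tokens whose labels change. The sketch below is only a suggestion: `changed_labels` is a hypothetical helper, and the parser is passed in as an argument so that `usaddress.parse` (or a specific installed version of it) can be supplied by the caller.

```python
def changed_labels(parse, bare, completed):
    """Return (token, bare_label, completed_label) for each token whose label
    differs between the bare and the completed address.

    Assumes `completed` starts with `bare`, so the leading tokens line up
    position by position. `parse` is any callable returning a list of
    (token, label) pairs, e.g. usaddress.parse.
    """
    changes = []
    for (tok_a, lab_a), (tok_b, lab_b) in zip(parse(bare), parse(completed)):
        # Skip positions where the token text itself differs, for example
        # when a comma is appended to the last token of the bare address.
        if tok_a == tok_b and lab_a != lab_b:
            changes.append((tok_a, lab_a, lab_b))
    return changes


# Intended usage (requires usaddress to be installed; the completed address
# is whatever PlaceName/StateName/ZipCode you append in your own data):
#
#   import usaddress
#   changed_labels(usaddress.parse, bare_address, completed_address)
```

Running this over the failing examples under both v0.5.5 and v0.5.8 would show which tokens flip between Recipient and the PO Box labels.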