Open BenGalewsky opened 8 years ago
Hey @BenGalewsky,
Thanks for filing this! Glad to hear usaddress is generally serving you well.
Some of those errors are better handled on the data cleaning side, I think. In particular, tagging "FL" as "floor" could be dangerous (certain southeastern states may not be pleased with the result) unless we had a whole bunch of training data to help the model figure out context. Did you have success with Python?
I'd be glad to address the other errors by bringing in new training data. I'm particularly interested in handling trailers, lots, and unconventional intersection types. Did you wind up building your own training data for these errors? Go ahead and give it a shot following our build instructions and submit a PR when you're ready. If you're unsure how to tag something, follow the guidelines in our docs (or consult the official data standard), or just drop me a line. Don't hesitate to @ me!
Some people on our team have expressed an interest in re-training the model - I suspect that is what will happen.
I've mostly been giving USAddress just the street portion (and keeping city and state on the side so as not to confuse its little brain) - so FL will never mean anything other than Floor for us, but I appreciate that we want USAddress to be useful for everyone.
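For the street-only workflow described above, a cleanup step on the data side might look something like this. It is a minimal sketch, not part of usaddress: the helper name `expand_floor` and the regular expression are assumptions for illustration. It only rewrites a standalone `FL` token that is directly followed by a number, so forms like "2ND FL" would need a separate rule.

```python
import re

# Hypothetical pre-cleaning helper (not part of usaddress). It assumes the
# input is only the street portion of an address, so a standalone "FL"
# token followed by a number cannot be the Florida state abbreviation.
_FL_BEFORE_NUMBER = re.compile(r"\bFL\.?(?=\s+\d)", re.IGNORECASE)


def expand_floor(street: str) -> str:
    """Rewrite 'FL 2' as 'FLOOR 2'; leave everything else unchanged."""
    return _FL_BEFORE_NUMBER.sub("FLOOR", street)


print(expand_floor("123 MAIN ST FL 2"))
print(expand_floor("55 FLORIDA AVE"))
```

Running the expansion before handing the string to the parser keeps the state-abbreviation ambiguity out of the model entirely, which only makes sense if city and state really are stored separately.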
Great! Keep us posted on how your training goes.
Hi, I'm looking at this, but I can't tell the difference between 'occupancy type' and 'subaddress type'. I looked at the recommended definition but can't find an OccupancyType element defined anywhere; it is only mentioned in section 2.2.4.1 as a synonym for 'Subaddress Type'. Do you all have some rules of thumb? (Thanks!)
I believe that occupancy type is for things like "Apartment", "Suite", "Unit", or whatever. It goes with Occupancy Identifier. There seems to be an affinity in some voter files to identify people by their "Floor" or "FL"; those seem like they would be occupancy identifiers.
Ah! OK, so 'Building', 'Trailer', and other identifiers of single physical objects = 'subaddress type', and 'Apartment', 'Suite', 'Unit', which subdivide an object = 'occupancy type'. I'll go with that.
This is actually a really interesting question. It's making me think we should probably have a specific policy to describe the distinction between occupancies and subaddresses. (There may be one that I'm not currently aware of.)
@tanyaschlusser, I also couldn't find an explicit explanation of the difference between SubaddressType and OccupancyType in the docs you linked – according to the Standard, occupancy would appear to be a subcategory of subaddress. This is certainly confusing, especially because usaddress currently tends to mark the types of tokens that @BenGalewsky mentioned as OccupancyTypes. Occupancy is clearly our preferred category, at least at this point in usaddress' life. This may not follow the letter of the Standard (although again, I'm not an expert) but it seems to work fine in most cases.
There are two problems that arise from this setup, however. For one thing, we may not be in keeping with the Address Standard: if page 74 of the URISA docs is right, then SubaddressType may be the preferred designation for "Apartment, Suite, Room, Unit, Office" – all of which we now classify as OccupancyType. But the second (and more pressing, to me) problem is that addresses frequently contain two pairs of tokens that match our OccupancyType pattern (e.g. "Apartment A Floor 2", "Trailer 5 Lot 4") which raises an error with the tag method. To fix this problem, I've been training data to maintain the preference for OccupancyType, but to label additional physical specifications as SubaddressTypes; but depending on our occupancy/subaddress policy, it may make more sense to simply tweak tag to allow multiple OccupancyTypes.
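To make the failure mode concrete, here is a simplified stand-in for the step that collapses labeled tokens into one value per label. It is a sketch under assumptions, not the actual usaddress source: `merge_labels`, `RepeatedLabel`, and the hand-written token labels are all hypothetical, and the `allow_repeats` parameter only illustrates what "allow multiple OccupancyTypes" could mean.

```python
from collections import OrderedDict


class RepeatedLabel(Exception):
    """Hypothetical stand-in for usaddress.RepeatedLabelError."""


def merge_labels(labeled_tokens, allow_repeats=()):
    """Collapse (token, label) pairs into one value per label.

    Consecutive tokens with the same label are joined with a space. A label
    that reappears after a different label raises RepeatedLabel, unless it
    is listed in allow_repeats, in which case values are joined with ', '.
    """
    merged = OrderedDict()
    previous = None
    for token, label in labeled_tokens:
        if label == previous:
            merged[label] += " " + token
        elif label not in merged:
            merged[label] = token
        elif label in allow_repeats:
            merged[label] += ", " + token
        else:
            raise RepeatedLabel(label)
        previous = label
    return merged


# Hand-labeled for illustration; not real model output.
tokens = [
    ("Apartment", "OccupancyType"), ("A", "OccupancyIdentifier"),
    ("Floor", "OccupancyType"), ("2", "OccupancyIdentifier"),
]

try:
    merge_labels(tokens)
except RepeatedLabel as err:
    print("repeated label:", err)

relaxed = merge_labels(
    tokens, allow_repeats=("OccupancyType", "OccupancyIdentifier")
)
print(dict(relaxed))
```

The training-data approach sidesteps the same error from the other direction: if the second pair is labeled SubaddressType and SubaddressIdentifier, no label repeats and the strict merge succeeds.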
Curious to hear @fgregg weigh in on this.
Hi jeancochrane, I'm very interested in using this to parse addresses, but I'm running into an issue. If I pass in the address "05 St. George Street, Toronto, ON, M5S 3E6, Canada", the result is perfect except that the country shows up in Placename. I need the country name in the output as well, so how can I do this? Kindly advise.
Hey @Surendra414,
Unfortunately, usaddress currently doesn't support addresses outside of the US. That's probably leading to the bad parse you see here. There are some ideas for major changes that might expand our scope, but it's not currently in the works. For international parsing you could try pypostal, which has much heavier dependencies but might provide better performance for Canadian addresses.
I'm trying to process a large dataset of Ohio addresses. 99% of them are processed perfectly. Thank you for this great resource.
There are still a substantial number of addresses that throw a RepeatedLabelError exception. They fall into a few categories:
Some are just messes and I doubt we will ever be able to parse them.
Can you look over these examples and see if we can tweak the parser to accept some of them? I plan on adding rules to my Python script to do some translation or cleanup to try to address some of the common data quality issues.
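One way to triage a batch like this is to separate the addresses that tag cleanly from the ones that raise, so the failures can be grouped and cleaned up by rule. The sketch below is only an illustration: the `triage` helper and its `tagger` and `error` parameters are hypothetical names, and the tagger is passed in so the helper does not depend on any particular parser.

```python
def triage(addresses, tagger, error):
    """Split addresses into (tagged, failed) using the supplied tagger.

    tagger: callable that returns a tagged result for one address string.
    error:  exception type that marks an address as untaggable.
    """
    tagged, failed = [], []
    for address in addresses:
        try:
            tagged.append((address, tagger(address)))
        except error:
            failed.append(address)
    return tagged, failed


# With usaddress installed, the call would presumably look like:
#   import usaddress
#   tagged, failed = triage(rows, usaddress.tag, usaddress.RepeatedLabelError)
```

The `failed` list is then the input for whatever translation or cleanup rules get written, and it can be re-run through `triage` after each rule to see how many addresses it rescued.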