datamade / usaddress

:us: a python library for parsing unstructured United States address strings into address components
https://parserator.datamade.us/usaddress
MIT License

Issues with some rural Ohio Addresses #132

Open BenGalewsky opened 8 years ago

BenGalewsky commented 8 years ago

I'm trying to process a large dataset of Ohio addresses. 99% of them are processed perfectly. Thank you for this great resource.

There are still a substantial number of addresses that throw a RepeatedLabelError exception. They fall into a few categories:

  1. Directional words inside the street name: "Big Run South Rd"
  2. The use of "FL" as Floor for some occupancy type
  3. Lot as an occupancy type
  4. Trailer parks
  5. Some interesting intersection addresses: "100 DELAWARE XING W 140-160 APT 144"
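
A rough way to triage failing addresses into the categories above is a handful of regular expressions. This is only a sketch: the category names and patterns are illustrative assumptions, not anything usaddress provides, and they will miss cases such as abbreviated directionals ("W", "FT") inside a street name.

```python
import re

# Hypothetical patterns, one per category in the list above. Tune for real data.
CATEGORY_PATTERNS = {
    # A spelled-out directional directly before a street type: "BIG RUN SOUTH RD"
    "directional_in_street_name": re.compile(
        r"\b(?:NORTH|SOUTH|EAST|WEST)\s+(?:RD|ST|AVE|DR)\b"
    ),
    "fl_as_floor": re.compile(r"\bFL\b"),
    "lot": re.compile(r"\bLOT\b"),
    "trailer": re.compile(r"\b(?:TRLR|TRL)\b"),
    "intersection": re.compile(r"\bXING\b|&"),
}


def categorize(address):
    """Return the names of all categories whose pattern matches the address."""
    text = address.upper()
    return [name for name, pattern in CATEGORY_PATTERNS.items() if pattern.search(text)]


if __name__ == "__main__":
    for line in ["3550 BIG RUN SOUTH RD", "225 BOSLEY ST APT 1ST FL"]:
        print(line, "->", categorize(line))
```

Addresses that match no pattern return an empty list and still need a manual look.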

Some are just messes and I doubt we will ever be able to parse them.

Can you look over these examples and see if we can tweak the parser to accept some of them? I plan on adding rules to my Python script to do some translation or cleanup to try to address some of the common data quality issues.

28 TOWNSHIP ROAD 281 LOT 23
1111 STATE ROUTE 133 LOT 44
3925 N RIDGE RD E LOT 99
225 BOSLEY ST APT 1ST FL

206 P D 2555 CO RD 70 
20 P D 9136 CO RD 1 APT 3
236 TWP RD 279E LOT 3
27 P D 72 TWP RD 510N APT 4
151 P R 80 TWP RD 1076 UNIT C
241-C TWP RD 1430 APT 12
4201 COUNTY ROAD 220 APT 9

8030 DEEPWOOD BV BLDG H-14
5782 ANDREWS RD (BLDG A B G H I) BLDG G-103

8911 LESOURDSVILLE W CHESTER RD 
14723 MOULTON FT AMANDA RD 
3490 PERU W SECTION LINE RD 
3550 BIG RUN SOUTH RD 
5210 MEADOW RD NE HALF
556 EATON FT NESBIT RD 
5292 BIG RUN SOUTH RD 
93 JOHN W BARBEE RD 
4011/2 BEECH ST APT 1ST FL

100 DELAWARE XING W 140-160 APT 144
360 TWELFTH ST NW LOT S
1245 U S 52 APT A
13925 U S 22 & S R 3 E 

28 E UNIVERSITY AVE APT 2 FL 2
4805 TOWNSHIP ROAD 366 UNIT 131
844 N CLINTON ST LOT B LOT C26
1526 S GREENWICH E TWNLI 79 RD 
1068 1/2 MT PLEASANT AVE 

1701 W ROBB AVE L1 TRLR 1
26720 WHITEWAY DR F APT 104
921 ODNR MOHICAN #51 APT 51
11323 LEBANON RD APT TRL 13
844 N CLINTON ST LOT B LOT B 26

jeancochrane commented 7 years ago

Hey @BenGalewsky,

Thanks for filing this! Glad to hear usaddress is generally serving you well.

Some of those errors are better handled on the data cleaning side, I think. In particular, tagging "FL" as "floor" could be dangerous (certain southeastern states may not be pleased with the result) unless we had a whole bunch of training data to help the model figure out context. Did you have success with your Python cleanup script?
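
As an illustration of handling "FL" on the cleaning side, here is a minimal sketch that rewrites "FL" to "FLOOR" only when it sits next to something that looks like a floor number, and leaves it alone in front of a five-digit ZIP code. The function name and both patterns are assumptions made for this example. They are not part of usaddress.

```python
import re

# Assumption: "FL" means floor only when adjacent to a floor number.
# "1ST FL", "2ND FL": an ordinal followed by FL.
FLOOR_AFTER_ORDINAL = re.compile(r"\b(\d+(?:ST|ND|RD|TH))\s+FL\b", re.IGNORECASE)
# "FL 2": FL followed by a number of at most three digits, so a ZIP code
# such as "FL 33101" does not match and stays the state abbreviation.
FLOOR_BEFORE_NUMBER = re.compile(r"\bFL\s+(\d{1,3})\b", re.IGNORECASE)


def expand_floor(address):
    """Spell out FL as FLOOR where the context suggests a floor, not Florida."""
    address = FLOOR_AFTER_ORDINAL.sub(r"\1 FLOOR", address)
    address = FLOOR_BEFORE_NUMBER.sub(r"FLOOR \1", address)
    return address


if __name__ == "__main__":
    print(expand_floor("225 BOSLEY ST APT 1ST FL"))
    print(expand_floor("100 MAIN ST, MIAMI, FL 33101"))
```

Whether the parser then labels "FLOOR" the way you want still depends on the training data, so this only removes the Florida ambiguity from the input.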

I'd be glad to address the other errors by bringing in new training data. I'm particularly interested in handling trailers, lots, and unconventional intersection types. Did you wind up building your own training data for these errors? Go ahead and give it a shot following our build instructions and submit a PR when you're ready. If you're unsure how to tag something, follow the guidelines in our docs (or consult the official data standard), or just drop me a line. Don't hesitate to @ me!

BenGalewsky commented 7 years ago

Some people on our team have expressed an interest in re-training the model - I suspect that is what will happen.

I've mostly been giving USAddress just the street portion (and keeping city and state on the side so as not to confuse its little brain) - so FL will never mean anything other than Floor for us, but I appreciate that we want USAddress to be useful for everyone.

jeancochrane commented 7 years ago

Great! Keep us posted on how your training goes.

tanyaschlusser commented 7 years ago

Hi, I'm looking at this, but I can't tell the difference between 'occupancy type' and 'subaddress type'. I looked at the recommended definition but can't find an OccupancyType element defined anywhere; it is only mentioned in section 2.2.4.1 as a synonym for 'Subaddress Type'. Do you all have some rules of thumb? (thanks!)

BenGalewsky commented 7 years ago

I believe that occupancy type is for things like "Apartment", "Suite", "Unit", or whatever. It goes with Occupancy Identifier. There seems to be an affinity in some voter files to identify people by their "Floor" or "FL"; those seem like they would be occupancy identifiers.

tanyaschlusser commented 7 years ago

Ah! OK, so 'Building', 'Trailer', and other identifiers of single physical objects = 'subaddress type', and 'Apartment', 'Suite', 'Unit', which subdivide an object = 'occupancy type'. I'll go with that.

jeancochrane commented 7 years ago

This is actually a really interesting question. It's making me think we should probably have a specific policy to describe the distinction between occupancies and subaddresses. (There may be one that I'm not currently aware of.)

@tanyaschlusser, I also couldn't find an explicit explanation of the difference between SubaddressType and OccupancyType in the docs you linked – according to the Standard, occupancy would appear to be a subcategory of subaddress. This is certainly confusing, especially because usaddress currently tends to mark the types of tokens that @BenGalewsky mentioned as OccupancyTypes. Occupancy is clearly our preferred category, at least at this point in usaddress' life. This may not follow the letter of the Standard (although again, I'm not an expert) but it seems to work fine in most cases.

There are two problems that arise from this setup, however. For one thing, we may not be in keeping with the Address Standard: if page 74 of the URISA docs is right, then SubaddressType may be the preferred designation for "Apartment, Suite, Room, Unit, Office", all of which we now classify as OccupancyType. But the second (and more pressing, to me) problem is that addresses frequently contain two pairs of tokens that match our OccupancyType pattern (e.g. "Apartment A Floor 2", "Trailer 5 Lot 4"), which raises an error with the tag method. To fix this problem, I've been labeling training data to maintain the preference for OccupancyType, but to label additional physical specifications as SubaddressTypes; but depending on our occupancy/subaddress policy, it may make more sense to simply tweak tag to allow multiple OccupancyTypes.
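
To make the second option concrete, here is a minimal sketch of collapsing labeled tokens into a dict while tolerating repeated occupancy labels. It is not the actual implementation of tag: the function name, the `repeatable` parameter, and the hand-labeled input are assumptions for illustration. The input is a list of (token, label) pairs shaped like what usaddress.parse returns.

```python
# Labels allowed to appear in more than one non-adjacent run (an assumption).
REPEATABLE = ("OccupancyType", "OccupancyIdentifier")


def tag_allowing_repeats(labeled_tokens, repeatable=REPEATABLE):
    """Collapse (token, label) pairs into a dict keyed by label.

    Adjacent tokens with the same label are joined with a space. A label
    that reappears after a different label is an error unless it is listed
    in `repeatable`, in which case its value becomes a list of runs.
    """
    groups = {}  # label -> list of token runs, in order of first appearance
    previous_label = None
    for token, label in labeled_tokens:
        if label == previous_label:
            groups[label][-1] += " " + token
        elif label not in groups:
            groups[label] = [token]
        elif label in repeatable:
            groups[label].append(token)
        else:
            raise ValueError(f"label {label!r} repeated in non-adjacent positions")
        previous_label = label
    return {label: runs[0] if len(runs) == 1 else runs for label, runs in groups.items()}


if __name__ == "__main__":
    # Hand-labeled example, not actual model output.
    tokens = [
        ("28", "AddressNumber"),
        ("E", "StreetNamePreDirectional"),
        ("UNIVERSITY", "StreetName"),
        ("AVE", "StreetNamePostType"),
        ("APT", "OccupancyType"),
        ("2", "OccupancyIdentifier"),
        ("FL", "OccupancyType"),
        ("2", "OccupancyIdentifier"),
    ]
    print(tag_allowing_repeats(tokens))
```

The trade-off is that callers can no longer assume every value is a string, which is one reason labeling the second pair as a SubaddressType may be the less disruptive route.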

Curious to hear @fgregg weigh in on this.

Surendra414 commented 7 years ago

Hi jeancochrane, I'm very interested in using this to parse addresses, but I'm running into an issue. If I enter the address "05 St. George Street, Toronto, ON, M5S 3E6, Canada", it gives me a perfect result except that the country shows up in PlaceName. I need the country name as its own component in the parsed output. How can I do this? Please advise.

jeancochrane commented 7 years ago

Hey @Surendra414,

Unfortunately, usaddress currently doesn't support addresses outside of the US. That's probably leading to the bad parse you see here. There are some ideas for major changes that might expand our scope, but it's not currently in the works. For international parsing you could try pypostal, which has much heavier dependencies but might provide better performance for Canadian addresses.