datamade / usaddress

:us: a python library for parsing unstructured United States address strings into address components
https://parserator.datamade.us/usaddress
MIT License

Any tips for training the model?? #214

Closed: sandyahluwalia closed this issue 6 years ago

sandyahluwalia commented 6 years ago

I can't seem to get the model to train properly. Short of adding the exact match, are there any suggested methods for adding an address to labeled.xml so that the model learns from that example and can apply it to other similar cases?

The example I have: I'm trying to train the model to understand that 206 4430 CHATTERTON WAY should be OccupancyIdentifier, AddressNumber, StreetName, StreetNamePostType. I've added dozens of variations and the model doesn't seem to learn. In fact, sometimes it gets worse :-0
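For reference, the labeling described above can be written out as a single training entry. This is only a sketch: it assumes the training file wraps each address in an `<AddressString>` element with one child tag per token, and the helper name `labeled_address` is made up for illustration.

```python
import xml.etree.ElementTree as ET


def labeled_address(tokens_with_labels):
    """Build one <AddressString> element from (token, label) pairs.

    The element layout is an assumption about the training file format.
    """
    address = ET.Element("AddressString")
    last = len(tokens_with_labels) - 1
    for i, (token, label) in enumerate(tokens_with_labels):
        part = ET.SubElement(address, label)
        part.text = token
        # Separate tokens with a single space, except after the last one.
        if i < last:
            part.tail = " "
    return address


example = labeled_address([
    ("206", "OccupancyIdentifier"),
    ("4430", "AddressNumber"),
    ("CHATTERTON", "StreetName"),
    ("WAY", "StreetNamePostType"),
])

xml_line = ET.tostring(example, encoding="unicode")
print(xml_line)
```

Joining the text of the element back together gives the original string, which is a quick way to check that no token was dropped or mislabeled by accident.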

Thanks guys

fgregg commented 6 years ago

That's a very unusual pattern that really differs from every other example in the training set. You have to provide a lot of examples to overcome that.

sandyahluwalia commented 6 years ago

Thanks for the quick response!

Is it really? I thought the standard format was:

suite#-street# streetname streettype, e.g. 905-1 main st

or street# streetname streettype Suite#, e.g. 1 main st. Suite 905 / 1 main st. # 905

I tried adding about 40 examples with slight variations to the data and couldn't get it to work. I can certainly add a ton more. Should there be tons of variation between examples, or only slight variations between fields? Or should the field it's getting wrong be the only thing that changes, with everything else staying the same?
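To make the "tons of variation" option concrete, here is a minimal sketch that varies every field while keeping the label order fixed. It does not settle which strategy trains better; the street name and type lists, the number ranges, and the function name `make_examples` are all placeholders invented for illustration.

```python
import random

# Hypothetical vocabularies; real training data would draw on far more values.
STREET_NAMES = ["CHATTERTON", "MAIN", "OAK", "LAKESHORE"]
STREET_TYPES = ["WAY", "ST", "AVE", "RD"]

LABEL_ORDER = [
    "OccupancyIdentifier",
    "AddressNumber",
    "StreetName",
    "StreetNamePostType",
]


def make_examples(count, seed=0):
    """Return `count` lists of (token, label) pairs in the unit-first pattern."""
    rng = random.Random(seed)  # seeded so the same examples come out each run
    examples = []
    for _ in range(count):
        tokens = [
            str(rng.randint(1, 2500)),   # unit number
            str(rng.randint(1, 9999)),   # street number
            rng.choice(STREET_NAMES),
            rng.choice(STREET_TYPES),
        ]
        examples.append(list(zip(tokens, LABEL_ORDER)))
    return examples


examples = make_examples(5)
for pairs in examples:
    print(" ".join(token for token, _ in pairs))
```

To test the other option from the question, hold three of the fields constant and vary only the one the model gets wrong.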

Big thanks btw for this awesome project.

fgregg commented 6 years ago

No, this is a very unusual pattern for the United States.

The typical pattern is street number, street name, street type, occupancy identifier ...

sandyahluwalia commented 6 years ago

Fair enough. I've added over 3500 pieces of training data to teach the model this pattern and it doesn't seem to take. Anything you can suggest?

fgregg commented 6 years ago

remove the existing training data.
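One possible reading of this advice is to train from a file that holds only the new-pattern examples, rather than appending them to the stock data. The sketch below writes such a file; the `<AddressCollection>` / `<AddressString>` layout is an assumption about the training format, and the file name `unit_first.xml` and the helper `write_training_file` are made up for illustration.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def write_training_file(path, labeled_addresses):
    """Write (token, label) examples to a fresh XML file at `path`.

    Assumed layout: one <AddressCollection> root containing one
    <AddressString> per address, each with one child tag per token.
    """
    root = ET.Element("AddressCollection")
    root.text = "\n"
    for pairs in labeled_addresses:
        address = ET.SubElement(root, "AddressString")
        address.tail = "\n"  # one address per line in the output
        last = len(pairs) - 1
        for i, (token, label) in enumerate(pairs):
            part = ET.SubElement(address, label)
            part.text = token
            if i < last:
                part.tail = " "
    ET.ElementTree(root).write(path, encoding="utf-8", xml_declaration=True)


new_examples = [
    [("206", "OccupancyIdentifier"), ("4430", "AddressNumber"),
     ("CHATTERTON", "StreetName"), ("WAY", "StreetNamePostType")],
    [("12", "OccupancyIdentifier"), ("88", "AddressNumber"),
     ("MAIN", "StreetName"), ("ST", "StreetNamePostType")],
]

# Written to a temporary directory so the original training data is untouched.
out_path = os.path.join(tempfile.mkdtemp(), "unit_first.xml")
write_training_file(out_path, new_examples)

reloaded = ET.parse(out_path).getroot()
print(len(reloaded), "addresses written to", out_path)
```

Keeping the new file separate from the original makes it easy to go back if a model trained only on this pattern does worse on ordinary addresses.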

