datamade / usaddress

:us: a python library for parsing unstructured United States address strings into address components
https://parserator.datamade.us/usaddress
MIT License

Retraining the model: Further Breaking a Parse? #106

Closed BlvdJoe closed 8 years ago

BlvdJoe commented 9 years ago

First off, PLEASE let me know if I'm submitting these questions in the wrong place. Very new to coding and GitHub.

I've read over the training documentation, but is there a way to add an option to further "break" a parse in the training module? For example, I'd like to break "100-18C" into AddressNumber: 100 and OccupancyIdentifier: 18C. It currently parses as AddressNumber: 100-18C.

Oh, and please don't hesitate to tell me if this is a "get better at Python, buckaroo" question rather than an Issue. Thanks.

cathydeng commented 8 years ago

hey @BlvdJoe - this sounds like it should be either (1) a custom post-processing step, applied after an address has already been parsed with usaddress, or (2) a separate address parser with a different tokenizer.
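
A minimal sketch of option (1), for illustration only. It assumes the input is the list of (token, label) pairs that usaddress.parse() returns. The helper name split_number_and_unit and the regex are hypothetical and not part of usaddress. The sample list is written by hand to mirror the parse described in the question, so the snippet runs without usaddress installed.

```python
import re

# Hypothetical helper, not part of usaddress.
# Matches "<digits>-<unit>" where the unit contains at least one letter,
# so "100-18C" is split but a purely numeric "100-18" is left alone.
_NUMBER_UNIT = re.compile(r"^(\d+)-(\d*[A-Za-z]\w*)$")


def split_number_and_unit(parsed):
    """Split hyphenated AddressNumber tokens into number + occupancy."""
    result = []
    for token, label in parsed:
        match = _NUMBER_UNIT.match(token) if label == "AddressNumber" else None
        if match:
            result.append((match.group(1), "AddressNumber"))
            result.append((match.group(2), "OccupancyIdentifier"))
        else:
            # Leave every other token exactly as the parser labeled it.
            result.append((token, label))
    return result


# Hand-written stand-in for the output of usaddress.parse("100-18C Main St").
parsed = [
    ("100-18C", "AddressNumber"),
    ("Main", "StreetName"),
    ("St", "StreetNamePostType"),
]
print(split_number_and_unit(parsed))
```

Requiring a letter in the second part is a deliberate choice: some areas use hyphenated house numbers such as "100-18", and those should probably stay intact. Tokens with trailing punctuation (for example "100-18C,") are not handled here and would need an extra step.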

the tokenize method is what splits an input string into tokens. usaddress splits on spaces and will also break up chunks of text on certain types of punctuation (,;#&()): https://github.com/datamade/usaddress/blob/master/usaddress/__init__.py#L119-L126. a hyphen isn't in that set, so "100-18C" stays a single token and gets a single label. this is something I'd rather not change in usaddress. you could make a new parser by forking usaddress and tweaking the tokenizer, the labels, and the training data to your needs.
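
To make option (2) concrete, here is a rough sketch of a tokenizer that also splits on hyphens. The pattern is illustrative only and is NOT the regex usaddress uses. In particular, it simply discards the punctuation it splits on, which may differ from how the real tokenizer treats trailing punctuation. TOKEN_RE and this tokenize function are hypothetical names for a forked parser.

```python
import re

# Illustrative pattern, not the usaddress regex.
# A token is either a run of characters that are not whitespace,
# not one of , ; # & ( ) and not a hyphen, or a lone "#" or "&".
TOKEN_RE = re.compile(r"[^\s,;#&()\-]+|[#&]")


def tokenize(address):
    """Split an address string into tokens, breaking on hyphens too."""
    return TOKEN_RE.findall(address)


print(tokenize("100-18C Main St, Apt #4"))
# -> ['100', '18C', 'Main', 'St', 'Apt', '#', '4']
```

With a tokenizer like this, "100" and "18C" become separate tokens that can carry separate labels. As noted above, the training data would then need to be tokenized and labeled the same way before retraining.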