Retraining the model: Further Breaking a Parse?

datamade / usaddress

:us: a python library for parsing unstructured United States address strings into address components

MIT License

hey @BlvdJoe - this sounds like it should be either (1) a custom post-processing step, run after an address has already been parsed with usaddress, or (2) a separate address parser with a different tokenizer.

the tokenize method is what splits an input string into tokens. usaddress splits on spaces & also breaks up chunks of text on certain types of punctuation (,;#&()): https://github.com/datamade/usaddress/blob/master/usaddress/__init__.py#L119-L126 - this is something I'd rather not change in usaddress. you could make a new parser by forking usaddress & tweaking the tokenizer, the labels, & the training data to your needs.
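
a rough sketch of what option (1) could look like, in case it helps. it takes a list of (token, label) pairs, which is the shape usaddress.parse returns, and splits some tokens further. the splitting rule (an AddressNumber like "123A" becoming a number plus an AddressNumberSuffix), the function name, and the sample input are all made up for illustration; swap in whatever rule your data actually needs.

```python
import re

# Hypothetical rule for illustration: an AddressNumber such as "123A"
# is split into its digits and a trailing letter suffix.
_NUMBER_WITH_SUFFIX = re.compile(r"^(\d+)([A-Za-z]+)$")


def split_address_number(parsed):
    """Post-process (token, label) pairs, the shape returned by usaddress.parse.

    Tokens that do not match the rule are passed through unchanged.
    """
    result = []
    for token, label in parsed:
        match = None
        if label == "AddressNumber":
            match = _NUMBER_WITH_SUFFIX.match(token)
        if match:
            result.append((match.group(1), "AddressNumber"))
            result.append((match.group(2), "AddressNumberSuffix"))
        else:
            result.append((token, label))
    return result


# Hand-written sample in the shape of usaddress.parse output
# (an assumption for the example, not real model output).
sample = [
    ("123A", "AddressNumber"),
    ("Main", "StreetName"),
    ("St", "StreetNamePostType"),
]
refined = split_address_number(sample)
```

with the real library you would pass `usaddress.parse(address_string)` in place of `sample`. the model & tokenizer stay untouched, so nothing needs retraining.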

Retraining the model: Further Breaking a Parse? #106