datamade / usaddress

:us: a Python library for parsing unstructured United States address strings into address components
https://parserator.datamade.us/usaddress
MIT License

comma included in the name of the parsed StreetNamePostType #171

Closed anastasiaclark closed 7 years ago

anastasiaclark commented 7 years ago

After parsing with usaddress.parse, the token labeled StreetNamePostType still contains a trailing comma, as in 'AVENUE,'.

Also, the state label is applied incorrectly. See the example dataset in the attachment: address_sample.zip

jeancochrane commented 7 years ago

Hey @GeoAC1984,

The parse method doesn't do any post-processing to get rid of punctuation like this – you may get better mileage with the tag method, which strips commas and semicolons after tagging. I'm not totally sure why this isn't a feature of parse, but @fgregg may have more insight. If there's no good reason for it we can go ahead and introduce a stripping method to parse, too.
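To make the difference concrete, here is a minimal sketch of the kind of post-processing described above, applied to parse-style output. It assumes parse returns a list of (token, label) pairs. The helper name strip_token_punctuation and the sample list are hypothetical: the sample is hand-written to match the 'AVENUE,' case reported in this issue, not captured from the library, and the helper is not the library's own implementation.

```python
from typing import List, Tuple

TaggedToken = Tuple[str, str]  # (token, label), the shape assumed for parse output


def strip_token_punctuation(
    pairs: List[TaggedToken], chars: str = " ,;"
) -> List[TaggedToken]:
    """Remove leading/trailing spaces, commas and semicolons from each token.

    Hypothetical helper: it only cleans the token text and never changes labels.
    """
    return [(token.strip(chars), label) for token, label in pairs]


# Hand-written sample shaped like the output reported in this issue
# (the AddressNumber and StreetName labels are assumed for illustration).
sample = [
    ("4315", "AddressNumber"),
    ("WEBSTER", "StreetName"),
    ("AVENUE,", "StreetNamePostType"),
    ("LH", "StateName"),
]

cleaned = strip_token_punctuation(sample)
print(cleaned[2])  # ('AVENUE', 'StreetNamePostType')
```

Stripping only at the ends of each token leaves punctuation inside a token (for example a hyphenated house number) untouched, which is usually what you want for address components.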

I'm very interested in the StateName failure with the LH token in the string 4315 WEBSTER AVENUE, LH. I'm guessing "LH" stands for "Left-Hand side"? Got any more addresses that fit this pattern that we can add in for training?
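One rough way to collect more examples of this pattern is to flag any token labeled StateName that is not a two-letter USPS state code. The sketch below is only an illustration under assumptions: the function name find_suspect_state_tokens is hypothetical, the input is assumed to be a list of (token, label) pairs, and both samples are hand-written rather than real parser output. It is a coarse filter, since a spelled-out state name would also be flagged.

```python
from typing import List, Tuple

# Two-letter codes for the 50 states plus DC.
USPS_STATE_CODES = frozenset(
    """AL AK AZ AR CA CO CT DE FL GA HI ID IL IN IA KS KY LA ME MD MA MI MN
    MS MO MT NE NV NH NJ NM NY NC ND OH OK OR PA RI SC SD TN TX UT VT VA WA
    WV WI WY DC""".split()
)


def find_suspect_state_tokens(pairs: List[Tuple[str, str]]) -> List[str]:
    """Return tokens labeled StateName that are not a known two-letter code."""
    suspects = []
    for token, label in pairs:
        if label != "StateName":
            continue
        # Ignore surrounding punctuation and case before comparing.
        normalized = token.strip(" ,;.").upper()
        if normalized not in USPS_STATE_CODES:
            suspects.append(token)
    return suspects


# Hand-written samples; labels other than the reported StateName are assumed.
reported = [
    ("4315", "AddressNumber"),
    ("WEBSTER", "StreetName"),
    ("AVENUE,", "StreetNamePostType"),
    ("LH", "StateName"),
]
plausible = [("NEW", "PlaceName"), ("YORK,", "PlaceName"), ("NY", "StateName")]

print(find_suspect_state_tokens(reported))   # ['LH']
print(find_suspect_state_tokens(plausible))  # []
```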

anastasiaclark commented 7 years ago

Hi @jeancochrane,

Thanks, the tag method is indeed better suited to my purpose. Here is a bigger address sample from the dataset. The addresses on lines 1447, 1708, 2250, 4603, 4756, 5872 and 6731 were also assigned the state label in error. I believe LH is just an apartment number; this data is real property sales data from the NYC Finance Department.

address_sample.zip

jeancochrane commented 7 years ago

Awesome, thanks @GeoAC1984! I'll add those addresses to the next round of training data and close this when it gets incorporated.