Closed anastasiaclark closed 7 years ago
Hey @GeoAC1984,
The parse method doesn't do any post-processing to get rid of punctuation like this – you may get better mileage with the tag method, which strips commas and semicolons after tagging. I'm not totally sure why this isn't a feature of parse, but @fgregg may have more insight. If there's no good reason for it we can go ahead and introduce a stripping method to parse, too.
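In the meantime, for anyone who needs to keep using parse, here is a minimal sketch of stripping the punctuation yourself. It assumes only that parse returns a list of (token, label) tuples; the helper name strip_trailing_punctuation and the sample list are hypothetical, written by hand for illustration, and are not part of usaddress or actual library output.

```python
def strip_trailing_punctuation(pairs, chars=" ,;"):
    """Return (token, label) pairs with trailing punctuation removed from each token.

    Hypothetical helper, not part of usaddress.
    """
    cleaned = []
    for token, label in pairs:
        stripped = token.rstrip(chars)
        # Keep the original token if stripping would leave nothing (e.g. a lone ",").
        cleaned.append((stripped or token, label))
    return cleaned


# Hand-written sample in the shape that parse returns; not actual library output.
sample = [
    ("4315", "AddressNumber"),
    ("WEBSTER", "StreetName"),
    ("AVENUE,", "StreetNamePostType"),
    ("LH", "StateName"),
]

cleaned = strip_trailing_punctuation(sample)
```

This only cleans the token text. It does not change the labels, so a mislabeled token such as LH stays mislabeled.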
I'm very interested in the StateName failure with the LH token in the string "4315 WEBSTER AVENUE, LH". I'm guessing "LH" stands for "Left-Hand side"? Got any more addresses that fit this pattern that we can add in for training?
Hi @jeancochrane,
Thanks, the tag method is indeed better suited to my purpose. Here is a bigger address sample from the dataset. The addresses on lines 1447, 1708, 2250, 4603, 4756, 5872 and 6731 were also assigned a state token in error. I believe LH is just an apartment number; this data is real property sales from the NYC Finance Department.
Awesome, thanks @GeoAC1984! I'll add those addresses to the next round of training data and close this when it gets incorporated.
After parsing with usaddress.parse, the StreetNamePostType token contains a comma, as in 'AVENUE,'.
Also, the state label is applied incorrectly. See the example dataset in the attachment: address_sample.zip