agrc / sweeper

🧹A cli tool for making data good 🧹
MIT License

Street Name Misspellings #26

Open stdavis opened 5 years ago

stdavis commented 5 years ago

From @steveoh

Get the unique street names from our roads data and address points. Then parse the addresses into their parts and check whether each street name exists in our data, or whether something similar does, using Levenshtein distance to catch misspellings.
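A minimal sketch of that idea, assuming the street names have already been parsed out of the addresses and that the roads data is treated as the known list. The function names (`levenshtein`, `find_suspect_names`) and the `max_distance` threshold are hypothetical, not part of sweeper:

```python
def levenshtein(a: str, b: str) -> int:
    """Return the edit distance (insertions, deletions, substitutions) between a and b."""
    if len(a) < len(b):
        a, b = b, a
    # previous[j] holds the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete char_a
                    current[j - 1] + 1,      # insert char_b
                    previous[j - 1] + cost,  # substitute (or match)
                )
            )
        previous = current
    return previous[-1]


def find_suspect_names(address_names, road_names, max_distance=2):
    """Map each address street name missing from roads to the near matches found in roads.

    An empty list means the name has no close counterpart at all.
    """
    known = {name.upper() for name in road_names}
    suspects = {}
    for name in {n.upper() for n in address_names}:
        if name in known:
            continue
        suspects[name] = sorted(r for r in known if levenshtein(name, r) <= max_distance)
    return suspects
```

For example, `find_suspect_names(["Main", "Mian"], ["MAIN"])` would flag `MIAN` with `MAIN` as its likely intended spelling. The threshold would need tuning: too high and short names such as `ELM` and `OAK` start matching each other.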

From @ZachBeck

[Look] for compound word misspellings like Switchback Way vs Switch Back Way

Not sure of the best way to do this. Perhaps compare multi-word street names, concatenated into a single word, against the known list of street names? Or maybe something like Levenshtein distance could handle this.
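One possible way to sketch the concatenation approach: strip all whitespace from both sets of names and treat names that collide on the stripped form, but differ in their spaced form, as compound-word mismatches. The helper names (`collapse`, `normalize`, `find_compound_mismatches`) are hypothetical, and roads are assumed to be the known list only for illustration:

```python
from collections import defaultdict


def collapse(name: str) -> str:
    """Uppercase a name and remove all whitespace: 'Switch Back Way' -> 'SWITCHBACKWAY'."""
    return "".join(name.upper().split())


def normalize(name: str) -> str:
    """Uppercase a name and squeeze runs of whitespace to single spaces."""
    return " ".join(name.upper().split())


def find_compound_mismatches(address_names, road_names):
    """Map each address street name to road names that differ from it only by spacing."""
    roads_by_key = defaultdict(set)
    for road in road_names:
        roads_by_key[collapse(road)].add(normalize(road))

    mismatches = {}
    for name in address_names:
        candidates = roads_by_key.get(collapse(name), set())
        # Same letters, different spacing: a likely compound-word mismatch
        if candidates and normalize(name) not in candidates:
            mismatches[normalize(name)] = sorted(candidates)
    return mismatches
```

A plain Levenshtein comparison would also catch this case, since `SWITCHBACK WAY` and `SWITCH BACK WAY` are one insertion apart, but the whitespace check is exact and avoids flagging unrelated names that happen to be one edit apart.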

ZachBeck commented 5 years ago

This one is tough because it's not always clear which version it should be. All that you know is that the street name in address points is different from the roads.

stdavis commented 5 years ago

Is the roads feature class the accepted source of truth for road names over the address point data? Or are you saying that it's not clear which is correct? In my mind, we need to just pick one (from the parser perspective at least) as the single source of truth.

ZachBeck commented 5 years ago

That's a tough one... I'd say which is better (roads vs. address points) depends on the county. It's not always clear which one is correct.

stdavis commented 5 years ago

For this project, we may just need to pick the one that we hope it ends up as and go with it. I'm assuming that would be roads, but I'm not the one to make the call. We can talk more as this project moves forward.

steveoh commented 5 years ago

It would help us choose if we had a little more information. @ZachBeck, can you list which counties have better data in the address points vs. the roads? We could then pick based on the amount of data, etc. I know that Greg, Erik, and Zach are trying to resolve the discrepancies, so hopefully they will match up better in the future.

stdavis commented 6 months ago

@gregbunce or @ZachBeck Do the street names align better between the roads and address point data these days?

gregbunce commented 6 months ago

For the most part, it's pretty good, but they don't align perfectly. We (Zach) have some code to check for alignment, but we don't run it very often.

steveoh commented 6 months ago

Is there a reason not to run it after every update?

ZachBeck commented 6 months ago

For the most part, I run it every time I update a county's address points. Reconciling the differences would be a matter of looking at each individual road in Google Street View to figure out where the problem is.