A current problem in finding duplicates is that many addresses are unstandardized. There are many flavors of this:
Errors in text substitutions: At some point in compilation (perhaps going back to the source data), text substitutions were applied too aggressively and rewrote letters inside longer words. Examples include:
DuBois BP -> DuBlvdis BP
1001 Washington Pike -> 1001 WasHighwayngton Pikeke
Some of the addresses have been truncated (1 Urbano Way -> 1 Urbano Wa)
Many of the addresses have extra information
Many have parentheses, such as 2201 Salisbury Street (At Spray Park location (next to baseball field))
Encoding errors: Jodinko’s Farm Market
These come in addition to the expected inconsistencies, such as differences in abbreviation (St. vs Street)
Missing street suffixes
Missing street numbers (Perrysville Avenue (Observatory Entrance))
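The over-eager substitutions listed above are consistent with abbreviation rules being applied as plain substring replacements. The sketch below reproduces the two corrupted examples under that assumption. The rule table (`Bo`, `hi`, `Pi`) is inferred from the examples and is hypothetical, since we do not know the rules that were actually used. `bounded_expand` shows one possible guard, matching only whole words.

```python
import re

# Hypothetical rules, inferred from the corrupted examples; the real ones are unknown.
RULES = {"Bo": "Blvd", "hi": "Highway", "Pi": "Pike"}

def naive_expand(address):
    """Replace every occurrence of each abbreviation, even inside longer words."""
    for short, full in RULES.items():
        address = address.replace(short, full)
    return address

def bounded_expand(address):
    """Replace an abbreviation only when it stands alone as a whole word."""
    for short, full in RULES.items():
        address = re.sub(rf"\b{re.escape(short)}\b", full, address)
    return address

print(naive_expand("DuBois BP"))               # DuBlvdis BP
print(naive_expand("1001 Washington Pike"))    # 1001 WasHighwayngton Pikeke
print(bounded_expand("1001 Washington Pike"))  # 1001 Washington Pike
```

This only explains how such errors can arise. It does not repair addresses that are already corrupted.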
We've decided to ignore the address issues and just use lat/long for deduping (not exact lat/longs, but the general area, to flag potential location duplicates).
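A minimal sketch of what "general area" matching could look like: round each coordinate to a fixed number of decimal places and flag records that land in the same cell. The function name `flag_nearby`, the `(name, lat, lon)` record layout, the default `precision=3` (cells of roughly 100 m in latitude), and the sample coordinates are all assumptions for illustration, not our actual pipeline or data.

```python
from collections import defaultdict

def flag_nearby(records, precision=3):
    """Group records whose rounded (lat, lon) fall in the same grid cell.

    records: iterable of (name, lat, lon) tuples (hypothetical layout).
    Returns a list of name groups with more than one member.
    """
    cells = defaultdict(list)
    for name, lat, lon in records:
        key = (round(lat, precision), round(lon, precision))
        cells[key].append(name)
    return [names for names in cells.values() if len(names) > 1]

# Made-up sample points: A and B are a few metres apart, C is far away.
sample = [
    ("A", 40.44061, -79.99589),
    ("B", 40.44058, -79.99612),
    ("C", 40.50000, -80.10000),
]
print(flag_nearby(sample))  # [['A', 'B']]
```

One caveat with a fixed grid is that two nearby points can straddle a cell boundary and be missed, so flagged groups are candidates for review and not a complete list of duplicates.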