Open mraross opened 6 years ago
We cannot identify street names or localities outside the province, so we cannot tell whether a province name in the input is actually being used as a province name. "blah blah Ontario" is just as likely to be an occupant or site name on Ontario Street as it is a street address in Ontario.
As a first step, we should focus on reliably identifying alien addresses and assigning an address.isAlien fault to them instead of matching to a false positive address somewhere in BC. Here's a current example of such a false positive using Geocoder 4.1:
122 Albert St, Port Melbourne, Victoria australia
matches to this:
122 Lambert St, Quesnel, BC.
At least with accurate alien detection, a script can filter aliens out of the batch geocoder results file and apply a global geocoder to them.
@bstratto Feel free to add a comment describing potential alien detectors you've discovered in your rejected address analysis.
Below are a few patterns for identifying addresses in other countries. This is based on analysis of the 13 million HealthIdeas addresses and reflects the examples available in that dataset.
These patterns provide only a subset of the HealthIdeas addresses in these countries. There are many more addresses for which there is no “safe” pattern (i.e. a pattern would have the potential of also eliminating addresses where a BC location is included).
Pattern for addresses in Germany: The HealthIdeas addresses show that people use the general formats:
- German zip code + “, GERMANY, BC”
- German zip code + German locality + “, GERMANY, BC”

To make this pattern safe, we would have to check that the text in German locality is not in fact the name of a BC locality. For example, there could be an address “10319 HOPE, GERMANY, BC”
The pattern below was tested with HealthIdeas and returns only German alien addresses:
- The first 5 characters are numbers
- The length of the address is <= 30 (longer addressStrings tend to include BC address text)
- The address ends with “, GERMANY, BC”
- The text preceding “, GERMANY, BC” is not the abbreviation for a street type
- The text preceding “, GERMANY, BC” is not a known BC locality

Below are some examples. This pattern identifies 65 addresses in the HealthIdeas dataset:
addressString | Standardized address | Score |
---|---|---|
55131 MAINZ, GERMANY, BC | German Rd, Flatrock, BC | 55 |
60486 FRANKFURT, GERMANY, BC | BC | 1 |
61350 BAD HOMBERG, GERMANY, BC | BC | 1 |
27612 LOXO ZECHT, GERMANY, BC | Zacht 5 near Kanaka Bar, BC | 52 |
28211 BREMEN, GERMANY, BC | 28211 Herman S. Braich Blvd, Mission, BC | 76 |
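
The German pattern above could be sketched roughly as follows. This is a minimal illustration, not Geocoder code: `is_german_alien`, `BC_LOCALITIES`, and `STREET_TYPES` are hypothetical names, and the two sets are tiny stand-ins for the locality and street-type lists the geocoder already holds. The sketch also assumes that "text preceding" means the text between the 5-digit code and “, GERMANY, BC”.

```python
import re

# Hypothetical stand-ins; real lists would come from the geocoder's own data.
BC_LOCALITIES = {"HOPE", "MISSION", "VANCOUVER", "VICTORIA"}
STREET_TYPES = {"ST", "RD", "AVE", "BLVD", "DR", "PL", "WAY"}

SUFFIX = ", GERMANY, BC"


def is_german_alien(address_string: str) -> bool:
    """Return True if the input matches the German alien pattern."""
    s = address_string.strip().upper()
    # Longer addressStrings tend to include BC address text.
    if len(s) > 30:
        return False
    if not s.endswith(SUFFIX):
        return False
    # The first 5 characters must be digits (the German postal code).
    if not re.match(r"[0-9]{5}", s):
        return False
    # Text between the postal code and the suffix (assumed interpretation).
    rest = s[: -len(SUFFIX)][5:].strip(" ,")
    if rest in BC_LOCALITIES:
        return False
    words = rest.split()
    if words and words[-1] in STREET_TYPES:
        return False
    return True


for example in ("55131 MAINZ, GERMANY, BC", "10319 HOPE, GERMANY, BC"):
    print(example, "->", is_german_alien(example))
```

With these stand-in lists, the MAINZ example is flagged as alien and the HOPE example is not, because HOPE is a known BC locality.
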
Pattern for addresses in England: The HealthIdeas addresses show that people use the general formats:
- “, ENGLAND, BC”
- England locality + “ ENGLAND, BC” or England locality + “, ENGLAND, BC”

To make this pattern safe, we would have to check that the text in
The pattern below was tested with HealthIdeas and returns only England alien addresses:
- The first character is not a number
- The length of the address is <= 25 (longer addressStrings tend to include BC address text)
- The address ends with “ENGLAND, BC”
- The text preceding “ENGLAND, BC” does not include a known BC locality

Below are some examples. This pattern identifies 173 addresses in the HealthIdeas dataset:
addressString | Standardized address | Score |
---|---|---|
VISITOR FROM, ENGLAND, BC | Vision Way, Langford, BC | 24 |
WELSHPOOL, ENGLAND, BC | BC | 1 |
VISITING, ENGLAND, BC | BC | 1 |
WEST SUSSEX, ENGLAND, BC | West Boulevard, Vancouver, BC | 69 |
, KENT ENGLAND, BC | England Rd, Courtenay, BC | 64 |
, ENGLAND, BC | England Ave, Courtenay, BC | 62 |
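
A rough sketch of the England pattern, under the same caveats: `is_england_alien` and `BC_LOCALITIES` are hypothetical, and the locality set is a small stand-in for the real list. "Does not include a known BC locality" is interpreted here as a whole-word search in the text before “ENGLAND, BC”, which is an assumption.

```python
import re

# Hypothetical stand-in; the real list would come from the geocoder's own data.
BC_LOCALITIES = {"HOPE", "VANCOUVER", "VICTORIA", "COURTENAY"}

SUFFIX = "ENGLAND, BC"


def is_england_alien(address_string: str) -> bool:
    """Return True if the input matches the England alien pattern."""
    s = address_string.upper()
    # The first character must not be a digit.
    if not s or s[0] in "0123456789":
        return False
    # Longer addressStrings tend to include BC address text.
    if len(s) > 25:
        return False
    if not s.endswith(SUFFIX):
        return False
    preceding = s[: -len(SUFFIX)]
    # Reject if any known BC locality appears as a whole word (assumed reading).
    for locality in BC_LOCALITIES:
        if re.search(r"\b" + re.escape(locality) + r"\b", preceding):
            return False
    return True


for example in ("WEST SUSSEX, ENGLAND, BC", "HOPE, ENGLAND, BC"):
    print(example, "->", is_england_alien(example))
```
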
Pattern for addresses in the United States: The HealthIdeas addresses show that people use the general formats:
- text + “, USA, BC”
- text + “, US, BC”
- 6 to 10 numeric digits + text + “ , USA, BC”

To make these patterns safe, we would have to check that the text is not in fact the name of a BC locality. For example, there could be an address “HOPE, USA, BC”. This text, however, may also include localities that have similar names or exist in both US and BC, such as “MT VERNON, USA, BC”. Geocoder would have to “make a call” regarding these.
The below patterns were tested with HealthIdeas and return only United States alien addresses:
- Pattern 1: Non-numeric (106 addresses found in HealthIdeas)
  - The first 2 characters are not numbers
  - The length of the address is <= 19 (longer addressStrings tend to include BC address text)
  - The address ends with “ US, BC” or “ USA, BC”
  - The text preceding “ US, BC” (or “ USA, BC”) is not a known BC locality, “UVIC”, or “UBC”
- Pattern 2: Numeric (52 addresses found in HealthIdeas)
  - The first 6 characters are numbers
  - The length of the address is <= 19 (longer addressStrings tend to include BC address text)
  - The address ends with “, USA, BC”
  - The characters in positions 7-10 are each one of these: space, comma, “U”, “S”

Below are some examples. Numeric addresses were redacted.
addressString | Standardized address | Score |
---|---|---|
MT VERNON, USA, BC | Mt Atkinson Pl, Vernon, BC | 69 |
ALTONA PA, USA, BC | Pa-aat 6 near Pitt Island, BC | 54 |
, US, BC | BC | 1 |
, USA, BC | BC | 1 |
ARIZONA, USA, BC | BC | 1 |
BBBBBB, USA, BC | BC | 1 |
9999999, USA, BC | BC | 1 |
9999999, USA, BC | BC | 1 |
9999999 01, USA, BC | BC | 1 |
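
The two United States patterns could be sketched as below. All function names and the `BC_LOCALITIES` set are hypothetical stand-ins. Two readings are assumptions: "the first 2 characters are not numbers" is taken to mean neither character is a digit, and "positions 7-10" is read as 1-indexed. The redacted numeric rows in the table are not used as inputs, since redaction may have changed their digit counts.

```python
import re

# Hypothetical stand-in; the real list would come from the geocoder's own data.
BC_LOCALITIES = {"HOPE", "VERNON", "VANCOUVER", "VICTORIA"}
NOT_ALIEN_TEXT = {"UVIC", "UBC"}
DIGITS = "0123456789"


def is_us_alien_non_numeric(address_string: str) -> bool:
    """Pattern 1: short non-numeric text ending in ' US, BC' or ' USA, BC'."""
    s = address_string.upper()
    if len(s) > 19:
        return False
    # Assumed reading: neither of the first two characters is a digit.
    if any(ch in DIGITS for ch in s[:2]):
        return False
    for suffix in (" USA, BC", " US, BC"):
        if s.endswith(suffix):
            preceding = s[: -len(suffix)].strip(" ,")
            return preceding not in BC_LOCALITIES and preceding not in NOT_ALIEN_TEXT
    return False


def is_us_alien_numeric(address_string: str) -> bool:
    """Pattern 2: six leading digits, then only space, comma, 'U' or 'S'."""
    s = address_string.upper()
    if len(s) > 19:
        return False
    if not re.match(r"[0-9]{6}", s):
        return False
    if not s.endswith(", USA, BC"):
        return False
    # Positions 7-10 (1-indexed) are s[6:10].
    return all(ch in " ,US" for ch in s[6:10])


def is_us_alien(address_string: str) -> bool:
    return is_us_alien_non_numeric(address_string) or is_us_alien_numeric(address_string)


for example in ("MT VERNON, USA, BC", "HOPE, USA, BC", "123456, USA, BC"):
    print(example, "->", is_us_alien(example))
```

Pattern 1 compares the whole preceding text against the locality list, so “MT VERNON” is still flagged even though “VERNON” is a BC locality; handling such near matches is the “make a call” case described above.
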
The Ministry of Advanced Education has out-of-province addresses that are being mangled by the current geocoder. When the geocoder is faced with an address outside BC, it tries to interpret it as an address somewhere within BC, sometimes to hilarious effect. For example:
13 oakwood ave toronto on
becomes:
Premier, BC
Ministry of Advanced Education would be happy if an out of province address returned a fullAddress of ON, CA with a matchPrecision of Province. If an input address is outside of Canada, fullAddress should return the ISO Alpha-2 country code and a matchPrecision of Country.
Here is one approach to the problem:
Add a global populated places table to the current geocoder, load it from geonames.org, and add support for a parameter that indicates address may be located outside of BC. Here are some details:
scopeGlobal is a new parameter that if true, indicates addressString might be located outside the jurisdiction of the geocoder (e.g., another province, another country).
If scopeGlobal=false (the default), the geocoder assumes the address is located within the geocoder's jurisdiction (e.g., BC). This is the current geocoder's behavior.
If scopeGlobal=true, the geocoder will check whether the input address is located within a province other than BC or a country other than Canada. If the input address is found outside of BC but within Canada, return a fullAddress of the ISO sub-country code (e.g., ON) plus the ISO Alpha-2 country code (e.g., CA) and a matchPrecision of Province. If the input address is found outside of Canada, return a fullAddress of the ISO Alpha-2 country code and a matchPrecision of Country. In both cases, also raise an address.outsideJurisdiction fault with a penalty of 1.
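
To make the proposed scopeGlobal=true behaviour concrete, here is a minimal sketch of how the result could be assembled once a lookup (for example, against the proposed populated places table) has resolved a country and sub-country code. Everything here is hypothetical: the function name, the dictionary layout, and the assumption that the lookup has already happened. Only the fullAddress, matchPrecision, and fault values come from the proposal above.

```python
from typing import Optional


def out_of_jurisdiction_result(
    country_code: str,
    subcountry_code: Optional[str] = None,
    home_country: str = "CA",
    home_subcountry: str = "BC",
) -> Optional[dict]:
    """Build the proposed result for an address outside the jurisdiction.

    Returns None when the codes do not place the address outside the
    jurisdiction, so normal geocoding would proceed.
    """
    if country_code == home_country:
        if subcountry_code is None or subcountry_code == home_subcountry:
            return None
        # Another province: sub-country code plus country code.
        full_address = f"{subcountry_code}, {country_code}"
        precision = "Province"
    else:
        # Another country: country code only.
        full_address = country_code
        precision = "Country"
    return {
        "fullAddress": full_address,
        "matchPrecision": precision,
        "faults": [{"fault": "address.outsideJurisdiction", "penalty": 1}],
    }


print(out_of_jurisdiction_result("CA", "ON"))
print(out_of_jurisdiction_result("AU", "VIC"))
```

Under these assumptions, “13 oakwood ave toronto on” would yield a fullAddress of “ON, CA” with a matchPrecision of Province, and the Port Melbourne example would yield “AU” with a matchPrecision of Country.
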
We could also set lat/lon to province or country point. This will require province and country location tables.
We could recognize only ISO country and sub-country codes and rely on abbreviation mappings to handle common country names such as Canada, Japan, South Korea, China, and United States of America.