Add support for glued and unglued words in site, street, and locality name matching

gleeming commented 6 years ago

A compound word is a word composed of sub-words where a sub-word is also a word. An example of a compound word is Mapleridge which consists of sub-words "maple" and "ridge". A streetName based on Mapleridge may appear in ITN in two forms:

 1. Mapleridge

 2. Maple Ridge

The geocoder should be able to match an input address to either form.

Currently user input like Birds Eye Dr Duncan BC will not cleanly match to the concatenated name known within ITN: Birdseye Dr. It is worth considering if it is possible to change the geocoder to accommodate this type of match.

Perhaps there is a way to change the parser to accommodate many-to-one word mappings in the street name or to introduce logic that ignores spaces when looking up a street candidate.

This is needed by all clients to reduce the number of poor matches they have to deal with.

mraross commented 6 years ago

Would making use of an embedded space metacharacter help? For example, Birds^eye or Trans^Canada.

mraross commented 5 years ago

Chris Hodgson wrote:

GSR includes the address "#207-5462 transcanada hwy duncan bc" which will not match to the official name of "Trans-Canada Hwy" because internally the "-" is treated as a space. "Trans Canada Hwy" would work. A similar problem exists with compound words, e.g., "Hillcrest" vs. "Hill Crest" or "Searidge" vs. "Sea Ridge". Previously these cases were handled by aliases generated in the processing, which has now been removed in favor of run-time handling of multi-word matching, but the current implementation doesn't handle this case. It may be possible to add the compounded words to the dictionary, effectively spell-correcting from "transcanada" to "trans canada". However, some work will need to be done to handle the fact that the number of words has changed, as this is not currently allowed for.

The reverse case, where the official street name is a compound word (e.g., "hillcrest") and the input is given as multiple words (e.g., "hill crest"), is even harder to handle. Users entering the name into an auto-complete field will potentially get some help with these cases.

Needed by all clients to reduce the number of missed matches by the geocoder.

cmhodgson commented 5 years ago

An approach where all the spaces are removed from fields for matching purposes might work, then we compare the versions with spaces to see if there is a best match, and/or return appropriate match faults for missing/added spaces.
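A minimal sketch of the space-stripping idea, for discussion only. The function names and the fault labels (`ADDED_SPACE`, `MISSING_SPACE`, `MOVED_SPACE`) are hypothetical and are not taken from the geocoder code. It indexes official names by their space-free form, then compares the spaced forms to decide which fault, if any, to report.

```python
from collections import defaultdict

def squash(name):
    """Lowercase and keep only letters/digits so glued and unglued forms collide."""
    return "".join(ch for ch in name.lower() if ch.isalnum())

def spaced(name):
    """Lowercase, treat '-' as a space, and collapse runs of whitespace."""
    return " ".join(name.lower().replace("-", " ").split())

def build_index(official_names):
    """Map each space-free key to the official names that produce it."""
    index = defaultdict(list)
    for name in official_names:
        index[squash(name)].append(name)
    return index

def lookup(index, query):
    """Return (official_name, fault) pairs; fault is None for an exact match."""
    q = spaced(query)
    matches = []
    for name in index.get(squash(query), []):
        n = spaced(name)
        if q == n:
            fault = None
        elif q.count(" ") > n.count(" "):
            fault = "ADDED_SPACE"      # hypothetical fault label
        elif q.count(" ") < n.count(" "):
            fault = "MISSING_SPACE"    # hypothetical fault label
        else:
            fault = "MOVED_SPACE"      # hypothetical fault label
        matches.append((name, fault))
    # Exact matches (no fault) sort ahead of faulted ones.
    matches.sort(key=lambda m: m[1] is not None)
    return matches

index = build_index(["Birdseye", "Trans-Canada", "Maple Ridge"])
print(lookup(index, "Birds Eye"))    # [('Birdseye', 'ADDED_SPACE')]
print(lookup(index, "transcanada"))  # [('Trans-Canada', 'MISSING_SPACE')]
```

One trade-off of this approach is that unrelated names that happen to share a space-free key would also collide, so the comparison step has to carry the ranking.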

cmhodgson commented 5 years ago

This problem breaks down into 2 parts: 1) identify compound words in input and base data 2) normalize compound words in both input and base data (by either inserting spaces between the compound words, or removing all spaces between words) for matching/lookup

In order to identify compound words, we have to compare them against a dictionary. If we identify all compound words in the base data using a complete dictionary and store them as space-separated words for lookup and matching, then to identify compound words in the input we only need the dictionary of words in our base data. I think this is the best approach.

mraross commented 5 years ago

Just to confirm my understanding, are you proposing a new compound word dictionary configuration parameter that contains all compound words and their split form as in:

 Mapleridge => Maple Ridge

 Cherryridge => Cherry Ridge

cmhodgson commented 5 years ago

No. The code would identify compound words based on them being two dictionary words stuck together, where the dictionary is some standard "complete" English dictionary. We would need the code to do that identification somewhere anyway, so why not make it part of the app? I guess for expedience, the compound identification and manipulation could happen in a prep phase and be passed through in the data (although a more aggressive data prep and storage format is already a ToDo for reduced startup time).
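To make the "two dictionary words stuck together" test concrete, here is a rough sketch. The tiny `DICTIONARY` set stands in for a complete English dictionary, and the function names and the 3-character minimum are assumptions for illustration, not existing geocoder code.

```python
# Stand-in for a "complete" English dictionary (assumption for illustration).
DICTIONARY = {"maple", "ridge", "hill", "crest", "sea", "sear", "cherry"}

def split_compound(word, dictionary, min_len=3):
    """Return every (left, right) split where both halves are dictionary words."""
    w = word.lower()
    return [
        (w[:i], w[i:])
        for i in range(min_len, len(w) - min_len + 1)
        if w[:i] in dictionary and w[i:] in dictionary
    ]

def normalize(name, dictionary):
    """Store or look up a name in space-separated form.

    A token is split only when exactly one two-word split exists;
    ambiguous or unsplittable tokens are left as they are.
    """
    out = []
    for token in name.lower().split():
        splits = split_compound(token, dictionary)
        out.extend(splits[0] if len(splits) == 1 else [token])
    return " ".join(out)

print(split_compound("Searidge", DICTIONARY))  # [('sea', 'ridge')]
print(normalize("Mapleridge Dr", DICTIONARY))  # maple ridge dr
print(normalize("Maple Ridge Dr", DICTIONARY)) # maple ridge dr
```

Because both the glued and unglued spellings normalize to the same string, the same function could be applied to base data in a prep phase and to input at run time, with the run-time dictionary limited to words seen in the base data.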

cmhodgson commented 4 years ago

I just realized that this approach is getting awfully close to using n-grams; we're just using variable-length n-grams where the length is based on the word breakdown of the compound words. If we took a step further and based it on phonemes we'd be closer to soundex, and a step further would be tri-grams or bi-grams. N-grams are the solution to the general case of this problem, where you have no dictionary to base the breakdown on.

mraross commented 4 years ago

I think an N-gram based algorithm would handle the word split problem well and handle simple spelling mistakes but I wouldn't try phonemes. From our experience with the geonames service, phonetic approaches like soundex return too many false positives.
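As a rough illustration of why n-grams tolerate a missing or added space, here is a sketch using character bigrams and the Dice coefficient. This is a deliberately simple stand-in, not the n-gram distance from any particular paper or library, and the function names are hypothetical.

```python
def ngrams(text, n=2):
    """Set of character n-grams of the lowercased text (no padding)."""
    s = text.lower()
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def dice_similarity(a, b, n=2):
    """Dice coefficient over character n-gram sets, in the range 0.0 to 1.0."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga and not gb:
        # Both strings are shorter than n; fall back to plain equality.
        return 1.0 if a.lower() == b.lower() else 0.0
    return 2 * len(ga & gb) / (len(ga) + len(gb))

# The glued and unglued forms share most of their bigrams, so they score
# much closer to each other than to a different name with the same prefix.
print(dice_similarity("hillcrest", "hill crest"))
print(dice_similarity("hillcrest", "hillside"))
```

The inserted space only disturbs the bigrams that touch it, which is why the split form still scores high, and the same property gives some tolerance for simple spelling mistakes.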

mraross commented 4 years ago

Ngram similarity and distance http://www.cs.ualberta.ca/~kondrak/papers/spire05.pdf

Implementation in lucene ftp://ftp.netapp.com/frm-ntap/opensource/Active_IQ_Performance_Analytics_Services/2.0/opensource_packages/jar_sources/lucene-suggest-5.5.4.jar/org/apache/lucene/search/spell/NGramDistance.java

https://lucene.apache.org/core/2_9_4/api/contrib-spellchecker/org/apache/lucene/search/spell/NGramDistance.html

gleeming commented 3 years ago

I added a Python module called cwsplit to the FME prep scripts. This will need to be installed on the FME government prep environment. It isn't perfect, as it returns only one response: a set of split words that has the largest subword it could find starting at the first character. So we get Sear Id Ge back instead of Sea Ridge for Searidge. However, it appears to work very well in most cases and was successful on thousands of entries as per the following logic:

find name body candidates that do not have a space or leading digit
apply cwsplit with the English dictionary option and a minimum 3-character first word
reject results that were unsplittable or had any returned word that was too short (1 or 2 characters)
generate a street name alias / name record / locality centroid entry for each successful candidate
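The steps above can be sketched as follows. This does not use or reproduce the cwsplit API; `greedy_split` is a hypothetical stand-in that mimics the "largest subword starting at the first character" behaviour, and the small dictionary is an assumption chosen to show both a good and a bad split.

```python
# Stand-in dictionary (assumption); a real run would use a full English word list.
DICTIONARY = {"sea", "sear", "ridge", "id", "ge", "cedar", "hill"}

def greedy_split(word, dictionary, min_first=3):
    """Greedy longest-prefix split; returns None when the word is unsplittable."""
    w = word.lower()
    parts, pos = [], 0
    while pos < len(w):
        shortest = min_first if pos == 0 else 1
        # Try the longest remaining substring first, then shorter ones.
        for end in range(len(w), pos + shortest - 1, -1):
            if w[pos:end] in dictionary:
                parts.append(w[pos:end])
                pos = end
                break
        else:
            return None  # no dictionary word starts at this position
    return parts if len(parts) > 1 else None

def make_alias(name_body, dictionary):
    """Apply the candidate filter, the split, and the short-word rejection."""
    if " " in name_body or name_body[:1].isdigit():
        return None  # not a candidate: already has a space or a leading digit
    parts = greedy_split(name_body, dictionary)
    if parts is None or any(len(p) < 3 for p in parts):
        return None  # unsplittable, or contains a 1- or 2-character word
    return " ".join(p.capitalize() for p in parts)

print(greedy_split("Searidge", DICTIONARY))  # ['sear', 'id', 'ge']
print(make_alias("Searidge", DICTIONARY))    # None
print(make_alias("Cedarhill", DICTIONARY))   # Cedar Hill
```

In this sketch the short-word rule is what keeps the bad Searidge split out of the alias list, at the cost of producing no alias for it at all.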

Examples: Alex And Ria, Hub Bard

mraross commented 3 years ago

@gleeming

input	output	expected
3821 Cedarhill Rd, Saanich, BC	3821 Edgehill Ave, Abbotsford, BC	3821 Cedar Hill Rd, Saanich, BC
8514 Horse Shoe Bay Rd, Anglemont, BC	Anglemont, BC	8514 Horseshoe Bay Rd, Anglemont, BC

cmhodgson commented 3 years ago

We are not currently separating glued compound words in query input, so we do not separate "cedarhill" into "cedar hill", so it can't make this match. This is also not simple to do, as it would change the number of words (tokens) in the input, and that is not something that we can do without significant changes to the lexing and parsing process.

mraross commented 3 years ago

As discussed by phone with Chris this morning, Graeme had a script that combined street name words into word pairs and added the resulting street names to road segments as street name aliases (e.g., Cedar Hill Rd => Cedarhill Rd). This seems to have been dropped from the 4.1 Silver data prep scripts. Please estimate the cost of restoring this script.
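For reference, a minimal sketch of what a word-pair gluing step could look like. This is not the original script; the function name is hypothetical, and it assumes the name body words and the street type arrive as separate fields so the type is never glued.

```python
def glued_aliases(name_words, street_type):
    """Glue each adjacent pair of name body words and return the alias names."""
    aliases = []
    for i in range(len(name_words) - 1):
        glued = name_words[i] + name_words[i + 1].lower()
        words = name_words[:i] + [glued] + name_words[i + 2:]
        aliases.append(" ".join(words + [street_type]))
    return aliases

print(glued_aliases(["Cedar", "Hill"], "Rd"))
# ['Cedarhill Rd']
print(glued_aliases(["Horse", "Shoe", "Bay"], "Rd"))
# ['Horseshoe Bay Rd', 'Horse Shoebay Rd']
```

A three-word name yields one alias per adjacent pair, so some of the generated aliases will be unlikely spellings.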

Also discussed was the need to support a configurable list of compound words to augment the automated splitting script so that when we find an example that the splitter doesn't handle (e.g., Horseshoe), we can add:

 Horseshoe => Horse Shoe

to the list. Please estimate the cost of maintaining (in the admin app?) and using this list.
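A possible shape for that list, sketched under assumptions: one `Glued => Split Form` entry per line, `#` for comments, and manual entries taking priority over the automated splitter. The file format, function names, and the `auto_split` callback are all hypothetical.

```python
def parse_overrides(text):
    """Parse 'Glued => Split Form' lines into a dict keyed by the lowercased glued word."""
    overrides = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        glued, sep, split_form = line.partition("=>")
        if not sep:
            continue  # skip lines that have no '=>' separator
        overrides[glued.strip().lower()] = split_form.strip()
    return overrides

def alias_for(name_body, overrides, auto_split=None):
    """Prefer a manual override; otherwise fall back to the automated splitter."""
    manual = overrides.get(name_body.lower())
    if manual is not None:
        return manual
    return auto_split(name_body) if auto_split else None

config = """
# Compound words the automated splitter does not handle
Horseshoe => Horse Shoe
"""
overrides = parse_overrides(config)
print(alias_for("Horseshoe", overrides))  # Horse Shoe
```

Keeping the list as plain text means it could be edited by hand or through the admin app without changing the prep code.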

bstratto commented 3 years ago

Here are additional examples of glued words found in the HLTH file. The numbers of addresses provided below are based on a sample of 695,343 HLTH addresses. A number of these score above 90. Where they don't, the addition or removal of the space would be enough to raise the score above 90 in many cases.

Address with glued/unglued street name => Number of addresses in sample
3559 BRENDALEE RD, WESTBANK, BC => 56
1502 EAGLERUN DR, BRACKENDALE, BC => 14
5702 HIGHWAY3A, WYNNDEL, BC => 8
598 SANDYHOOK RD, SECHELT, BC => 23
B 7093 THUNDERBAY, POWELL RIVER, BC => 5
4012 JINGLEPOT R, NANAIMO, BC => 23
412 2560 DEPARTUREBAY, NANAIMO, BC => 3
980 CHERYLANN PARK RD, GIBSONS, BC => 11
2593 STONESBAY, FORT ST JAMES, BC => 28
3861 HUDSONBAY MOUNTAIN RD, SMITHERS, BC => 7
2660 FOLKE STONE WAY, WEST VANCOUVER, BC => 11

cmhodgson commented 3 years ago

What exactly do the numbers mean in the above table? Were there 56 cases of the exact address "3559 brendalee rd, westbank, bc" in the 695k sample of health addresses? Or did you take a sample of 56 addresses from the 695k and find that one case?

These are all good cases, though I would point out that "highway3a" is actually relevant to a different issue, about missing spaces between the street name and type. I generally find that more difficult to fix, but in this specific case it might be easier to deal with because it is a numeric street name.

bstratto commented 3 years ago

Hi @cmhodgson, the numbers are the count of addresses within the 695K that use that glued/unglued word. The address provided is one example of an address found with the word, but there are others with different civic numbers, etc. For example, for the glued word "BRENDALEE", there are 56 addresses within the 695K that use this combined word. One is the example in the list; others have different numbers, like "3557 BRENDALEE RD, WESTBANK, BC" and "3624 BRENDALEE RD, KELOWNA, BC".

cmhodgson commented 3 years ago

Thanks @bstratto, that makes sense, it wasn't clear to me for some reason at first look.

mraross commented 3 years ago

We have disabled this script as it doesn't work sufficiently well.

alixcote commented 1 year ago

Resolved in Geocoder 4.2. Closing.

bcgov / ols-geocoder

Add support for glued and unglued words in site, street, and locality name matching #15