RobinL opened 5 months ago
The biggest categories of errors I've seen are:
(1) Two addresses both get the same score when one is better, usually because they both fall into the same `token_rel_freq_arr_l` bin, despite one having a slightly better score. Since a random one of the two is picked, it's wrong half the time.
(2) Problems with the model weights mean the match was scored wrongly, because the model was trained on too little data.
(3) The postcode is incorrect in the EPC data, so a match is found but a match on the true postcode is never evaluated.
So:
Note that adding more levels doesn't have a huge impact on compute time:
```python
import time

import duckdb

df_for_test = df_predict_2.as_pandas_dataframe()

common_sql = """
list_reduce(
    list_prepend(
        1.0,
        list_transform(
            array_filter(
                token_rel_freq_arr_l,
                y -> array_contains(
                    array_intersect(
                        list_transform(token_rel_freq_arr_l, x -> x.tok),
                        list_transform(token_rel_freq_arr_r, x -> x.tok)
                    ),
                    y.tok
                )
            ),
            x -> x.rel_freq
        )
    ),
    (p, q) -> p * q
)
*
list_reduce(
    list_prepend(
        1.0,
        list_transform(
            list_concat(
                array_filter(
                    token_rel_freq_arr_l,
                    y -> not array_contains(
                        list_transform(token_rel_freq_arr_r, x -> x.tok),
                        y.tok
                    )
                ),
                array_filter(
                    token_rel_freq_arr_r,
                    y -> not array_contains(
                        list_transform(token_rel_freq_arr_l, x -> x.tok),
                        y.tok
                    )
                )
            ),
            x -> x.rel_freq
        )
    ),
    (p, q) -> p / q^0.33
)
"""

# range(1, 2) tests a single level; use e.g. range(1, 41) for 40 levels
case_stmt = "\n".join([f"when {common_sql} > 1e-{i} then {i}" for i in range(1, 2)])

start_time = time.time()
sql = f"""
select
    case
        {case_stmt}
    end as thing
from df_for_test
"""
display(duckdb.sql(sql).df())
end_time = time.time()
elapsed_time = end_time - start_time
print(f"Elapsed time: {elapsed_time}")
```
1.07 seconds for 1 level; 3.59 seconds for 40 levels.
Nonetheless, you could get quite a big speed boost by computing the expression once and reusing it many times.
The unusual tokens array etc. should be sorted; at the moment it isn't, leading to missed matches.
Numeric tokens: if there aren't any, put the non-numeric tokens in their own field; that allows you to match them differently from numeric tokens.
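A sketch of the partitioning step, assuming a simple whitespace tokenisation (the helper name is illustrative, not part of the existing pipeline):

```python
import re

def split_numeric_tokens(address: str):
    """Partition address tokens into numeric-ish tokens (anything containing
    a digit, e.g. '12A') and purely alphabetic tokens, so each group can be
    matched with different logic."""
    tokens = address.upper().split()
    numeric = [t for t in tokens if re.search(r"\d", t)]
    non_numeric = [t for t in tokens if not re.search(r"\d", t)]
    return numeric, non_numeric
```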
Find a way of adding a small amount to the match score for an exact match, to disambiguate when two candidates get the same score but one is a closer match.
Maybe use a simple Levenshtein distance on the full address to disambiguate between equal-score matches, as part of an algorithm that labels how distinguishable the match is.
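A minimal sketch of the Levenshtein tie-breaker (`break_tie` is a hypothetical helper, not existing code): among candidates with identical scores, prefer the one with the smallest edit distance to the query address.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,                 # deletion
                curr[j - 1] + 1,             # insertion
                prev[j - 1] + (ca != cb),    # substitution (0 if equal)
            ))
        prev = curr
    return prev[-1]

def break_tie(query: str, tied_candidates: list[str]) -> str:
    """Among equally scored candidates, pick the one whose full address
    string is closest to the query by edit distance."""
    return min(tied_candidates, key=lambda c: levenshtein(query, c))
```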
`HOUSE - LIME` becomes `HOUSE-LIME`, which is not what we want.
One match got missed because of `12A` vs `12/A`, which got cleaned to `12-A`, so they disagreed in the numeric token.

Need some way of checking both the number of matching tokens AND their order.
When looking at token overlap, how do we analyse whether the tokens are in the right order? Filter to overlapping tokens, then ask whether they're in the same order? It's probably best to retain the token before and the token after the numeric token, and check whether both are right, or at least one is.
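The order check above could be sketched like this (both helpers are illustrative, assuming tokenised addresses as lists of strings):

```python
def overlap_in_same_order(tokens_l, tokens_r):
    """Filter both token lists to their shared tokens, then check whether
    the shared tokens appear in the same relative order on both sides."""
    common = set(tokens_l) & set(tokens_r)
    seq_l = [t for t in tokens_l if t in common]
    seq_r = [t for t in tokens_r if t in common]
    return seq_l == seq_r

def neighbours_of_numeric(tokens):
    """Return (token_before, token_after) around the first numeric token,
    so the two sides of a match can be compared on those neighbours."""
    for i, t in enumerate(tokens):
        if any(c.isdigit() for c in t):
            before = tokens[i - 1] if i > 0 else None
            after = tokens[i + 1] if i + 1 < len(tokens) else None
            return before, after
    return None, None
```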
Should we weight tokens near the start of the address more highly than tokens near the end? It's fairly common to have e.g. a missing county name, but that doesn't matter much. Could do this by adding a position index to the relative frequency array `{token:, freq:, position:}` and then using position to weight the score.
Lots of potential options for further data cleaning, e.g. here.
Allow fuzzy matching on tokens in the arrays. Since we're iterating through the tokens anyway, this shouldn't add too much complexity.
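A sketch of fuzzy token overlap using the standard library's `difflib.SequenceMatcher` (the `threshold` value is a hypothetical tuning parameter): count left-side tokens that have an approximate counterpart on the right, so a misspelling like `GROSVENER` still matches `GROSVENOR`.

```python
from difflib import SequenceMatcher

def fuzzy_token_overlap(tokens_l, tokens_r, threshold=0.85):
    """Count tokens on the left with a fuzzy counterpart on the right.
    SequenceMatcher.ratio() returns 1.0 for identical strings and falls
    towards 0.0 as they diverge."""
    matched = 0
    for tl in tokens_l:
        if any(SequenceMatcher(None, tl, tr).ratio() >= threshold for tr in tokens_r):
            matched += 1
    return matched
```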
Is it worth geocoding addresses? Does it add anything, or are the geolocation models as likely to make a mistake as the matching model? Obviously if you have a lat/lng it's worth using, but in its absence, does geocoding from the address alone add anything?
Multiple models: one for flats, one for non-flats.
Obtain a list of tokens which are commonly omitted from addresses and filter them out into their own field, i.e. amongst matching addresses, which tokens differ?
'Common end tokens' is more generally 'common mismatching tokens'; 'c/o' would be one too. Could we curate a list of them?
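One way to bootstrap that curated list, as a sketch (assuming you have a set of labelled matched pairs): among pairs known to refer to the same property, count the tokens that appear on only one side; the most frequent ones (counties, 'C/O', etc.) are candidates.

```python
from collections import Counter

def common_mismatch_tokens(matched_pairs, top_n=10):
    """From pairs of tokenised addresses known to refer to the same
    property, count tokens appearing on only one side (symmetric
    difference) and return the most frequent offenders."""
    counts = Counter()
    for tokens_l, tokens_r in matched_pairs:
        counts.update(set(tokens_l) ^ set(tokens_r))
    return [tok for tok, _ in counts.most_common(top_n)]
```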
Clean `1 A` to `1A`, i.e. merge two adjacent tokens of length 1 (or 2?) into a single token ✅ DONE
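The length-1 version of that cleaning rule can be sketched with a single regex (illustrative, not the pipeline's actual implementation); the word boundaries stop it from touching longer tokens like `1 AVENUE`:

```python
import re

def merge_short_tokens(address: str) -> str:
    """Merge two adjacent single-character tokens into one ('1 A' -> '1A').
    The \\b anchors require each side to be exactly one word character,
    so '1 AVENUE' is left alone."""
    return re.sub(r"\b(\w) (\w)\b", r"\1\2", address)
```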
A missing second number is probably generally not so much a null as a 'does not exist' type thing; there should be a punishment ✅ DONE
Better logic for e.g. inversion of numbers. At the moment, a match on inverted numbers is probably scored too highly. When looking at token inversions on the numeric column, need to account for nearby tokens, i.e. it's inverted, and the token after the numeric suggests it's a genuine inversion ✅ DONE
A 'column' that encodes whether there are the same number of numeric parts of the address on both sides of the match, i.e. it seems relevant whether both sides of the match have the same number of numeric tokens or one has more. A null is not necessarily zero information ✅ DONE
Parse out common tokens near the end of addresses - e.g. counties etc. These tokens can be detected as frequent terms that appear in the last or penultimate token. Maybe want to eliminate these tokens, but irrespective probably don't want to account for them as part of the 'punishment' of non-matching tokens ✅ DONE
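Detecting those frequent end tokens could be sketched as follows (a hypothetical helper, assuming whitespace tokenisation and a simple count threshold): tally tokens appearing in the last or penultimate position across the corpus.

```python
from collections import Counter

def frequent_end_tokens(addresses, min_count=2):
    """Count tokens in the last or penultimate position across a corpus;
    tokens passing the threshold (county names etc.) are candidates for
    parsing out and excluding from the non-match punishment."""
    counts = Counter()
    for addr in addresses:
        toks = addr.upper().split()
        counts.update(toks[-2:])
    return {tok for tok, n in counts.items() if n >= min_count}
```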