RobinL / uk_address_matcher


Ideas to improve accuracy #1

Open RobinL opened 5 months ago

RobinL commented 3 months ago

The biggest categories of errors I've seen are:

1. Two addresses both get scored the same when one is better, usually because they both fall into the same `token_rel_freq_arr_l` bin despite one having a slightly better score. Since it picks one of the two at random, it gets it wrong half the time (a possible mitigation is sketched below).
2. An issue with the model weights means it gets it wrong because the model was trained on too little data (see attached image).
3. The postcode is incorrect in the EPC data, so a match is found but a match on the true postcode is never evaluated.
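On (1), one cheap mitigation would be a deterministic tie-break: rather than picking one of the tied candidates at random, order them by a secondary measure and take the top row. A minimal sketch in DuckDB, where the `candidates` table and the `unique_id_l`, `match_weight`, `address_concat_l` and `address_concat_r` names are all hypothetical rather than the repo's actual schema:

import duckdb

# keep only the best candidate per left-hand record, breaking score ties
# with a cheap string distance instead of an arbitrary pick
sql = """
select * from (
    select
        *,
        row_number() over (
            partition by unique_id_l
            order by
                match_weight desc,
                levenshtein(address_concat_l, address_concat_r)
        ) as rn
    from candidates
)
where rn = 1
"""
best_match_per_record = duckdb.sql(sql).df()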

So:

RobinL commented 3 months ago

Note that adding more levels doesn't have a huge impact on compute time:

import time

import duckdb

df_for_test = df_predict_2.as_pandas_dataframe()

common_sql = """
list_reduce(
        list_prepend(
            1.0,
            list_transform(
                array_filter(
                    token_rel_freq_arr_l,
                    y -> array_contains(
                        array_intersect(
                            list_transform(token_rel_freq_arr_l, x -> x.tok),
                            list_transform(token_rel_freq_arr_r, x -> x.tok)
                        ),
                        y.tok
                    )
                ),
                x -> x.rel_freq
            )
        ),
        (p, q) -> p * q
    )
    *
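    -- factor 2: for each token that appears on only one side, divide the
    -- running score by that token's rel_freq^0.33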
    list_reduce(
        list_prepend(
            1.0,
            list_transform(
                list_concat(
                    array_filter(
                        token_rel_freq_arr_l,
                        y -> not array_contains(
                            list_transform(token_rel_freq_arr_r, x -> x.tok),
                            y.tok
                        )
                    ),
                    array_filter(
                        token_rel_freq_arr_r,
                        y -> not array_contains(
                            list_transform(token_rel_freq_arr_l, x -> x.tok),
                            y.tok
                        )
                    )
                ),
                x -> x.rel_freq
            )
        ),
        (p, q) -> p / q^0.33
    )
    """

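# one WHEN clause per level; range(1, 2) generates a single level, while
# range(1, 41) was used for the 40-level timing quoted below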
case_stmt = "\n".join([f"when {common_sql} > 1e-{i} then {i}" for i in range(1, 2)])

start_time = time.time()
sql = f"""
select
    case
    {case_stmt}
    end as thing
    from df_for_test

"""
display(duckdb.sql(sql).df())
end_time = time.time()
elapsed_time = end_time - start_time
print(f"Elapsed time: {elapsed_time}")

1.07 seconds for 1 level; 3.59 seconds for 40 levels.

Nonetheless, you could get quite a big speed boost by computing the score expression once and using it many times, along the lines of the sketch below.
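A minimal sketch of that idea, reusing the `df_for_test` and `common_sql` variables from above; the `scored` CTE and `match_score` column names are hypothetical:

# compute the expensive expression once per row, then compare the
# resulting column against each threshold instead of re-evaluating it
case_stmt = "\n".join(f"when match_score > 1e-{i} then {i}" for i in range(1, 41))

sql = f"""
with scored as (
    select *, ({common_sql}) as match_score
    from df_for_test
)
select
    case
    {case_stmt}
    end as thing
from scored
"""
display(duckdb.sql(sql).df())

Whether the CTE is computed once or inlined per reference is up to DuckDB's optimiser, so materialising `match_score` into a temporary table first would guarantee a single computation.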