ADBond / splinkclickhouse

Allows Clickhouse to be used as the execution engine for Splink
MIT License

Term frequency adjustments don't work with `ClickhouseAPI` #31

Open ADBond opened 2 months ago

ADBond commented 2 months ago

With TF adjustments you get an error such as `DB::Exception: Identifier 'l.city' cannot be resolved from subquery with name l`.


Causes

The issue stems from the way that column names are generated from `*`-expressions in Clickhouse when a query contains multiple joins. In such cases column names seem to have their origin table prepended, even when there is no possibility of ambiguity. This is not the behaviour Splink expects (nor the behaviour other backends exhibit), so the columns Splink refers to in subsequent queries do not exist (they are named differently in the Clickhouse case), and the query fails.

I haven't checked in detail, but presumably the reason behaviour differs between the two engines (and with debug mode) is that column resolution works differently in chdb, and also differs depending on whether the expression is in a CTE or not.

Here is a query that demonstrates the issue (and has the shape of the query on which Splink fails):

```sql
WITH __splink__df_concat AS (
    SELECT
        arrayJoin(['london', 'bristol', 'brighton']) AS city,
        arrayJoin(['l@nd.on', 'br@st.ol', 'br@ght.on']) AS email
),
__splink__df_tf_city AS (
    SELECT
        arrayJoin(['london', 'bristol', 'brighton']) AS city,
        arrayJoin([10, 1, 0.5]) AS tf_city
),
__splink__df_tf_email AS (
    SELECT
        arrayJoin(['l@nd.on', 'br@st.ol', 'br@ght.on']) AS email,
        arrayJoin([2, 1, 0.8]) AS tf_email
)
SELECT
    __splink__df_concat.*,
    __splink__df_tf_city."tf_city",
    __splink__df_tf_email."tf_email"
FROM
    __splink__df_concat
    LEFT JOIN __splink__df_tf_city ON __splink__df_concat."city" = __splink__df_tf_city."city"
    LEFT JOIN __splink__df_tf_email ON __splink__df_concat."email" = __splink__df_tf_email."email"
```

The resulting table has columns `__splink__df_concat.city`, `__splink__df_concat.email`, `tf_city` and `tf_email`. If we remove the references to email (as would happen if we only had a single tf adjustment) we would instead have columns `city` and `tf_city` - i.e. there is no 'disambiguating' column prefix, and so the Splink queries go through unimpeded.
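For comparison, here is a minimal single-join variant (adapted from the query above, with the email references removed); the output columns of this one come back as plain `city` and `tf_city`:

```sql
WITH __splink__df_concat AS (
    SELECT
        arrayJoin(['london', 'bristol', 'brighton']) AS city
),
__splink__df_tf_city AS (
    SELECT
        arrayJoin(['london', 'bristol', 'brighton']) AS city,
        arrayJoin([10, 1, 0.5]) AS tf_city
)
SELECT
    __splink__df_concat.*,
    __splink__df_tf_city."tf_city"
FROM
    __splink__df_concat
    LEFT JOIN __splink__df_tf_city ON __splink__df_concat."city" = __splink__df_tf_city."city"
```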

Reprex

This is a minimal example showing the actual failure:

```python
import clickhouse_connect
import splink.comparison_library as cl
from splink import Linker, SettingsCreator, block_on, splink_datasets

from splinkclickhouse import ClickhouseAPI

df = splink_datasets.fake_1000

conn_atts = {
    "host": "localhost",
    "port": 8123,
    "username": "splinkognito",
    "password": "splink123!",
}

db_name = "__temp_splink_db"

default_client = clickhouse_connect.get_client(**conn_atts)
default_client.command(f"CREATE DATABASE IF NOT EXISTS {db_name}")
client = clickhouse_connect.get_client(
    **conn_atts,
    database=db_name,
)

db_api = ClickhouseAPI(client)

settings = SettingsCreator(
    link_type="dedupe_only",
    comparisons=[
        cl.JaroWinklerAtThresholds("first_name").configure(
            term_frequency_adjustments=True
        ),
        cl.DateOfBirthComparison(
            "dob",
            input_is_string=True,
        ),
        cl.DamerauLevenshteinAtThresholds("city").configure(
            term_frequency_adjustments=True
        ),
    ],
    blocking_rules_to_generate_predictions=[
        block_on("first_name", "dob"),
        block_on("surname"),
    ],
)

db_api.delete_tables_created_by_splink_from_db()

linker = Linker(df, settings, db_api)

linker.inference.predict(threshold_match_weight=-5)
```

or similarly for chdb with debug mode enabled:

```python
import splink.comparison_library as cl
from chdb import dbapi
from splink import Linker, SettingsCreator, block_on, splink_datasets

from splinkclickhouse import ChDBAPI

df = splink_datasets.fake_1000

db_api = ChDBAPI(dbapi.connect())

settings = SettingsCreator(
    link_type="dedupe_only",
    comparisons=[
        cl.JaroWinklerAtThresholds("first_name").configure(
            term_frequency_adjustments=True
        ),
        cl.DateOfBirthComparison(
            "dob",
            input_is_string=True,
        ),
        cl.DamerauLevenshteinAtThresholds("city").configure(
            term_frequency_adjustments=True
        ),
    ],
    blocking_rules_to_generate_predictions=[
        block_on("first_name", "dob"),
        block_on("surname"),
    ],
)

db_api.delete_tables_created_by_splink_from_db()

linker = Linker(df, settings, db_api)
db_api.debug_mode = True

linker.inference.predict(threshold_match_weight=-5)
```

Fixing

This is a slightly deeper issue than some others that have caused SQL execution to fail, so simple string replacement in the SQL is probably not a goer. We might be able to do something fancier with it, or parse the SQL into an AST and deal with it at that level, but that brings complexity and fragility of its own.
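For what it's worth, a rough sketch of the AST route using sqlglot (hypothetical - sqlglot is not a dependency of this package, and this assumes its qualify pass can expand the `*` from the CTE definitions):

```python
# Hypothetical sketch: expand `*` into explicit aliased columns before
# sending the SQL to Clickhouse, so the output column names are fixed
# regardless of how the engine expands the star.
import sqlglot
from sqlglot.optimizer.qualify import qualify


def expand_stars(sql: str) -> str:
    tree = sqlglot.parse_one(sql, read="clickhouse")
    # qualify() rewrites `table.*` into explicit `table.column AS column`
    # projections, using the column lists it can infer from the CTEs
    # (possible here because every CTE column is explicitly aliased)
    qualified = qualify(tree, dialect="clickhouse")
    return qualified.sql(dialect="clickhouse")
```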

Probably the neatest solution would be a tweak to the SQL generated upstream to make these column names explicit (i.e. selecting columns by name rather than via `*`), as long as this doesn't have any wider impact.
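For illustration, a sketch of what the explicit version of the query above might look like (assuming the columns of `__splink__df_concat` are known at SQL-generation time):

```sql
SELECT
    __splink__df_concat."city" AS city,
    __splink__df_concat."email" AS email,
    __splink__df_tf_city."tf_city",
    __splink__df_tf_email."tf_email"
FROM
    __splink__df_concat
    LEFT JOIN __splink__df_tf_city ON __splink__df_concat."city" = __splink__df_tf_city."city"
    LEFT JOIN __splink__df_tf_email ON __splink__df_concat."email" = __splink__df_tf_email."email"
```

With the aliases spelled out, the output columns are `city`, `email`, `tf_city` and `tf_email` no matter how the engine would have expanded `*`.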