With TF adjustments you get an error such as:
DB::Exception: Identifier 'l.city' cannot be resolved from subquery with name l
More specifically:
you can have one tf-adjusted comparison, but no more
this also applies to ChDBAPI when debug_mode is switched on
Causes
The issue stems from the way that ClickHouse generates column names from *-expressions in queries with multiple joins. In such cases column names appear to have their origin table prepended, even when there is no possibility of ambiguity. Splink expects the unprefixed names, which is what the other backends produce, so the columns Splink refers to in subsequent queries do not exist (they are named differently in the ClickHouse case), and the query fails.
I haven't checked in detail, but presumably the difference between the two engines (and the effect of debug mode) arises because column resolution works differently in chdb, and also differs depending on whether the expression is in a CTE or not.
Here is a query that demonstrates the issue (and is the shape of the query that Splink fails at):
WITH __splink__df_concat AS (
SELECT
arrayJoin(['london', 'bristol', 'brighton']) AS city,
arrayJoin(['l@nd.on', 'br@st.ol', 'br@ght.on']) AS email
),
__splink__df_tf_city AS (
SELECT
arrayJoin(['london', 'bristol', 'brighton']) AS city,
arrayJoin([10, 1, 0.5]) AS tf_city
),
__splink__df_tf_email AS (
SELECT
arrayJoin(['l@nd.on', 'br@st.ol', 'br@ght.on']) AS email,
arrayJoin([2, 1, 0.8]) AS tf_email
)
SELECT
__splink__df_concat.*,
__splink__df_tf_city."tf_city",
__splink__df_tf_email."tf_email"
FROM
__splink__df_concat
LEFT JOIN __splink__df_tf_city ON __splink__df_concat."city" = __splink__df_tf_city."city"
LEFT JOIN __splink__df_tf_email ON __splink__df_concat."email" = __splink__df_tf_email."email"
The resulting table has columns __splink__df_concat.city, __splink__df_concat.email, tf_city and tf_email. If we remove the references to email (as would happen if we only had a single tf adjustment) we would instead have columns city, tf_city - i.e. there is no 'disambiguating' column prefix, and so the Splink queries go through unimpeded.
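As a small illustration of the mismatch, the sketch below takes the two sets of column names reported above and picks out the ones that carry a table prefix. The helper qualified_columns is hypothetical (it is not part of Splink or ClickHouse), and the column lists are copied from the description above rather than fetched from a live query.

```python
def qualified_columns(column_names, table_names):
    """Return the columns whose names carry a 'table.' prefix."""
    prefixes = tuple(f"{t}." for t in table_names)
    return [c for c in column_names if c.startswith(prefixes)]


tables = ["__splink__df_concat", "__splink__df_tf_city", "__splink__df_tf_email"]

# Column names reported above for the two-join query.
two_joins = ["__splink__df_concat.city", "__splink__df_concat.email", "tf_city", "tf_email"]
# Column names reported above for the single-join variant.
one_join = ["city", "tf_city"]

print(qualified_columns(two_joins, tables))
print(qualified_columns(one_join, tables))
```

Only the columns that came from the *-expression are affected in the two-join case; the explicitly selected tf_city and tf_email keep their plain names.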
Reprex
This is a minimal example showing the actual failure:
or similarly for chdb + debug:
Fixing
This is a slightly deeper issue than some others that have caused SQL execution to fail, and so simple string-replacement in the SQL is probably not a goer. We might be able to do something fancier with it, or parse the SQL into an AST and deal with it at that level, but that:
may not really be possible
would probably be overly complex
would probably be a pretty brittle solution
Probably the neatest solution would be to make a tweak to the SQL upstream to make these column names explicit, as long as this doesn't have any wider impact.
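A rough sketch of what "making the column names explicit" could look like: instead of emitting __splink__df_concat.*, the SQL generator would list each column with an explicit alias. The helper explicit_select below is hypothetical and is not how Splink currently builds this query; it also assumes the column list is known at the point the SQL is generated. I have not run the resulting query against ClickHouse or chdb, so whether this fully avoids the prefixing is still to be confirmed.

```python
def explicit_select(table, columns):
    """Build 'table."col" AS "col"' items to use in place of 'table.*'."""
    return ", ".join(f'{table}."{col}" AS "{col}"' for col in columns)


# Columns of __splink__df_concat in the demonstration query above.
select_list = explicit_select("__splink__df_concat", ["city", "email"])

sql = f"""
SELECT
    {select_list},
    __splink__df_tf_city."tf_city",
    __splink__df_tf_email."tf_email"
FROM
    __splink__df_concat
LEFT JOIN __splink__df_tf_city ON __splink__df_concat."city" = __splink__df_tf_city."city"
LEFT JOIN __splink__df_tf_email ON __splink__df_concat."email" = __splink__df_tf_email."email"
"""

print(sql)
```

The explicit aliases should pin the output names to city and email regardless of how the engine names columns that come from a *-expression, which is the behaviour the downstream Splink queries rely on.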