ClickHouse / ClickHouse

ClickHouse® is a real-time analytics DBMS
https://clickhouse.com
Apache License 2.0
37.03k stars 6.83k forks source link

Invalid JOIN_ON condition #40976

Open miroha opened 2 years ago

miroha commented 2 years ago

I'm new with CH and I'm trying to JOIN two tables. My first attemp was:

select 
  u.counter_id, 
  r.date_of_visit, 
  sum(r.sessions) as sessions, 
  sum(r.pageviews) as pageviews, 
  u.utm_campaign, 
  u.utm_source, 
  u.utm_medium
from 
  connectors_yandex_metrika.utm_for_collect u 
  inner join connectors_yandex_metrika.utm_sessions_report r on u.counter_id = r.counter_id 
  and (
    u.utm_campaign = r.utm_campaign 
    or u.utm_campaign IS NULL
  ) 
  and (
    u.utm_source = r.utm_source 
    or u.utm_source IS NULL
  ) 
  and (
    u.utm_medium = r.utm_medium 
    or u.utm_medium IS NULL
  ) 
group by 
  u.counter_id, 
  r.date_of_visit, 
  u.utm_campaign, 
  u.utm_source, 
  u.utm_medium

I received an error: DB::Exception: Unsupported JOIN ON conditions. Unexpected '(utm_campaign = r.utm_campaign) OR (utm_campaign IS NULL)': While processing (utm_campaign = r.utm_campaign) OR (utm_campaign IS NULL). (INVALID_JOIN_ON_EXPRESSION) (version 22.7.2.15 (official build))

Then I decided to use coalesce:

select 
u.counter_id, 
u.utm_campaign,
u.utm_source,
u.utm_medium,
r.date_of_visit,
       sum(r.sessions) as sessions,
       sum(r.pageviews) as pageviews
  from connectors_yandex_metrika.utm_for_collect u 
       inner join connectors_yandex_metrika.utm_sessions_report r 
         on (u.counter_id, 
             coalesce(u.utm_campaign, r.utm_campaign),
             coalesce(u.utm_source, r.utm_source),
             coalesce(u.utm_medium, r.utm_medium)
            ) = (r.counter_id, r.utm_campaign, r.utm_source, r.utm_medium)
 group by u.counter_id, 
u.utm_campaign,
u.utm_source,
u.utm_medium
r.date_of_visit

Error: DB::Exception: Invalid columns in JOIN ON section. Columns counter_id and r.utm_campaign are from different tables.: While processing (counter_id, coalesce(utm_campaign, r.utm_campaign), coalesce(utm_source, r.utm_source), coalesce(utm_medium, r.utm_medium)) = (r.counter_id, r.utm_campaign, r.utm_source, r.utm_medium). (INVALID_JOIN_ON_EXPRESSION) (version 22.7.2.15 (official build))

And my last attempt was:

with r as (
  select 
    counter_id, 
    date_of_visit, 
    sum(sessions) as sessions, 
    sum(pageviews) as pageviews, 
    coalesce(utm_campaign, '') as utm_campaign, 
    coalesce(utm_source, '') as utm_source, 
    coalesce(utm_medium, '') as utm_medium
  from 
    connectors_yandex_metrika.utm_sessions_report
  group by 
    counter_id, 
    date_of_visit, 
    cube(5, 6, 7)
) 
select 
  r.* 
from 
  r 
  join connectors_yandex_metrika.utm_for_collect u on r.counter_id = u.counter_id 
  and (
    r.utm_campaign, r.utm_source, r.utm_medium
  ) is not distinct 
from 
  (
    u.utm_campaign, u.utm_source, u.utm_medium
  ) 
where 
  'NA' in (
    r.utm_campaign, r.utm_source, r.utm_medium
  ) is not true

Error: DB::Exception: Syntax error: failed at position 589 ('distinct') (line 24, col 12): distinct from ( u.utm_campaign, u.utm_source, u.utm_medium ) where 'NA' in ( r.utm_campaign, r.utm_source, r.utm_medium ) is not tru. Expected one of: NULL, end of query. (SYNTAX_ERROR) (version 22.7.2.15 (official build))

All these queries work fine on PostgreSQL. Any tips how to adapt this to CH?

vdimir commented 2 years ago

You can try use FULL join with WHERE:

select ...
from 
  connectors_yandex_metrika.utm_for_collect u
FULL JOIN 
  connectors_yandex_metrika.utm_sessions_report r
ON u.counter_id = r.counter_id 
WHERE
  (
    u.utm_campaign = r.utm_campaign 
    or u.utm_campaign IS NULL
  ) 
  and (
    u.utm_source = r.utm_source 
    or u.utm_source IS NULL
  ) 
  and (
    u.utm_medium = r.utm_medium 
    or u.utm_medium IS NULL
  ) 

It's a cross join of all rows with the same counter_id, so it can be slower, but if the amount of records with non-unique counter_id is not so much, it will work.