Open pjain1 opened 2 months ago
interesting; I think these things could be supported in general with some kind of semijoin
however for the above query I think there is an alterante way to write the query: instead of:
(SELECT "page", sum(added) AS "added" FROM "wikipedia" WHERE "__time" >= '2015-09-12T00:00:00.000Z' AND "__time" < '2015-09-12T01:00:00.000Z' GROUP BY 1 ORDER BY "added" DESC LIMIT 33) base
LEFT OUTER JOIN
(SELECT "page", sum(added) AS "added" FROM "wikipedia" WHERE "__time" >= '2015-09-12T01:00:00.000Z' AND "__time" < '2015-09-12T20:00:00.000Z' GROUP BY 1) comparison
ON
(base."page" IS NOT DISTINCT FROM comparison."page")
a single query could be given to collect both aggregates
SELECT "page",
sum(added) filter (where "__time" >= '2015-09-12T00:00:00.000Z' AND "__time" < '2015-09-12T01:00:00.000Z') as base_added,
sum(added) filter (where "__time" >= '2015-09-12T01:00:00.000Z' AND "__time" < '2015-09-12T20:00:00.000Z') AS comparison_added
FROM "wikipedia" WHERE "__time" >= '2015-09-12T00:00:00.000Z' AND "__time" < '2015-09-12T20:00:00.000Z' GROUP BY 1 ORDER BY "base_added" DESC LIMIT 33
I believe this rewrite could not be done in all cases; but might worth a try; could you check if this would help? or if yes how much?
not sure if the input query could be recognized and rewritten to this one automatically...it doesn't seem impossible; but right now I feel like that would be a bit fragile
Thanks @kgyrtkirk for the inputs, it works really well is cases where metric definition is simple. In case where metric expression is a little involved, filter needs to be added to each agg function for example sum(price) - sum(cost)
would need to become sum(price) FILTER (…) - sum(cost) FILTER (…)
and so on. Also tried a semi-join approach but that does not work well in Druid, planner essentially does an inner join with whole underlying table.
WITH
base AS (
SELECT page, sum(added) as added FROM "wikipedia" WHERE "__time" >= '2016-06-27T00:00:00.000Z' AND "__time" < '2016-06-27T01:00:00.000Z' GROUP BY 1 ORDER BY "added" DESC LIMIT 8
),
comparison AS (
SELECT page, sum(added) AS added FROM "wikipedia" WHERE "__time" >= '2016-06-27T01:00:00.000Z' AND "__time" < '2016-06-27T02:00:00.000Z'
AND (SELECT 1 FROM base WHERE page IS NOT DISTINCT FROM base.page LIMIT 1) = 1
GROUP BY 1
)
SELECT
base.page,
ANY_VALUE(base.added) AS base_added,
ANY_VALUE(comparison.added) AS comparison_added,
ANY_VALUE(base.added - comparison.added) AS added_delta
FROM base LEFT JOIN comparison ON (base.page IS NOT DISTINCT FROM comparison.page) GROUP BY 1 ORDER BY "base_added" DESC LIMIT 8
Somehow FILTER is 2x slower on my datasource. I tried these two:
select "camp",
sum(cnt) filter (where __time >= '2024-06-01' and __time < '2024-06-07') p,
sum(cnt) filter (where __time >= '2024-06-07' and __time < '2024-06-14') n,
sum(cnt) filter (where __time >= '2024-06-07' and __time < '2024-06-14') - sum(cnt) filter (where __time >= '2024-06-01' and __time < '2024-06-07') delta
from
atd_pba where __time >= '2024-06-01' and __time < '2024-06-14'
group by 1
order by delta
limit 10
select p."camp", any_value(p.c), any_value(n.c), any_value(n.c- p.c ) delta from (
select "camp" , sum(cnt) c from atd_pba where __time >= '2024-06-01' and __time < '2024-06-07' group by camp
) p
left join
(
select "camp" , sum(cnt) c from atd_pba where __time >= '2024-06-07' and __time < '2024-06-14' group by camp
) n
on p.camp is not distinct from n.camp
group by 1
order by delta
limit 10
Actually, as cardinality grows FILTER begins to outperform JOIN.
yes; it supposed to plan as a TopN
query - which works a bit differently ;nd iirc it doesn't support vectorization - so in smaller cases it may be slower
I added cardinality by grouping additional dimensions - now it looks like this:
select "camp", "buyer" , "source" ,
sum(cnt) filter (where __time >= '2024-06-01' and __time < '2024-06-07') p,
sum(cnt) filter (where __time >= '2024-06-07' and __time < '2024-06-14') n,
sum(cnt) filter (where __time >= '2024-06-07' and __time < '2024-06-14') - sum(cnt) filter (where __time >= '2024-06-01' and __time < '2024-06-07') delta
from atd_pba
where __time >= '2024-06-01' and __time < '2024-06-14'
group by 1,2,3
order by delta
limit 10
select p."camp", p. "buyer" ,p. "source",any_value(p.c), any_value(n.c), any_value(n.c- p.c ) delta from (
select "camp", "buyer" , "source" , sum(cnt) c from atd_pba where __time >= '2024-06-01' and __time < '2024-06-07' group by 1,2,3
) p
left join
(
select "camp", "buyer" , "source" , sum(cnt) c from atd_pba where __time >= '2024-06-07' and __time < '2024-06-14' group by 1,2,3
) n
on p.camp is not distinct from n.camp and p.buyer is not distinct from n.buyer and p.source is not distinct from n.source
group by 1,2,3
order by delta
limit 10
For a comparison query that compares previous time range to current time range on some metric like below
Druid calculates the base(left) and comparison(right) inner queries and joins them on
page
dimension. However if the dimension is high cardinality, the comparison query might fail withResourceLimitExceededException
as it might exceedmaxSubqueryBytes
limit. Limit cannot be pushed to the comparison query as it might have different topNs than the base one. An inner query selecting topn values from base time range also cannot be used in the where clause of comparison query as if there is anull
value then<NULL> IN NULL
comparison is false sonull
value in comparison query will be ignored.Druid can however compute the base query first and push the join values into the comparison query to limit the comparison query results. Can planner do this optimization or any other ideas ?