Closed den-crane closed 1 week ago
The same issue with CollapsingMergeTree
drop table if exists tp;
create table tp (type Int32, eventcnt UInt64, sign Int8,
projection p (select sum(eventcnt), type group by type))
engine = CollapsingMergeTree(sign) order by type;
insert into tp select number%3, 1,1 from numbers(3);
insert into tp select number%3, 1,-1 from numbers(3);
insert into tp select number%3, 2,1 from numbers(3);
optimize table tp final;
set allow_experimental_projection_optimization = 0, force_optimize_projection = 0;
select sum(eventcnt) eventcnt, type
from tp
group by type
┌─eventcnt─┬─type─┐
│ 2 │ 0 │
│ 2 │ 2 │
│ 2 │ 1 │
└──────────┴──────┘
set allow_experimental_projection_optimization = 1, force_optimize_projection = 1;
select sum(eventcnt) eventcnt, type
from tp
group by type
┌─eventcnt─┬─type─┐
│ 4 │ 0 │
│ 4 │ 2 │
│ 4 │ 1 │
└──────────┴──────┘
It's a known limitation (or perhaps an unexpected feature). But at least we should disable projection materialization for those MergeTree variants.
Similar problem occurs when using FINAL Keyword on query time:
CREATE TABLE ProjTest2
(
a_int UInt8,
b_int UInt16,
PROJECTION prj1 (SELECT a_int, max(b_int) AS max_b, count(*) AS count_total GROUP BY a_int)
) ENGINE=ReplacingMergeTree ORDER BY (a_int, b_int);
INSERT INTO ProjTest2
SELECT *
FROM generateRandom('a_int UInt8, b_int UInt16', NULL, 10)
LIMIT 50000000;
Without Projection:
SET allow_experimental_projection_optimization = 0, force_optimize_projection = 0;
SELECT
count(*),
a_int
FROM ProjTest2
GROUP BY a_int
ORDER BY count() DESC
LIMIT 3
Query id: 1bb133d8-3bc0-4882-aedc-e6a91b584e93
┌─count()─┬─a_int─┐
│ 129430 │ 188 │
│ 129369 │ 212 │
│ 129331 │ 246 │
└─────────┴───────┘
3 rows in set. Elapsed: 0.032 sec. Processed 32.97 million rows, 32.97 MB (1.02 billion rows/s., 1.02 GB/s.)
SELECT
count(*),
a_int
FROM ProjTest2
FINAL
GROUP BY a_int
ORDER BY count() DESC
LIMIT 3
Query id: a4dedc44-4595-45c2-a75e-9d57620ca37c
┌─count()─┬─a_int─┐
│ 62402 │ 112 │
│ 62347 │ 87 │
│ 62340 │ 150 │
└─────────┴───────┘
3 rows in set. Elapsed: 1.093 sec. Processed 32.97 million rows, 164.84 MB (30.16 million rows/s., 150.81 MB/s.)
With Projection enabled:
SET allow_experimental_projection_optimization = 1, force_optimize_projection = 1;
SELECT
count(*),
a_int
FROM ProjTest2
FINAL
GROUP BY a_int
ORDER BY count() DESC
LIMIT 3
Query id: f2d7daaf-9e5f-49fe-af6c-27634f758295
┌─count()─┬─a_int─┐
│ 2928 │ 59 │
│ 2901 │ 92 │
│ 2880 │ 178 │
└─────────┴───────┘
3 rows in set. Elapsed: 0.012 sec. Processed 2.05 thousand rows, 51.20 KB (164.19 thousand rows/s., 4.10 MB/s.)
As seen count() is way to small when using final on query time
As seen count() is way to small when using final on query time
Hmm, the result of select final
is indeed strange. I'll take a look. But in general we should not use projections in ReplacingMergeTree for now
As select final
will almost everytime need to scan multiple parts, i'd suggest that when the final keyword is used, the optimizer should always ignore aggregated projections, and just consider main table or projections with different sortings only. At least for any partition with multiple parts.
As for the original topic. i would not disable projections for Replacing/Collapsing mergetree (its the one we use most often currently), but rather 'force' an rebuild of the projection data whenever two parts are merged.
btw: very cool feature amos!
Can i use projection on replacingMergeTree now ? is this solved ?
Is there any plan on fixing this issue yet? BTW my workaround is after insertion, do optimize final, clear projection and materialize projection on that partition to rebuild it.
Hey guys, looks like we have the same issue with MergeTree
select version()
┌─version()─┐
│ 22.8.6.71 │
└───────────┘
create table tp (type Int32, eventcnt UInt64,
projection p (select sum(eventcnt), type group by type))
engine = MergeTree order by type;
insert into tp select number%3, 1 from numbers(3);
insert into tp select number%3, 2 from numbers(3);
optimize table tp final deduplicate by type;
set allow_experimental_projection_optimization = 0, force_optimize_projection = 0;
select sum(eventcnt) eventcnt, type
from tp
group by type
┌─eventcnt─┬─type─┐
│ 1 │ 0 │
│ 1 │ 2 │
│ 1 │ 1 │
└──────────┴──────┘
set allow_experimental_projection_optimization = 1, force_optimize_projection = 1;
select sum(eventcnt) eventcnt, type
from tp
group by type
┌─eventcnt─┬─type─┐
│ 2 │ 0 │
│ 2 │ 2 │
│ 2 │ 1 │
└──────────┴──────┘
explain select sum(eventcnt) eventcnt, type
from tp
group by type
┌─explain────────────────────────────────────────────────────┐
│ Expression ((Projection + Before ORDER BY)) │
│ ReadFromStorage (MergeTree(with Aggregate projection p)) │
└────────────────────────────────────────────────────────────┘
@amosbird Any news on fixing this bug sometime in the future? I lately had a customer falling in this trap again.
How i see things (unfortunatelly not a c++ developer), there should be a trigger added for recalculating the projections when merges occur in ReplacingMergeTree.
Interesting fact, if i run a OPTIMIZE TABLE xyz FINAL
then it does not recalculate the projections based on the newly merged dataparts, but if i run a ALTER TABLE xyz UPDATE col1=col1 WHERE 1=1
, then it does update the projection data.
So i think it's really just a trigger missing to fire when data parts are merged
I have just found this issue the hard way after getting inconsistent results from my queries depending on the use of a projection.
I must say that finding this information only in the issues section of the project and not in the documentation - neither of projections nor of the CollapsingMergeTree - is not very user friendly. ☹️ Especially since this limitation is known for more than 2 years...
@beda42 documentation is open sourced in https://github.com/ClickHouse/ClickHouse/tree/master/docs/ don't be shy to make pull request
Let's implement a similar setting as described for lightweight mutations here.
A setting with the name deduplicating_merge_projection_mode
(or something similar) with options:
throw
(by default). Do not allow creating a projection for the table with Replacing\Collapsing\VersionedCollapsing\Summing\Aggregating
engines. (In other words, allow only classic MergeTree\ReplicatedMergeTree
). Do not allow OPTIMIZE DEDUPLICATE
(for all engines). Merges should not throw any exceptions by themselves but work as a drop
option. Later, allow some cases where projection can be updated.drop
. Do not throw any exception, but drop the projection in case of merge. Currently, it would be pretty useless (only 0-level parts would have projections). But it is useful in OPTIMIZE DEDUPLICATE
for classic MergeTree\ReplicatedMergeTree
. Also, it will be more useful if some projections can be kept (in the future).rebuild
(later). As implemented in https://github.com/ClickHouse/ClickHouse/pull/62364It's funny that throw
is applied in CREATE PROJECTION
, but works as drop
for merges. It can be a separate setting, but right now I think it's better to keep it in the same enum (the option throw + rebuild
seems inconsistent to me).