ClickHouse / ClickHouse

ClickHouse® is a real-time analytics DBMS
https://clickhouse.com
Apache License 2.0

Optimize GROUP BY with injective functions of order key prefix. #2914

Open VladRassokhin opened 6 years ago

VladRassokhin commented 6 years ago

Given the table definition

CREATE TABLE t
(
  A UInt32,
  B UInt32,
  C UInt32,
  D UInt32
)
ENGINE = MergeTree()
PARTITION BY toUInt32(A / 128)
ORDER BY (A, C, D)
SETTINGS index_granularity = 8192

and ~160bn rows:

select sum(rows),formatReadableSize(sum(bytes)), formatReadableSize(sum(marks_size)), formatReadableSize(sum(primary_key_bytes_in_memory)), count() from system.parts where active = 1 and table = 't'
162573124579    252.25 GiB      1.18 GiB        227.11 MiB      133

Requests for field A are quite slow:

select A from t group by A into outfile 'test1';
  3060 rows in set. Elapsed: 117.517 sec. Processed 162.57 billion rows, 650.29 GB (1.38 billion rows/s., 5.53 GB/s.)
select distinct A from t into outfile 'test2';
  3060 rows in set. Elapsed: 135.116 sec. Processed 162.57 billion rows, 650.29 GB (1.20 billion rows/s., 4.81 GB/s.) 
select * from system.asynchronous_metrics where metric like '%ark%' FORMAT TabSeparatedWithNames;
  metric  value
  MarkCacheFiles  1568
  MarkCacheBytes  348396624

It seems that this query could be optimized by checking whether the first component of the ORDER BY tuple changes between adjacent marks.
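The idea above can be sketched outside of ClickHouse. This is a hypothetical Python simulation (not ClickHouse code): `marks` stands in for the sparse primary-key index (the first value of column A in each granule), and a granule is only "read from disk" when its opening mark differs from the next one, i.e. when the value might change inside it.

```python
# Hypothetical sketch simulating the proposed optimization: answer
# "which distinct values does the first ORDER BY column have?" using
# the sparse index, reading a granule only when its boundary marks
# show the value may change inside it.

INDEX_GRANULARITY = 4  # toy value; ClickHouse defaults to 8192

def distinct_prefix_values(marks, read_granule):
    """Collect distinct values of the first ORDER BY column.

    marks[i] is the first value of the column in granule i (the data
    is sorted by that column). If marks[i] == marks[i+1], the whole
    granule i must hold that single value, so it can be skipped.
    """
    seen = set()
    reads = 0
    for i, mark in enumerate(marks):
        next_mark = marks[i + 1] if i + 1 < len(marks) else None
        if mark == next_mark:
            # Sorted data: the entire granule holds `mark`.
            seen.add(mark)
        else:
            # Value may change inside this granule; read it.
            reads += 1
            seen.update(read_granule(i))
    return seen, reads

# Toy data: 4 granules of a sorted column.
granules = [[1, 1, 1, 1], [1, 1, 2, 2], [2, 2, 2, 2], [2, 3, 3, 3]]
marks = [g[0] for g in granules]  # [1, 1, 2, 2]

values, reads = distinct_prefix_values(marks, lambda i: granules[i])
print(sorted(values), reads)  # [1, 2, 3] 2 -- granules 0 and 2 skipped
```

Only 2 of the 4 granules are touched; on the 160bn-row table above, long runs of identical A values would make the saving far larger.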

VladRassokhin commented 6 years ago

Here's a script to reproduce:

-- Assume uniq(S) = 3000 in the original data
CREATE DATABASE IF NOT EXISTS test;
CREATE TABLE IF NOT EXISTS test.data (
  S UInt32,
  T UInt32,
  C UInt32,
  M UInt32
)
  ENGINE = MergeTree() PARTITION BY toUInt32(S / 128) ORDER BY (S, C, M);

INSERT INTO test.data (S, T, C, M) SELECT toUInt32(number / 500000),0,0,0 FROM system.numbers LIMIT 1500000000;
select uniq(S) from test.data;
-- 1 rows in set. Elapsed: 2.505 sec. Processed 1.50 billion rows, 6.00 GB (598.84 million rows/s., 2.40 GB/s.)

Results are the same for versions 18.4.0 (docker tag 18.1.0), 18.10.3 (docker tag 18.10.3), and 1.1.54394.

filimonov commented 6 years ago

In other words, you're proposing to assume that a whole data block has the same value of a column (which is part of the PK) if consecutive values of that column in the index are equal.

I.e. if one value of the PK in the sparse index is A and the next is also A, just assume that the whole block has the value A repeated 8192 times (or index_granularity times), without reading the block from disk?

VladRassokhin commented 6 years ago

I.e. if one value of the PK in the sparse index is A and the next is also A, just assume that the whole block has the value A repeated 8192 times (or index_granularity times), without reading the block from disk?

Correct, sounds reasonable. AFAIK for now it doesn't work even for a simple one-column primary key.

Imaskar commented 6 years ago

This kind of reasoning could apply to the partitioning key, but not the primary key. With the PK it would lead to missing values (a block gets skipped that shouldn't have been).

VladRassokhin commented 6 years ago

@Imaskar index_granularity applies to the primary key, so if we're not skipping parts (which is based on the partitioning key), but only blocks whose adjacent index entries are equal, there would be no missed values.

Unfortunately both Partitioning and Primary start with P, so there's some confusion about what PK means :)

filimonov commented 4 years ago

When a table is ordered by (k1, k2, k3, ...), GROUP BY k1 can use something like read_in_order and immediately send a group to the client once the k1 value changes (much more memory efficient, faster results).
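That streaming behaviour can be sketched with a hypothetical Python simulation (the function name and toy rows are mine, not ClickHouse internals): because the input arrives sorted by k1, each group is complete the moment k1 changes, so it can be emitted immediately instead of being accumulated in a hash table for the whole scan.

```python
# Hypothetical sketch of read_in_order-style GROUP BY: rows arrive
# sorted by the grouping key, so each group can be emitted as soon as
# the key changes, keeping only one group in memory at a time.
from itertools import groupby

def streaming_group_sums(sorted_rows):
    """Yield (k1, sum_of_values) as each group closes.

    `sorted_rows` is an iterable of (k1, value) pairs already ordered
    by k1 -- the precondition that ORDER BY (k1, k2, ...) provides.
    """
    for key, group in groupby(sorted_rows, key=lambda r: r[0]):
        yield key, sum(v for _, v in group)

rows = [(1, 10), (1, 5), (2, 7), (3, 1), (3, 2)]
print(list(streaming_group_sums(rows)))  # [(1, 15), (2, 7), (3, 3)]
```

The first result, (1, 15), is available as soon as the row (2, 7) is seen, without waiting for the scan to finish.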

Also, there is one more optimization mentioned here: NOT reading primary key column file ranges if the value can be deduced from the PK. Maybe we should create one more task for that. :)

alexey-milovidov commented 4 years ago

This optimization is implemented by @dimarub2000 and prepared for production quality by @CurtizJ. There are a few more things to consider:
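As an aside, on servers where the in-order aggregation path is available, it can be toggled explicitly. A sketch against the table from the original report, assuming the optimize_aggregation_in_order setting exists in your version:

```sql
-- Sketch: request the in-order aggregation path for the original query.
-- Assumes a server version that supports optimize_aggregation_in_order.
SELECT A FROM t GROUP BY A
SETTINGS optimize_aggregation_in_order = 1;
```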

alexey-milovidov commented 4 years ago

@CurtizJ is on vacation, he will clarify after 20th August.

filimonov commented 3 years ago

Do we want to make further improvements here? Or should we close it?

alexey-milovidov commented 3 years ago

Yes:

alexey-milovidov commented 3 years ago

This task is not finished.