duckdblabs / db-benchmark

reproducible benchmark of database-like ops
https://duckdblabs.github.io/db-benchmark/
Mozilla Public License 2.0

suspicious timings for groupby q10 #70

Closed - jangorecki closed this issue 7 months ago

jangorecki commented 7 months ago

groupby q10 is quite a specific query, meant to stress aggregation really well: it groups by all six id columns and computes sum(v3) and a row count.

Looking at the current results, I observe that duckdb and clickhouse deal with this question suspiciously well (0.14s and 0.47s), compared to the remaining solutions (1.56s).

My guess, and what I would like to clarify in this issue, is that those databases keep statistics about the tables and use those statistics to run queries faster. Could you comment on whether that is a likely scenario? For duckdb @Tmonster, and for clickhouse @qoega?

If that is the case, then:

qoega commented 7 months ago

As I understand it, you are talking about both the sorted and unsorted results.

For ClickHouse, there are no additional indexes built.

In the unsorted case, data is stored in insertion order.

LowCardinality columns allow values to be aggregated/compared as integers and then replaced with the actual value from a hash table. The LC mapping is not global - it exists only at the block level, so I do not consider it an index, just a specific data codec. These mappings still need to be merged between blocks (a toy sketch of the idea follows below).
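Not ClickHouse code, just a minimal Python sketch of the per-block dictionary idea: every block keeps its own local string-to-integer mapping, and those mappings have to be merged when blocks are combined. All names (encode_block, merge_blocks) are invented for illustration.

# Toy model of per-block LowCardinality-style encoding; not ClickHouse source.

def encode_block(values):
    """Encode one block with a block-local dictionary."""
    dictionary, index, codes = [], {}, []
    for v in values:
        if v not in index:
            index[v] = len(dictionary)
            dictionary.append(v)
        codes.append(index[v])
    return dictionary, codes

def merge_blocks(blocks):
    """Combine blocks: local codes must be remapped into a shared dictionary.
    This remapping is the 'merge mappings between blocks' cost."""
    global_dict, global_index, merged_codes = [], {}, []
    for dictionary, codes in blocks:
        remap = []
        for v in dictionary:
            if v not in global_index:
                global_index[v] = len(global_dict)
                global_dict.append(v)
            remap.append(global_index[v])
        merged_codes.extend(remap[c] for c in codes)
    return global_dict, merged_codes

b1 = encode_block(["id001", "id002", "id001"])  # local dict: ['id001', 'id002']
b2 = encode_block(["id003", "id001"])           # local dict: ['id003', 'id001']
print(merge_blocks([b1, b2]))
# (['id001', 'id002', 'id003'], [0, 1, 0, 2, 0])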

To reproduce locally on a tiny server, I just use this part without the "insert into ans", as EXPLAIN is not shown for the insert query. No indices are used:

CREATE TABLE G1_1e9_1e2_0_0
(
    id1 LowCardinality(Nullable(String)),
    id2 LowCardinality(Nullable(String)),
    id3 Nullable(String),
    id4 Nullable(Int32),
    id5 Nullable(Int32),
    id6 Nullable(Int32),
    v1 Nullable(Int32),
    v2 Nullable(Int32),
    v3 Nullable(Float64)
)
ENGINE = MergeTree() ORDER BY tuple();
INSERT INTO G1_1e9_1e2_0_0 SELECT * FROM s3('https://clickhouse-datasets.s3.amazonaws.com/h2o/G1_1e9_1e2_0_0.csv.gz');

EXPLAIN indexes=1 SELECT id1, id2, id3, id4, id5, id6, sum(v3) AS v3, count() AS cnt FROM G1_1e9_1e2_0_0 GROUP BY id1, id2, id3, id4, id5, id6 FORMAT Null;

┌─explain──────────────────────────────────────────┐
│ Expression ((Projection + Before ORDER BY))      │
│   Aggregating                                    │
│     Expression (Before GROUP BY)                 │
│       ReadFromMergeTree (default.G1_1e9_1e2_0_0) │
└──────────────────────────────────────────────────┘

EXPLAIN PIPELINE SELECT id1, id2, id3, id4, id5, id6, sum(v3) AS v3, count() AS cnt FROM G1_1e9_1e2_0_0 GROUP BY id1, id2, id3, id4, id5, id6

┌─explain──────────────────────────────────────────────────────────────────────────┐
│ (Expression)                                                                     │
│ ExpressionTransform × 8                                                          │
│   (Aggregating)                                                                  │
│   Resize 8 → 8                                                                   │
│     AggregatingTransform × 8                                                     │
│       StrictResize 8 → 8                                                         │
│         (Expression)                                                             │
│         ExpressionTransform × 8                                                  │
│           (ReadFromMergeTree)                                                    │
│           MergeTreeSelect(pool: PrefetchedReadPool, algorithm: Thread) × 8 0 → 1 │
└──────────────────────────────────────────────────────────────────────────────────┘

So I think the result on a 150-core server is just that fast...

jangorecki commented 7 months ago

I don't mean sorted/unsorted, but collecting statistics about the data and re-using those statistics in later queries.

qoega commented 7 months ago

I understand that. When data is sorted, there is some notion of a primary index; it is not used for this query to filter anything up front. When data is unsorted, there is no index even on the primary key. We do a full scan over the necessary columns in both scenarios and aggregate them explicitly (a toy illustration is below). There is no per-key count/sum index or anything like that, so I expect the answer is "there is no precomputed data that is used to optimize this query".
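Not ClickHouse internals, just a minimal Python sketch of what "full scan plus explicit aggregation" means for this query: every row of the needed columns is touched once and the aggregate is built on the fly, with no precomputed per-key structure to consult. The function name groupby_q10 is invented.

from collections import defaultdict

def groupby_q10(rows):
    """rows: iterable of (id1, id2, id3, id4, id5, id6, v3) tuples."""
    acc = defaultdict(lambda: [0.0, 0])  # key -> [sum(v3), count]
    for *key, v3 in rows:                # full scan: every row is visited once
        entry = acc[tuple(key)]
        entry[0] += v3
        entry[1] += 1
    return acc

rows = [("a", "b", "c", 1, 2, 3, 1.5), ("a", "b", "c", 1, 2, 3, 2.5)]
print(dict(groupby_q10(rows)))  # {('a', 'b', 'c', 1, 2, 3): [4.0, 2]}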

jangorecki commented 7 months ago

@qoega ok, thanks for the clarification. If you happen to run the script again in the future, could you move q10 to the start of the script, before q1, and let us know the timing?

qoega commented 7 months ago

Ok, I will try. From my understanding, the data still fits into the file cache after it is inserted, so there should be no difference in the level of data caching. Previous queries do not leave behind any statistics, and we do not compute any per-column statistics yet. I would love to have that feature to better optimize query execution, but we are not there yet.

Tmonster commented 7 months ago

DuckDB doesn't use an index, and statistics aren't giving any advantage here. DuckDB aggregates in two phases. In the first phase, each thread has its own fixed-size hash table, which is small enough to mostly stay in the CPU caches; data is also partitioned in this phase. In the second phase, the thread-local data is combined, with each thread combining one partition. At that point all the data is present, so DuckDB can allocate a hash table big enough to fit everything in the partition. This strategy avoids a lot of random access and also avoids any resizing of hash tables during either phase, and the design scales well with the high number of threads on the machine. (A toy sketch of the two-phase scheme is below.)
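Not DuckDB source, just a minimal single-process Python sketch of the two-phase scheme described above, with the 'threads' simulated sequentially: phase one pre-aggregates each thread's chunk into per-partition tables, phase two merges one partition per thread. The names (phase1, phase2, N_THREADS, N_PARTITIONS) are invented, and the real implementation additionally caps the phase-one table size so it stays cache-resident.

from collections import defaultdict

N_THREADS, N_PARTITIONS = 4, 8

def phase1(chunk):
    """One 'thread' pre-aggregates its chunk into per-partition hash tables."""
    partitions = [defaultdict(lambda: [0.0, 0]) for _ in range(N_PARTITIONS)]
    for key, v in chunk:
        slot = partitions[hash(key) % N_PARTITIONS][key]
        slot[0] += v   # running sum(v3)
        slot[1] += 1   # running count
    return partitions

def phase2(thread_locals, pid):
    """One 'thread' merges partition pid from every thread-local result.
    All data for the partition is known here, so a right-sized table
    could be allocated up front instead of being resized."""
    merged = {}
    for partitions in thread_locals:
        for key, (s, c) in partitions[pid].items():
            acc = merged.setdefault(key, [0.0, 0])
            acc[0] += s
            acc[1] += c
    return merged

data = [("g%d" % (i % 5), float(i)) for i in range(100)]
chunks = [data[t::N_THREADS] for t in range(N_THREADS)]   # one chunk per thread
thread_locals = [phase1(c) for c in chunks]               # phase 1
result = {}
for pid in range(N_PARTITIONS):                           # phase 2
    result.update(phase2(thread_locals, pid))
print(sorted(result.items()))  # 5 groups, each with count 20

Because each partition is merged by exactly one thread, no locking is needed in phase two, which matches the scaling behavior described above.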

jangorecki commented 7 months ago

OK, thank you both for the explanations. Those timings look amazingly impressive!