ClickHouse / clickhouse-docs

Official documentation for ClickHouse
https://clickhouse.com/docs

Docs explaining the ORDER BY alignment between DDL and query #900

Open · thomoco opened 1 year ago

thomoco commented 1 year ago

Per this Community Slack discussion: https://clickhousedb.slack.com/archives/CU478UEQZ/p1678474718774849

Original question: I'm using a timestamp in my ORDER BY schema, and I'm noticing that when I query in the reverse order, the query time is significantly slower. Does ClickHouse do any optimization to help here?

@nickitat's remedy: from the logs it looks like CH wasn't able to apply reading in order because you have slightly different expressions in the table sorting key and in the query's ORDER BY clause; please try the remedied query below.
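For reference (not part of the original thread): the optimization Nikita refers to is controlled by the `optimize_read_in_order` setting, which is enabled by default. A quick sketch of how to confirm it is on:

```sql
-- Check that the read-in-order optimization is enabled (default is 1).
SELECT name, value
FROM system.settings
WHERE name = 'optimize_read_in_order';
```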

For some background, here are the original DDL and queries (included since they are already public in Slack):

Original DDL:

```sql
CREATE TABLE logs
(
    Timestamp DateTime64(9) CODEC (Delta, ZSTD(1)),
    UUID UUID CODEC (ZSTD(1)),
    TraceId String CODEC (ZSTD(1)),
    SpanId String CODEC (ZSTD(1)),
    TraceFlags UInt32 CODEC (ZSTD(1)),
    Body String CODEC (ZSTD(1)),
    LogAttributes Map(LowCardinality(String), String) CODEC (ZSTD(1)),
    ProjectId UInt32 CODEC (ZSTD(1)),
    SecureSessionId String CODEC (ZSTD(1)),
    INDEX idx_trace_id TraceId TYPE bloom_filter GRANULARITY 1,
    INDEX idx_secure_session_id SecureSessionId TYPE bloom_filter GRANULARITY 1,
    INDEX idx_log_attr_key mapKeys(LogAttributes) TYPE bloom_filter GRANULARITY 1,
    INDEX idx_log_attr_value mapValues(LogAttributes) TYPE bloom_filter GRANULARITY 1,
    INDEX idx_body Body TYPE tokenbf_v1(32768, 3, 0) GRANULARITY 1
)
ENGINE = MergeTree
PARTITION BY toDate(Timestamp)
ORDER BY (ProjectId, toUnixTimestamp(Timestamp), UUID)
TTL toDateTime(Timestamp) + toIntervalDay(30)
SETTINGS index_granularity = 8192, ttl_only_drop_parts = 1;
```
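The detail that matters here is the sorting key: it contains the expression `toUnixTimestamp(Timestamp)` rather than the raw `Timestamp` column. If in doubt, the effective sorting key can be read back from `system.tables` (a sketch; `logs` is the table name from the DDL above):

```sql
-- Inspect the table's sorting key; for the DDL above this should show
-- something like: ProjectId, toUnixTimestamp(Timestamp), UUID
SELECT sorting_key
FROM system.tables
WHERE name = 'logs';
```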

Original queries:

```sql
-- Query in ascending order
SELECT *
FROM logs
WHERE ProjectId = 1
  AND Body ILIKE '%STRIPE_INTEGRATION_ERROR%'
  AND Timestamp <= now()
  AND Timestamp >= (now() - toIntervalDay(15))
ORDER BY Timestamp ASC
LIMIT 100
```

=> Elapsed: 25.690s; Read: 64,618,685 rows (58.94 GB)

```sql
-- Query in descending order
SELECT *
FROM logs
WHERE ProjectId = 1
  AND Body ILIKE '%identified session%'
  AND Timestamp <= now()
  AND Timestamp >= (now() - toIntervalDay(15))
ORDER BY Timestamp DESC
LIMIT 100
```

=> Elapsed: 98.482s; Read: 169,133,320 rows (179.71 GB)

```sql
SELECT *
FROM logs
WHERE ProjectId = 1
  AND Timestamp <= '2023-03-10 20:39:09.801403533'
ORDER BY Timestamp DESC
LIMIT 10
```

=> Elapsed: 82.265s; Read: 170,664,402 rows (181.65 GB)

Remedied query (suggested by @nickitat):

```sql
SELECT *
FROM logs
WHERE ProjectId = 1
  AND Timestamp <= '2023-03-10 20:39:09.801403533'
ORDER BY toUnixTimestamp(Timestamp) DESC
LIMIT 10
```
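One way to verify that the rewritten query actually reads in reverse order (not part of the original thread; the plan output varies by ClickHouse version) is to inspect the query plan and check the `ReadType` of the `ReadFromMergeTree` step:

```sql
EXPLAIN actions = 1
SELECT *
FROM logs
WHERE ProjectId = 1
  AND Timestamp <= '2023-03-10 20:39:09.801403533'
ORDER BY toUnixTimestamp(Timestamp) DESC
LIMIT 10;

-- When the optimization applies, the ReadFromMergeTree step should report
-- ReadType: InReverseOrder (or InOrder for ascending) rather than ReadType: Default.
```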

Let's please document the concept behind Nikita's recommendation. cc: @DanRoscigno @rfraposa

rfraposa commented 1 year ago

Just as a note: this is good feedback, but what happened is that the table was sorted by `toUnixTimestamp(Timestamp)` while the query used `ORDER BY Timestamp`, so it's not clear where to document something like this. I suppose we could create a Knowledge Base article with a title like "Are you using your primary key in your queries properly?".
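A minimal sketch of what such a KB article could demonstrate (the table and column names here are invented for illustration):

```sql
-- Toy table whose sorting key wraps the timestamp column in an expression,
-- mirroring the schema from this issue.
CREATE TABLE kb_demo
(
    id UInt32,
    ts DateTime
)
ENGINE = MergeTree
ORDER BY (id, toUnixTimestamp(ts));

-- Mismatched: sorts on the raw column, which differs from the sorting-key
-- expression, so ClickHouse may fall back to reading and sorting everything.
SELECT * FROM kb_demo WHERE id = 1 ORDER BY ts DESC LIMIT 10;

-- Matched: sorts on the same expression as the sorting key, letting the
-- read-in-order optimization stop after finding the LIMIT rows.
SELECT * FROM kb_demo WHERE id = 1 ORDER BY toUnixTimestamp(ts) DESC LIMIT 10;
```

The general point for the article: the expressions in a query's ORDER BY should match the table's sorting-key expressions exactly; an equivalent but differently written expression (the raw column versus `toUnixTimestamp` of it) may not be recognized.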

thomoco commented 1 year ago

A KB article sounds good, thanks.