getsentry / snuba-sdk

SDK for generating SnQL queries for Snuba
https://getsentry.github.io/snuba-sdk/snuba_sdk.html
Other
7 stars 5 forks source link

Add support for DISTINCT and DISTINCT ON #116

Open cmanallen opened 1 year ago

cmanallen commented 1 year ago

Example queries:

SELECT DISTINCT col1, col2 FROM table
SELECT DISTINCT ON (col1, col2) FROM table

Why it matters:

┌─memory_usage─┬─query_duration_ms─┬─query─────────────────────────────────────────────────────────┐
│     19201072 │                26 │ select replay_id from replays_test group by replay_id limit 1 │
│      4611493 │                15 │ select distinct replay_id from replays_test limit 1           │
└──────────────┴───────────────────┴───────────────────────────────────────────────────────────────┘

Real world data for a query which is guaranteed to match every row in the set:

SELECT DISTINCT replay_id
FROM replays_dist
WHERE project_id = x AND has(urls, 'not-in-set') = 0
LIMIT 1

-- [Information] executeQuery: Read 8192 rows, 378.23 KiB in 0.054623468 sec., 149972 rows/sec., 6.76 MiB/sec.
-- [Debug] MemoryTracker: Peak memory usage (for query): 28.02 MiB.
SELECT replay_id
FROM replays_dist
WHERE project_id = x
GROUP BY replay_id
HAVING has(urls, 'not-in-set') = 0
LIMIT 1

-- [Information] executeQuery: Read 13239640 rows, 702.62 MiB in 0.89119408 sec., 14856068 rows/sec., 788.41 MiB/sec.
-- [Debug] MemoryTracker: Peak memory usage (for query): 128.29 MiB.
SELECT replay_id
FROM replays_dist
WHERE project_id = x AND has(urls, 'not-in-set') = 0
GROUP BY replay_id
LIMIT 1

-- [Information] executeQuery: Read 13239640 rows, 702.62 MiB in 0.89119408 sec., 14856068 rows/sec., 788.41 MiB/sec.
-- [Debug] MemoryTracker: Peak memory usage (for query): 128.29 MiB.
cmanallen commented 1 year ago

Adding additional information on how the distinct clause matches up against a group by clause and a regular row scan.

┌─query_duration_ms─┬─read_rows─┬─read_bytes─┬─result_rows─┬─memory_usage─┬─query────────────────────────────────────────────────────────────┐
│                29 │   1000000 │   16000000 │           1 │     14575690 │ select replay_id from replays_test group by replay_id limit 1    │
│                17 │    344064 │    5505024 │           1 │            0 │ select distinct replay_id from replays_test limit 1              │
│                 6 │         1 │         16 │           1 │            0 │ select replay_id from replays_test limit 1                       │
└───────────────────┴───────────┴────────────┴─────────────┴──────────────┴──────────────────────────────────────────────────────────────────┘