ClickHouse / ClickHouse

ClickHouse® is a real-time analytics DBMS
https://clickhouse.com
Apache License 2.0
36.68k stars 6.78k forks source link

Continuous profiling: queries and flamegraphs #67615

Open yawnt opened 1 month ago

yawnt commented 1 month ago

Use case

Efficiently querying continuous profiling data in ClickHouse to drive flamegraph visualization would be a great addition to the logging, metrics and traces use case - especially in light of OpenTelemetry's decision to officially support it

Describe the solution you'd like

Describe alternatives you've considered

I've loaded up around ~3.29 billion profiler samples (~30 days) in a ClickHouse instance to generate flamegraphs. There are a few visualizers for flamegraphs (speedscope, flamegraph.pl), but I focused on Grafana's flame graph panel because:

  1. the format is quite compact compared to ie: flamegraph.pl: it's a depth-first traversal of the flamegraph with 4 fields: label, level, total, and self - with self being the cumulative sum of the box minus the cumulative sum of its children
  2. it's very interactive compared to a static svg
  3. ClickHouse Grafana plugin already supports Observability Traces and Logs, and Altinity's ClickHouse Grafana plugin also has added support for Flamegraphs (https://github.com/Altinity/clickhouse-grafana/pull/539)
  4. Grafana is quite popular :)

I tried three different approaches:

Pure SQL Works quite well actually! Starts to struggle once we hit a few days of profiling data because of the explosion caused by the `arrayJoin` and the `ORDER BY` and `GROUP BY` are high-cardinality due to the unique sub paths. ```SQL CREATE TABLE profilerSamples( timestamp DateTime, sampleCount UInt64, stackFrames Array(String), -- Could be a UUID / UInt128 and joined via a dictionary label1 String, label2 String ) ENGINE = MergeTree() -- Could be a Summing/AggregatingMergeTree ORDER BY (label1, label2, toStartOfMinute(timestamp)) ``` ```SQL WITH groupedProfilerSamples AS ( -- Reduces number of rows by grouping same stack frames together SELECT stackFrames, sum(sampleCount) as sampleCount FROM profilerSamples WHERE timestamp BETWEEN '' AND '' AND label1 = '' AND label2 = '' GROUP BY stackFrames ), summedProfilerSamples AS ( -- Explodes into individual flame graph boxes -- e.g.: -- stack=[a,b] samples=100 -- stack=[a,b,c] samples=200 -- stack=[e] samples=300 -- becomes -- subpath=[a] samples=100 self=false -- subpath=[a,b] samples=100 self=true -- subpath=[a] samples=200 self=false -- subpath=[a,b] samples=200 self=false -- subpath=[a,b,c] samples=200 self=true -- subpath=[e] samples=300 self=true SELECT arrayResize(stackFrames, stackLevel) AS stackSubpath, sum(samplesCount) AS samplesCount, arrayJoin(arrayEnumerate(stackFrames)) AS stackLevel, length(stackFrames) = stackLevel AS isSelf FROM groupedProfilerSamples ) SELECT 0 AS level, sum(samplesCount) AS value, 'All' AS label, 0 AS self FROM groupedProfilerSamples UNION ALL -- the structure of `groupedProfilerSamples` allows us to group -- and sum the same boxes and then print the tree in depth-first -- order by sorting subpaths lexicographically SELECT length(stackSubpath) AS level, sum(samplesCount) AS value, stackSubpath[-1] AS label, sumIf(samplesCount, isSelf) AS self FROM groupedProfilerSamples GROUP BY stackSubpath ORDER BY stackSubpath ```

flameGraph Aggregate Function I've tried running the `flameGraph` function against a modified table similar to the one detailed under "Pure SQL", but the fact that it stores children as a linked list means it slows down as the tree size grows. The `flamegraph.pl` format is also quite verbose due to the repetition of stack frames.

Executable Table Function I've written a small C++ program which reads the equivalent of the `groupedProfilerSamples SELECT` query from "Pure SQL", ingests it into a tree structure which uses a map to hold children and prints the tree to `STDOUT` in depth-first order. It also works quite well when I try to read data exported via `INTO OUTFILE` in `RowBinary` format, but I've hit issue https://github.com/ClickHouse/ClickHouse/issues/66646 when running it directly inside ClickHouse (meaning my `GROUP BY` that usually takes ~25 seconds never actually seem to manage to complete).

I'm curious whether I've been doing something inefficiently that can be optimized and / or to see if there is any interest in such an optimisation.

Thanks a lot!

PS: I did not know if I should have tagged the issue as RFC?

LGDHuaOPER commented 1 month ago

That sounds good! support!

tomershafir commented 1 month ago

Checkout https://github.com/metrico/qryn https://github.com/metrico/otel-collector

yawnt commented 3 weeks ago

@tomershafir Thank you for that, I was indeed aware of Qryn but it's a bit different because it is a separate product that needs to be installed and operated in addition to ClickHouse. I was suggesting a more native integration given the Otel focus on profiling.