Continuous profiling: queries and flamegraphs

yawnt commented 1 month ago

Use case

Efficiently querying continuous profiling data in ClickHouse to drive flamegraph visualization would be a great addition to the logging, metrics and traces use case - especially in light of OpenTelemetry's decision to officially support it

Describe the solution you'd like

The most popular OSS tool is Pyroscope by Grafana Labs, so something similar to what is being done for Prometheus / PromQL would unlock migration opportunities for customers to move all of their observability metrics into ClickHouse
Alternatively, extending the flameGraph function to support bigger trees, non-system tables and different output formats would also achieve similar results

Describe alternatives you've considered

I've loaded up around ~3.29 billion profiler samples (~30 days) in a ClickHouse instance to generate flamegraphs. There are a few visualizers for flamegraphs (speedscope, flamegraph.pl), but I focused on Grafana's flame graph panel because:

the format is quite compact compared to ie: flamegraph.pl: it's a depth-first traversal of the flamegraph with 4 fields: label, level, total, and self - with self being the cumulative sum of the box minus the cumulative sum of its children
it's very interactive compared to a static svg
ClickHouse Grafana plugin already supports Observability Traces and Logs, and Altinity's ClickHouse Grafana plugin also has added support for Flamegraphs (https://github.com/Altinity/clickhouse-grafana/pull/539)
Grafana is quite popular :)

I tried three different approaches:

Pure SQL

Works quite well actually! Starts to struggle once we hit a few days of profiling data because of the explosion caused by the `arrayJoin` and the `ORDER BY` and `GROUP BY` are high-cardinality due to the unique sub paths. ```SQL CREATE TABLE profilerSamples( timestamp DateTime, sampleCount UInt64, stackFrames Array(String), -- Could be a UUID / UInt128 and joined via a dictionary label1 String, label2 String ) ENGINE = MergeTree() -- Could be a Summing/AggregatingMergeTree ORDER BY (label1, label2, toStartOfMinute(timestamp)) ``` ```SQL WITH groupedProfilerSamples AS ( -- Reduces number of rows by grouping same stack frames together SELECT stackFrames, sum(sampleCount) as sampleCount FROM profilerSamples WHERE timestamp BETWEEN '' AND '' AND label1 = '' AND label2 = '' GROUP BY stackFrames ), summedProfilerSamples AS ( -- Explodes into individual flame graph boxes -- e.g.: -- stack=[a,b] samples=100 -- stack=[a,b,c] samples=200 -- stack=[e] samples=300 -- becomes -- subpath=[a] samples=100 self=false -- subpath=[a,b] samples=100 self=true -- subpath=[a] samples=200 self=false -- subpath=[a,b] samples=200 self=false -- subpath=[a,b,c] samples=200 self=true -- subpath=[e] samples=300 self=true SELECT arrayResize(stackFrames, stackLevel) AS stackSubpath, sum(samplesCount) AS samplesCount, arrayJoin(arrayEnumerate(stackFrames)) AS stackLevel, length(stackFrames) = stackLevel AS isSelf FROM groupedProfilerSamples ) SELECT 0 AS level, sum(samplesCount) AS value, 'All' AS label, 0 AS self FROM groupedProfilerSamples UNION ALL -- the structure of `groupedProfilerSamples` allows us to group -- and sum the same boxes and then print the tree in depth-first -- order by sorting subpaths lexicographically SELECT length(stackSubpath) AS level, sum(samplesCount) AS value, stackSubpath[-1] AS label, sumIf(samplesCount, isSelf) AS self FROM groupedProfilerSamples GROUP BY stackSubpath ORDER BY stackSubpath ```

flameGraph Aggregate Function

I've tried running the `flameGraph` function against a modified table similar to the one detailed under "Pure SQL", but the fact that it stores children as a linked list means it slows down as the tree size grows. The `flamegraph.pl` format is also quite verbose due to the repetition of stack frames.

Executable Table Function

I've written a small C++ program which reads the equivalent of the `groupedProfilerSamples SELECT` query from "Pure SQL", ingests it into a tree structure which uses a map to hold children and prints the tree to `STDOUT` in depth-first order. It also works quite well when I try to read data exported via `INTO OUTFILE` in `RowBinary` format, but I've hit issue https://github.com/ClickHouse/ClickHouse/issues/66646 when running it directly inside ClickHouse (meaning my `GROUP BY` that usually takes ~25 seconds never actually seem to manage to complete).

I'm curious whether I've been doing something inefficiently that can be optimized and / or to see if there is any interest in such an optimisation.

Thanks a lot!

PS: I did not know if I should have tagged the issue as RFC?

LGDHuaOPER commented 1 month ago

That sounds good! support!

tomershafir commented 1 month ago

Checkout https://github.com/metrico/qryn https://github.com/metrico/otel-collector

yawnt commented 3 weeks ago

@tomershafir Thank you for that, I was indeed aware of Qryn but it's a bit different because it is a separate product that needs to be installed and operated in addition to ClickHouse. I was suggesting a more native integration given the Otel focus on profiling.

ClickHouse / ClickHouse

Continuous profiling: queries and flamegraphs #67615