
sql: OOM risk of EXPLAIN ANALYZE (DEBUG) of statement with large lookup or index join #103358

Open michae2 opened 1 year ago

michae2 commented 1 year ago

At the end of statement diagnostics bundle collection of a query with a large lookup or index join, there can be a large spike in memory usage. Sometimes this is enough to OOM a node. Here's a demonstration using a real cluster:

-- create a table with a secondary index and at least one leaseholder range per node
CREATE TABLE abc (a STRING, b STRING, c STRING, INDEX (a, b));
ALTER TABLE abc CONFIGURE ZONE USING range_max_bytes = 33554432, range_min_bytes = 1048576;

-- populate
INSERT INTO abc SELECT i::string, i::string, i::string FROM generate_series(0, 99999) s(i);
INSERT INTO abc SELECT i::string, i::string, i::string FROM generate_series(0, 99999) s(i);
INSERT INTO abc SELECT i::string, i::string, i::string FROM generate_series(0, 99999) s(i);
INSERT INTO abc SELECT i::string, i::string, i::string FROM generate_series(0, 99999) s(i);
INSERT INTO abc SELECT i::string, i::string, i::string FROM generate_series(0, 99999) s(i);
INSERT INTO abc SELECT i::string, i::string, i::string FROM generate_series(0, 99999) s(i);
INSERT INTO abc SELECT i::string, i::string, i::string FROM generate_series(0, 99999) s(i);
INSERT INTO abc SELECT i::string, i::string, i::string FROM generate_series(0, 99999) s(i);
INSERT INTO abc SELECT i::string, i::string, i::string FROM generate_series(0, 99999) s(i);
INSERT INTO abc SELECT i::string, i::string, i::string FROM generate_series(0, 99999) s(i);
INSERT INTO abc SELECT i::string, i::string, i::string FROM generate_series(0, 99999) s(i);
INSERT INTO abc SELECT i::string, i::string, i::string FROM generate_series(0, 99999) s(i);
INSERT INTO abc SELECT i::string, i::string, i::string FROM generate_series(0, 99999) s(i);
INSERT INTO abc SELECT i::string, i::string, i::string FROM generate_series(0, 99999) s(i);
INSERT INTO abc SELECT i::string, i::string, i::string FROM generate_series(0, 99999) s(i);
INSERT INTO abc SELECT i::string, i::string, i::string FROM generate_series(0, 99999) s(i);
INSERT INTO abc SELECT i::string, i::string, i::string FROM generate_series(0, 99999) s(i);
INSERT INTO abc SELECT i::string, i::string, i::string FROM generate_series(0, 99999) s(i);
INSERT INTO abc SELECT i::string, i::string, i::string FROM generate_series(0, 99999) s(i);
INSERT INTO abc SELECT i::string, i::string, i::string FROM generate_series(0, 99999) s(i);
INSERT INTO abc SELECT i::string, i::string, i::string FROM generate_series(0, 99999) s(i);
INSERT INTO abc SELECT i::string, i::string, i::string FROM generate_series(0, 99999) s(i);
INSERT INTO abc SELECT i::string, i::string, i::string FROM generate_series(0, 99999) s(i);
INSERT INTO abc SELECT i::string, i::string, i::string FROM generate_series(0, 99999) s(i);
INSERT INTO abc SELECT i::string, i::string, i::string FROM generate_series(0, 99999) s(i);
INSERT INTO abc SELECT i::string, i::string, i::string FROM generate_series(0, 99999) s(i);
INSERT INTO abc SELECT i::string, i::string, i::string FROM generate_series(0, 99999) s(i);
INSERT INTO abc SELECT i::string, i::string, i::string FROM generate_series(0, 99999) s(i);
INSERT INTO abc SELECT i::string, i::string, i::string FROM generate_series(0, 99999) s(i);
INSERT INTO abc SELECT i::string, i::string, i::string FROM generate_series(0, 99999) s(i);

ANALYZE abc;
SHOW RANGES FROM TABLE abc;

-- use a vectorized, distributed query with 2.6m rows passing through an index join
EXPLAIN SELECT COUNT(DISTINCT c) FROM abc@abc_a_b_idx WHERE a > '2' AND b > '2' AND c > '2';

-- normal execution takes a few seconds and only has a small impact on total node memory usage
SELECT now(); SELECT COUNT(DISTINCT c) FROM abc@abc_a_b_idx WHERE a > '2' AND b > '2' AND c > '2';
SELECT now();

-- execution with analyze (statistics collection) takes 10 seconds and again has only a small impact on total node memory usage
SELECT now(); EXPLAIN ANALYZE SELECT COUNT(DISTINCT c) FROM abc@abc_a_b_idx WHERE a > '2' AND b > '2' AND c > '2';
SELECT now();

-- execution with verbose tracing takes over 3 minutes and causes a large spike in CPU and memory usage
SELECT now(); EXPLAIN ANALYZE (DEBUG) SELECT COUNT(DISTINCT c) FROM abc@abc_a_b_idx WHERE a > '2' AND b > '2' AND c > '2';
SELECT now();

Here's how that looked in the metrics:

[Screenshot of node CPU and memory metrics, 2023-05-15 16:03:11]

Even though the bundle said max memory usage was 67 MiB, we actually saw a spike > 1 GiB.

This is very similar to https://github.com/cockroachdb/cockroach/issues/90739 and may even have the same root cause (verbose tracing), but I wanted to document the spike in memory usage. We should try to fix this memory spike even if statement bundle collection remains slower than normal execution.

I believe the spike in memory on the gateway node is due to unmarshaling of this log message in traces: https://github.com/cockroachdb/cockroach/blob/190aa543fae4b1c9cea34168575e54104deb62aa/pkg/kv/kvserver/replica_evaluate.go#L550-L551
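
For a sense of scale, here is a rough, self-contained Go sketch (not CockroachDB code) of the accumulation pattern: the gateway holds every unmarshalled log record of the trace in memory at once, so lots of small per-request messages add up. The one-million message count and ~1 KiB message size are illustrative assumptions, not measured values.

package main

import "fmt"

// logRecord stands in for one unmarshalled trace log message. Real traces
// hold structured protobuf records; a byte payload is enough to show the
// arithmetic.
type logRecord struct {
	msg []byte
}

func main() {
	const (
		numMessages = 1_000_000 // illustrative: one message per traced KV request
		messageSize = 1 << 10   // illustrative: ~1 KiB per (already truncated) message
	)

	// The gateway materializes the whole trace at once, so every record is
	// live simultaneously rather than being streamed and discarded.
	trace := make([]logRecord, 0, numMessages)
	var total int
	for i := 0; i < numMessages; i++ {
		trace = append(trace, logRecord{msg: make([]byte, messageSize)})
		total += messageSize
	}
	fmt.Printf("%d records held live at once: ~%d MiB of payload alone\n",
		len(trace), total>>20)
}

A million ~1 KiB records is roughly 1 GiB of payload before any protobuf or slice overhead, which is in the same ballpark as the spike in the metrics above.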

I have reproduced this on both v23.1.0 and v21.2.17.

Jira issue: CRDB-27960

yuzefovich commented 1 year ago

Nice find!

Even though the bundle said max memory usage was 67 MiB, we actually saw a spike > 1 GiB.

The max memory usage reported by the bundle only includes the execution statistics collected during the query run about the memory usage of operators in the plan; auxiliary work like unmarshalling trace data is never included in that number, so it's expected that this memory doesn't show up in the diagram. We did add accounting for LeafTxnFinalState metadata in #85285, and perhaps we should extend that metadata accounting to all metadata types.
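
As a hedged sketch of what extending that accounting might look like, using a hypothetical budget type and made-up names rather than the real util/mon API: account for each piece of remote metadata by its size before retaining it, so the auxiliary memory is at least visible to (and limited by) the query's budget.

package main

import (
	"errors"
	"fmt"
)

// memBudget is a hypothetical stand-in for a per-query memory account.
type memBudget struct {
	used, limit int64
}

func (b *memBudget) grow(n int64) error {
	if b.used+n > b.limit {
		return errors.New("memory budget exceeded while buffering trace metadata")
	}
	b.used += n
	return nil
}

// metadata is a stand-in for any remote metadata (trace data, LeafTxnFinalState, ...).
type metadata struct{ payload []byte }

// bufferMetadata accounts for the metadata's footprint before retaining it,
// so the caller can drop or truncate the trace instead of risking an OOM.
func bufferMetadata(buf *[]metadata, b *memBudget, m metadata) error {
	if err := b.grow(int64(len(m.payload))); err != nil {
		return err
	}
	*buf = append(*buf, m)
	return nil
}

func main() {
	budget := &memBudget{limit: 64 << 20} // 64 MiB, illustrative
	var buffered []metadata
	err := bufferMetadata(&buffered, budget, metadata{payload: make([]byte, 1<<20)})
	fmt.Println(len(buffered), budget.used, err)
}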

yuzefovich commented 1 year ago

If I'm reading the comments here https://github.com/cockroachdb/cockroach/blob/2b66e5ed0f16c9fc7d21df2c575265ccb127ccb7/pkg/util/tracing/tracer.go#L53-L57 right, then it's expected that the log messages for a single trace could take up to 256MB, so it appears that we're not respecting these two limits (or perhaps there is additional overhead from serializing / deserializing the logs into protobuf). We might need some guidance from Obs Infra to know whether this behavior is expected, cc @dhartunian @abarganier

rytaft commented 1 year ago

From triage meeting:

yuzefovich commented 1 year ago

I spent some time playing around with Michael's reproduction on a single-node cluster and collecting heap profiles. Some interesting findings:

DrewKimball commented 1 year ago

Could we benefit here by adding a pool for execinfrapb.ProducerMessage structs so that execinfrapb.distSQLFlowStreamServer doesn't have to allocate new ones in Recv()? That way we could reuse memory for fields like tracingpb.RecordedSpan.Logs that are contributing to the heap profiles we've seen.
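
A minimal sketch of the pooling idea, using a hypothetical message type instead of the generated execinfrapb.ProducerMessage (reusing the real generated type would require changing how Recv allocates):

package main

import (
	"fmt"
	"sync"
)

// producerMessage is a hypothetical stand-in for the protobuf message
// received on a DistSQL flow stream.
type producerMessage struct {
	data []byte
}

var msgPool = sync.Pool{
	New: func() any { return &producerMessage{} },
}

// recvInto reuses a pooled message instead of allocating a fresh one per
// Recv call; the caller returns it to the pool when done with it.
func recvInto(payload []byte) *producerMessage {
	m := msgPool.Get().(*producerMessage)
	m.data = append(m.data[:0], payload...) // reuse the backing slice when it is large enough
	return m
}

func releaseMessage(m *producerMessage) {
	msgPool.Put(m)
}

func main() {
	m := recvInto([]byte("row data"))
	fmt.Println(string(m.data))
	releaseMessage(m)
}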

yuzefovich commented 1 year ago

Could we benefit here by adding a pool for execinfrapb.ProducerMessage structs so that execinfrapb.distSQLFlowStreamServer doesn't have to allocate new ones in Recv()? That way we could reuse memory for fields like tracingpb.RecordedSpan.Logs that are contributing to the heap profiles we've seen.

I'm not sure we have a precedent for this (we'd need to modify how the protobuf code is generated, or perhaps disable generation of the Recv method and write it manually, and maybe also ServerStream.RecvMsg).

However, I'm a bit worried about sync-pooling some of these large tracingpb.RecordedSpan.Logs objects - presumably we'd use them very rarely (only when an expensive query with lots of KV requests is being traced), yet we could keep the sync-pooled objects in memory for a non-trivial amount of time, significantly increasing RSS. I think we should tackle this problem from a different angle - we should not be creating these large log messages / large traces in the first place (by either truncating or dropping stuff altogether).
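
One common mitigation for that retention concern, sketched here with the same hypothetical types and an arbitrary threshold, is to only return objects to the pool when their backing buffer stayed small, so the rare multi-megabyte traced messages are left to the GC instead of being pinned between expensive traced queries.

package main

import "sync"

// producerMessage and msgPool mirror the hypothetical types from the
// earlier sketch.
type producerMessage struct{ data []byte }

var msgPool = sync.Pool{New: func() any { return &producerMessage{} }}

// maxPooledCap is an illustrative threshold: buffers that grew beyond it
// (e.g. while carrying a huge traced message) are not returned to the pool.
const maxPooledCap = 64 << 10 // 64 KiB

func releaseMessage(m *producerMessage) {
	if cap(m.data) > maxPooledCap {
		return // let the GC reclaim the oversized buffer
	}
	m.data = m.data[:0]
	msgPool.Put(m)
}

func main() {
	releaseMessage(&producerMessage{data: make([]byte, 1<<20)}) // dropped
	releaseMessage(&producerMessage{data: make([]byte, 1<<10)}) // pooled
}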

tbg commented 1 year ago

I believe the spike in memory on the gateway node is due to unmarshaling of this log message in traces:

I'm just driving by, but are we talking about the right problem? I doubt anyone on KV would object to significantly clamping down on this log statement. We should never be logging unbounded amounts of data.

It's difficult to prevent creation of large log messages, but perhaps via logging telemetry (sentry?) we could highlight locations for which a message of >256kb (arbitrary threshold) was ever seen, and at least take measures after the fact.
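
A hedged sketch of that idea (the helper, the threshold constant, and the reporting map are all made up for illustration): truncate anything over the threshold at record time and remember which call sites ever produced an oversized message, so they can be tracked down after the fact.

package main

import (
	"fmt"
	"sync"
)

// maxLogMsgBytes is the arbitrary threshold from the comment above.
const maxLogMsgBytes = 256 << 10 // 256 KiB

var (
	mu             sync.Mutex
	oversizedSites = map[string]int{} // call site -> count of oversized messages seen
)

// recordLogMsg truncates oversized messages and notes which call site
// produced them; the map stands in for whatever telemetry (sentry?)
// reporting would actually be wired up.
func recordLogMsg(callSite, msg string) string {
	if len(msg) <= maxLogMsgBytes {
		return msg
	}
	mu.Lock()
	oversizedSites[callSite]++
	mu.Unlock()
	return msg[:maxLogMsgBytes] + " [truncated]"
}

func main() {
	huge := string(make([]byte, 1<<20)) // 1 MiB message
	_ = recordLogMsg("replica_evaluate.go:550", huge)
	fmt.Println(oversizedSites)
}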

yuzefovich commented 1 year ago

It looks like that log message should already be truncated to roughly 1KiB, and IIUC the problem is that we have too many log messages that add up.