Open · michae2 opened this issue 1 year ago
Nice find!
> Even though the bundle said max memory usage was 67 MiB, we actually saw a spike > 1 GiB.
The max memory usage reported by the bundle only includes the execution statistics collected during the query run about the memory usage of operators in the plan; auxiliary work like unmarshalling trace data is never included in that number, so it's expected that this memory doesn't show up in the diagram. We did add accounting for LeafTxnFinalState metadata in #85285, and perhaps we should extend that metadata accounting to all metadata types.
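For illustration, here is a minimal sketch of that direction, assuming a simple memory-account interface; the names memAccount, producerMetadata, and sizeOfMetadata are made up for the example and do not correspond to CockroachDB's actual API:

```go
// Sketch only: before buffering a piece of remote metadata, reserve its
// approximate size against a memory account so that large traces become
// visible to memory accounting instead of being untracked overhead.
package metadataaccounting

type memAccount interface {
	Grow(n int64) error // returns an error once the budget is exhausted
}

type producerMetadata struct {
	TraceData []byte // serialized recorded spans; can be very large
	// other metadata types elided
}

// sizeOfMetadata is a rough estimate; a real implementation would account
// for every metadata field, not just the trace payload.
func sizeOfMetadata(meta *producerMetadata) int64 {
	return int64(len(meta.TraceData))
}

func bufferMetadata(acc memAccount, meta *producerMetadata, buf *[]*producerMetadata) error {
	if err := acc.Grow(sizeOfMetadata(meta)); err != nil {
		// Surface the budget error instead of silently growing RSS.
		return err
	}
	*buf = append(*buf, meta)
	return nil
}
```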
If I'm reading the comments here https://github.com/cockroachdb/cockroach/blob/2b66e5ed0f16c9fc7d21df2c575265ccb127ccb7/pkg/util/tracing/tracer.go#L53-L57 right, it's expected that the log messages for a single trace could take up to 256 MB, so it appears that we're not respecting these two limits (or perhaps there is additional overhead from serializing / deserializing the logs into protobuf). We might need some guidance from Obs Infra on whether this behavior is expected, cc @dhartunian @abarganier
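To make the 256 MB figure concrete, here is back-of-the-envelope arithmetic with assumed values; the constants below are illustrative stand-ins, not the actual limits defined in tracer.go:

```go
// Back-of-the-envelope arithmetic only: two per-span/per-trace limits
// multiply into a large per-trace bound on raw log data.
package tracebudget

const (
	assumedMaxLogBytesPerSpan       = 256 << 10 // 256 KiB of log data per recorded span (assumption)
	assumedMaxRecordedSpansPerTrace = 1024      // recorded spans retained per trace (assumption)
)

// 256 KiB * 1024 = 256 MiB of raw log data a single trace may carry,
// before any protobuf serialization/deserialization overhead on top.
const assumedMaxLogBytesPerTrace = assumedMaxLogBytesPerSpan * assumedMaxRecordedSpansPerTrace
```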
From triage meeting:
I spent some time playing around with Michael's reproduction on a single-node cluster and collecting heap profiles (a sketch of capturing such profiles in-process follows below). Some interesting finds:
- when looking at inuse_space right after bundle collection, the conversion of the trace to JaegerJSON is non-trivial
- when looking at alloc_space (since the node started, after having collected the bundle a few times), txnSpanRefresher logging is the largest memory allocator
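For reference, a minimal, generic sketch of grabbing a heap profile in-process; this is not CockroachDB code, and inuse_space vs alloc_space are simply the sample indexes that `go tool pprof` can display from the same profile:

```go
// Sketch: write a heap profile to disk for later inspection with
// `go tool pprof -sample_index=inuse_space` or `-sample_index=alloc_space`.
package heapprof

import (
	"os"
	"runtime"
	"runtime/pprof"
)

func writeHeapProfile(path string) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()
	// Run a GC first so the in-use numbers reflect live memory only.
	runtime.GC()
	// The heap profile contains both in-use and cumulative allocation
	// samples; the pprof viewer selects which index to display.
	return pprof.Lookup("heap").WriteTo(f, 0)
}
```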
Could we benefit here by adding a pool for execinfrapb.ProducerMessage structs so that execinfrapb.distSQLFlowStreamServer doesn't have to allocate new ones in Recv()? That way we could reuse memory for fields like tracingpb.RecordedSpan.Logs that are contributing to the heap profiles we've seen.
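A minimal sketch of that pooling idea, with made-up type names standing in for execinfrapb.ProducerMessage; the generated gRPC Recv code would still need to be changed to actually use such a pool:

```go
// Sketch only: reuse message structs across Recv calls instead of
// allocating a fresh one per incoming gRPC message.
package msgpool

import "sync"

type producerMessage struct {
	Data []byte
	// other fields elided
}

var producerMessagePool = sync.Pool{
	New: func() any { return &producerMessage{} },
}

func getProducerMessage() *producerMessage {
	return producerMessagePool.Get().(*producerMessage)
}

func putProducerMessage(m *producerMessage) {
	// Reset the message but keep the backing array so a later Recv can
	// unmarshal into it without reallocating.
	m.Data = m.Data[:0]
	producerMessagePool.Put(m)
}
```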
I'm not sure if we have a precedent for this (we'd need to modify how the protobuf generated code is generated, or perhaps disable generation of the Recv method so that we can write it manually, maybe also ServerStream.RecvMsg).
However, I'm a bit worried about sync-pooling some of these large tracingpb.RecordedSpan.Logs objects - presumably we'd use them very rarely (when an expensive query with lots of KV requests is being traced), yet we could keep the sync-pooled objects in memory for a non-trivial amount of time, significantly increasing RSS. I think we should tackle this problem from a different angle - we should not be creating these large log messages / large traces in the first place (by either truncating or dropping stuff altogether).
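As a rough sketch of the truncate-at-the-source direction, assuming an arbitrary 1 KiB cap and a made-up helper name:

```go
// Sketch only: clamp a value before it is ever handed to the trace span,
// so that large log messages are never created in the first place.
package logclamp

const maxLoggedValueBytes = 1 << 10 // 1 KiB, arbitrary assumption

func truncateForTrace(s string) string {
	if len(s) <= maxLoggedValueBytes {
		return s
	}
	return s[:maxLoggedValueBytes] + " ... (truncated)"
}
```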
> I believe the spike in memory on the gateway node is due to unmarshaling of this log message in traces:
I'm just driving by, but are we talking about the right problem? I doubt anyone on KV would object to significantly clamping down on this log statement. We should never be logging unbounded amounts of data.
It's difficult to prevent creation of large log messages, but perhaps via logging telemetry (Sentry?) we could highlight call sites for which a message of >256 KiB (arbitrary threshold) was ever seen, and at least take measures after the fact.
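A sketch of what that after-the-fact flagging could look like; every name and the threshold below are assumptions, not an existing API:

```go
// Sketch only: wrap message formatting and report a call site the first
// time it produces a message above some size threshold.
package logtelemetry

import (
	"fmt"
	"sync"
)

const oversizedLogThreshold = 256 << 10 // 256 KiB, arbitrary assumption

var reportedSites sync.Map // call site -> struct{}{}

// reportOversized would send to Sentry / telemetry in a real system.
var reportOversized = func(site string, size int) {
	fmt.Printf("oversized log message at %s: %d bytes\n", site, size)
}

func formatAndCheck(site, format string, args ...interface{}) string {
	msg := fmt.Sprintf(format, args...)
	if len(msg) > oversizedLogThreshold {
		if _, seen := reportedSites.LoadOrStore(site, struct{}{}); !seen {
			reportOversized(site, len(msg))
		}
	}
	return msg
}
```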
It looks like that log message should already be truncated to roughly 1 KiB, and IIUC the problem is that we have too many such log messages, which add up.
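For a rough sense of scale: at ~1 KiB per truncated message, on the order of a million logged messages already amounts to ~1 GiB, before counting any protobuf or JaegerJSON conversion overhead.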
At the end of statement diagnostics bundle collection of a query with a large lookup or index join, there can be a large spike in memory usage. Sometimes this is enough to OOM a node. Here's a demonstration using a real cluster:
Here's how that looked in the metrics:
Even though the bundle said max memory usage was 67 MiB, we actually saw a spike > 1 GiB.
This is very similar to https://github.com/cockroachdb/cockroach/issues/90739 and may even be the same root cause (verbose tracing), but I wanted to document the spike in memory usage. We should try to fix this spike even if it means statement bundle collection takes longer than normal execution.
I believe the spike in memory on the gateway node is due to unmarshaling of this log message in traces: https://github.com/cockroachdb/cockroach/blob/190aa543fae4b1c9cea34168575e54104deb62aa/pkg/kv/kvserver/replica_evaluate.go#L550-L551
I have reproduced this on both v23.1.0 and v21.2.17.
Jira issue: CRDB-27960