cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

sql: investigate out of memory issues on TPCC check 3.3.2.6 #47205

Closed jordanlewis closed 4 years ago

jordanlewis commented 4 years ago

The OOM crash can be reliably reproduced on a 3-node roachprod cluster with default n1-standard-4 nodes after importing 500 warehouses and running query 3.3.2.6 manually (so the primary key alteration is not needed). I added more logging to try to understand where we're not accounting for memory properly, but I still don't see it.

Here is an explain analyze of the query (from an occasion when it didn't crash), for context on the amount of data flying around:

[screenshot: explain analyze output of the query]

We seem to be estimating the memory usage of the aggregators relatively well now. For example, here is one of the heap profiles taken when the first-stage aggregator finishes and the second one is about to:

[screenshot: heap profile]

And at about this time in the logs we have:

[n2,client=127.0.0.1:56166,hostnossl,user=root] aggregator-mem-5: bytes usage increases to 1.0 GiB
[n2,client=127.0.0.1:56166,hostnossl,user=root] aggregator-funcs-mem-3, bytes usage max 170 MiB
[n2,client=127.0.0.1:56166,hostnossl,user=root] aggregator-mem-3, bytes usage max 1.9 GiB
[n2,client=127.0.0.1:56166,hostnossl,user=root] aggregator-funcs-mem-5: bytes usage increases to 128 MiB (+10240)

In all of the crashes I've observed we successfully get to the point where hash join is being performed and actually get pretty far in the join's execution. However, RSS jumps quite significantly (say from 13 GiB RSS, 203 goroutines, 7.3 GiB/2.5 GiB/10 GiB GO alloc/idle/total to 14 GiB RSS, 201 goroutines, 9.2 GiB/720 MiB/10 GiB GO alloc/idle/total in 10 seconds), but this jump is not reflected in the heap profiles. It is possible that OS reports RSS with a delay, but I'm very puzzled by this.

I'm tired of looking at this issue and am wondering whether someone else should take a stab at it. Here are a few notes on my setup:

  1. I added some logging and pushed the changes to https://github.com/yuzefovich/cockroach/tree/debug-alterpk-tpcc
  2. before running the query I disable auto stats and set some verbosity (on each node):
    set cluster setting sql.stats.automatic_collection.enabled=false;
    select crdb_internal.set_vmodule('hashjoiner=1,bytes_usage=1,hash_row_container=2,row_based_flow=1');
  3. I start the CockroachDB cluster with a custom memory profiling interval:
    roachprod start yahor-test:1-3 --env="COCKROACH_MEMPROF_INTERVAL=3s"
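
Putting steps 2 and 3 together, the repro recipe is roughly the following (cluster name, env var, and vmodule settings are from my setup above; adjust for your own cluster):

```shell
# start a 3-node cluster that writes a heap profile every 3 seconds
roachprod start yahor-test:1-3 --env="COCKROACH_MEMPROF_INTERVAL=3s"

# then, in a SQL session (the vmodule call on each node):
SET CLUSTER SETTING sql.stats.automatic_collection.enabled = false;
SELECT crdb_internal.set_vmodule('hashjoiner=1,bytes_usage=1,hash_row_container=2,row_based_flow=1');
```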
yuzefovich commented 4 years ago

As I posted in Slack, this crash is a regression on release-19.2 after v19.2.5 was tagged. I narrowed down the culprit: I'm certain it is the introduction of memory monitoring for some aggregate builtins (namely anyNotNullAggregate) by #46622 and #46817.

Here are the logs I get with those two commits reverted (at e314eba32e77):

I200414 03:49:46.646055 3203 sql/rowexec/aggregator.go:686  [n2] hash aggregator is done
I200414 03:49:47.296875 193 server/status/runtime.go:498  [n2] runtime stats: 9.8 GiB RSS, 236 goroutines, 6.1 GiB/44 MiB/6.5 GiB GO alloc/idle/total, 2.9 GiB/3.7 GiB CGO alloc/total, 2280.9 CGO/sec, 224.0/48.0 %(u/s)time, 0.0 %gc (0x), 3.2 MiB/5.9 MiB (r/w)net
I200414 03:49:47.682110 3205 sql/rowexec/hashjoiner.go:770  [n2] hash joiner initialized stored side
I200414 03:49:50.300227 193 server/status/runtime.go:498  [n2] runtime stats: 10 GiB RSS, 237 goroutines, 6.1 GiB/44 MiB/6.5 GiB GO alloc/idle/total(stale), 2.9 GiB/3.7 GiB CGO alloc/total, 2080.3 CGO/sec, 308.0/43.3 %(u/s)time, 0.0 %gc (0x), 6.9 MiB/4.9 MiB (r/w)net
I200414 03:49:52.303175 193 server/status/runtime.go:498  [n2] runtime stats: 10 GiB RSS, 237 goroutines, 6.1 GiB/44 MiB/6.5 GiB GO alloc/idle/total(stale), 2.9 GiB/3.7 GiB CGO alloc/total, 2896.2 CGO/sec, 323.0/33.5 %(u/s)time, 0.0 %gc (0x), 2.4 MiB/2.0 MiB (r/w)net
I200414 03:49:54.307698 193 server/status/runtime.go:498  [n2] runtime stats: 11 GiB RSS, 237 goroutines, 6.1 GiB/44 MiB/6.5 GiB GO alloc/idle/total(stale), 2.9 GiB/3.7 GiB CGO alloc/total, 2802.2 CGO/sec, 319.8/34.4 %(u/s)time, 0.0 %gc (0x), 2.5 MiB/2.0 MiB (r/w)net
I200414 03:49:56.307770 193 server/status/runtime.go:498  [n2] runtime stats: 11 GiB RSS, 237 goroutines, 6.1 GiB/44 MiB/6.5 GiB GO alloc/idle/total(stale), 2.9 GiB/3.7 GiB CGO alloc/total, 2800.9 CGO/sec, 324.5/33.5 %(u/s)time, 0.0 %gc (0x), 2.3 MiB/2.1 MiB (r/w)net
I200414 03:49:58.312898 193 server/status/runtime.go:498  [n2] runtime stats: 11 GiB RSS, 237 goroutines, 6.1 GiB/44 MiB/6.5 GiB GO alloc/idle/total(stale), 2.9 GiB/3.7 GiB CGO alloc/total, 2823.8 CGO/sec, 314.7/36.4 %(u/s)time, 0.0 %gc (0x), 2.0 MiB/2.4 MiB (r/w)net
I200414 03:50:00.318257 193 server/status/runtime.go:498  [n2] runtime stats: 11 GiB RSS, 245 goroutines, 6.1 GiB/44 MiB/6.5 GiB GO alloc/idle/total(stale), 2.9 GiB/3.7 GiB CGO alloc/total, 2875.8 CGO/sec, 334.1/41.9 %(u/s)time, 0.0 %gc (0x), 2.2 MiB/2.0 MiB (r/w)net
I200414 03:50:02.315234 193 server/status/runtime.go:498  [n2] runtime stats: 12 GiB RSS, 237 goroutines, 6.1 GiB/44 MiB/6.5 GiB GO alloc/idle/total(stale), 2.9 GiB/3.7 GiB CGO alloc/total, 2828.8 CGO/sec, 308.5/40.6 %(u/s)time, 0.0 %gc (0x), 1.9 MiB/2.3 MiB (r/w)net
I200414 03:50:04.321622 193 server/status/runtime.go:498  [n2] runtime stats: 12 GiB RSS, 237 goroutines, 6.1 GiB/44 MiB/6.5 GiB GO alloc/idle/total(stale), 2.9 GiB/3.7 GiB CGO alloc/total, 2835.5 CGO/sec, 306.5/41.9 %(u/s)time, 0.0 %gc (0x), 2.4 MiB/2.4 MiB (r/w)net
I200414 03:50:06.322620 193 server/status/runtime.go:498  [n2] runtime stats: 12 GiB RSS, 237 goroutines, 6.1 GiB/44 MiB/6.5 GiB GO alloc/idle/total(stale), 2.9 GiB/3.7 GiB CGO alloc/total, 2790.1 CGO/sec, 308.9/34.5 %(u/s)time, 0.0 %gc (0x), 2.0 MiB/2.3 MiB (r/w)net
I200414 03:50:08.323787 193 server/status/runtime.go:498  [n2] runtime stats: 12 GiB RSS, 237 goroutines, 6.1 GiB/44 MiB/6.5 GiB GO alloc/idle/total(stale), 2.9 GiB/3.7 GiB CGO alloc/total, 2903.8 CGO/sec, 308.8/39.0 %(u/s)time, 0.0 %gc (0x), 4.0 MiB/2.4 MiB (r/w)net
I200414 03:50:09.324277 193 server/status/runtime.go:498  [n2] runtime stats: 12 GiB RSS, 236 goroutines, 5.9 GiB/2.4 GiB/8.8 GiB GO alloc/idle/total, 3.0 GiB/3.7 GiB CGO alloc/total, 6174.6 CGO/sec, 321.1/23.0 %(u/s)time, 0.0 %gc (1x), 2.4 MiB/519 KiB (r/w)net
I200414 03:50:11.323325 193 server/status/runtime.go:498  [n2] runtime stats: 12 GiB RSS, 234 goroutines, 6.1 GiB/2.2 GiB/8.8 GiB GO alloc/idle/total, 2.9 GiB/3.7 GiB CGO alloc/total, 39107.9 CGO/sec, 267.0/20.0 %(u/s)time, 0.0 %gc (0x), 2.9 MiB/1.8 MiB (r/w)net
I200414 03:50:13.323063 193 server/status/runtime.go:498  [n2] runtime stats: 12 GiB RSS, 234 goroutines, 6.4 GiB/1.9 GiB/8.8 GiB GO alloc/idle/total, 2.9 GiB/3.7 GiB CGO alloc/total, 137215.9 CGO/sec, 250.5/34.5 %(u/s)time, 0.0 %gc (0x), 2.1 MiB/3.6 MiB (r/w)net
I200414 03:50:15.323856 193 server/status/runtime.go:498  [n2] runtime stats: 12 GiB RSS, 234 goroutines, 6.6 GiB/1.7 GiB/8.8 GiB GO alloc/idle/total, 2.9 GiB/3.7 GiB CGO alloc/total, 128360.6 CGO/sec, 226.9/36.0 %(u/s)time, 0.0 %gc (0x), 2.2 MiB/3.7 MiB (r/w)net
I200414 03:50:15.368359 3204 sql/rowexec/aggregator.go:686  [n2] hash aggregator is done
I200414 03:50:17.323555 193 server/status/runtime.go:498  [n2] runtime stats: 12 GiB RSS, 224 goroutines, 6.8 GiB/1.6 GiB/8.8 GiB GO alloc/idle/total, 2.9 GiB/3.7 GiB CGO alloc/total, 208168.4 CGO/sec, 138.5/12.0 %(u/s)time, 0.0 %gc (0x), 1.2 MiB/306 KiB (r/w)net
I200414 03:50:18.763608 3205 sql/rowexec/hashjoiner.go:409  [n2] hash joiner consumed stored side
I200414 03:50:19.323742 193 server/status/runtime.go:498  [n2] runtime stats: 12 GiB RSS, 223 goroutines, 6.9 GiB/1.4 GiB/8.8 GiB GO alloc/idle/total, 2.9 GiB/3.7 GiB CGO alloc/total, 247824.1 CGO/sec, 161.0/11.5 %(u/s)time, 0.0 %gc (0x), 368 KiB/141 KiB (r/w)net
I200414 03:50:21.323765 193 server/status/runtime.go:498  [n2] runtime stats: 12 GiB RSS, 223 goroutines, 7.0 GiB/1.3 GiB/8.8 GiB GO alloc/idle/total, 2.9 GiB/3.7 GiB CGO alloc/total, 114732.5 CGO/sec, 200.0/10.0 %(u/s)time, 0.0 %gc (0x), 364 KiB/141 KiB (r/w)net
I200414 03:50:23.324014 193 server/status/runtime.go:498  [n2] runtime stats: 12 GiB RSS, 223 goroutines, 7.2 GiB/1.1 GiB/8.9 GiB GO alloc/idle/total, 3.0 GiB/3.8 GiB CGO alloc/total, 366028.8 CGO/sec, 236.5/9.0 %(u/s)time, 0.0 %gc (0x), 372 KiB/146 KiB (r/w)net
I200414 03:50:25.317743 3205 sql/rowexec/hashjoiner.go:443  [n2] read probe row 64
...

and without the revert:

I200414 03:55:26.495523 3213 sql/rowexec/aggregator.go:706  [n2] hash aggregator is done
I200414 03:55:28.058395 178 server/status/runtime.go:498  [n2] runtime stats: 12 GiB RSS, 240 goroutines, 6.1 GiB/8.2 MiB/6.5 GiB GO alloc/idle/total(stale), 3.7 GiB/4.7 GiB CGO alloc/total, 2265.8 CGO/sec, 339.3/43.0 %(u/s)time, 0.0 %gc (0x), 4.7 MiB/1.1 MiB (r/w)net
I200414 03:55:28.774948 3215 sql/rowexec/hashjoiner.go:760  [n2] hash joiner initialized stored side
I200414 03:55:30.081823 178 server/status/runtime.go:498  [n2] runtime stats: 12 GiB RSS, 240 goroutines, 6.1 GiB/8.2 MiB/6.5 GiB GO alloc/idle/total(stale), 3.7 GiB/4.8 GiB CGO alloc/total, 2531.8 CGO/sec, 306.9/52.4 %(u/s)time, 0.0 %gc (0x), 6.3 MiB/2.7 MiB (r/w)net
I200414 03:55:32.076264 178 server/status/runtime.go:498  [n2] runtime stats: 12 GiB RSS, 240 goroutines, 6.1 GiB/8.2 MiB/6.5 GiB GO alloc/idle/total(stale), 3.7 GiB/4.8 GiB CGO alloc/total, 3083.6 CGO/sec, 325.9/32.6 %(u/s)time, 0.0 %gc (0x), 2.2 MiB/2.2 MiB (r/w)net
I200414 03:55:34.066268 178 server/status/runtime.go:498  [n2] runtime stats: 12 GiB RSS, 239 goroutines, 7.8 GiB/5.6 MiB/8.3 GiB GO alloc/idle/total, 3.7 GiB/4.8 GiB CGO alloc/total, 3019.0 CGO/sec, 313.1/32.2 %(u/s)time, 0.0 %gc (1x), 3.1 MiB/2.8 MiB (r/w)net
I200414 03:55:35.066522 178 server/status/runtime.go:498  [n2] runtime stats: 12 GiB RSS, 239 goroutines, 6.4 GiB/1.4 GiB/8.3 GiB GO alloc/idle/total, 3.7 GiB/4.8 GiB CGO alloc/total, 5393.7 CGO/sec, 267.9/21.0 %(u/s)time, 0.0 %gc (0x), 1.8 MiB/2.5 MiB (r/w)net
I200414 03:55:37.066158 178 server/status/runtime.go:498  [n2] runtime stats: 12 GiB RSS, 240 goroutines, 6.7 GiB/1.1 GiB/8.3 GiB GO alloc/idle/total, 3.8 GiB/4.8 GiB CGO alloc/total, 3242.0 CGO/sec, 257.5/30.5 %(u/s)time, 0.0 %gc (0x), 3.3 MiB/4.8 MiB (r/w)net
I200414 03:55:39.066175 178 server/status/runtime.go:498  [n2] runtime stats: 12 GiB RSS, 240 goroutines, 7.1 GiB/700 MiB/8.3 GiB GO alloc/idle/total, 3.7 GiB/4.8 GiB CGO alloc/total, 3524.5 CGO/sec, 275.0/22.0 %(u/s)time, 0.0 %gc (0x), 3.1 MiB/3.5 MiB (r/w)net
I200414 03:55:41.066099 178 server/status/runtime.go:498  [n2] runtime stats: 12 GiB RSS, 240 goroutines, 7.4 GiB/378 MiB/8.3 GiB GO alloc/idle/total, 3.8 GiB/4.8 GiB CGO alloc/total, 3997.1 CGO/sec, 222.5/16.0 %(u/s)time, 0.0 %gc (0x), 2.4 MiB/3.1 MiB (r/w)net
I200414 03:55:43.066845 178 server/status/runtime.go:498  [n2] runtime stats: 12 GiB RSS, 240 goroutines, 7.6 GiB/168 MiB/8.3 GiB GO alloc/idle/total, 3.7 GiB/4.8 GiB CGO alloc/total, 129417.4 CGO/sec, 264.9/19.0 %(u/s)time, 0.0 %gc (0x), 1.3 MiB/1.8 MiB (r/w)net
I200414 03:55:45.066965 178 server/status/runtime.go:498  [n2] runtime stats: 13 GiB RSS, 240 goroutines, 7.9 GiB/58 MiB/8.4 GiB GO alloc/idle/total, 3.7 GiB/4.8 GiB CGO alloc/total, 135542.1 CGO/sec, 221.5/28.0 %(u/s)time, 0.0 %gc (0x), 1.6 MiB/2.7 MiB (r/w)net
I200414 03:55:47.067065 178 server/status/runtime.go:498  [n2] runtime stats: 13 GiB RSS, 240 goroutines, 8.1 GiB/15 MiB/8.6 GiB GO alloc/idle/total, 3.8 GiB/4.8 GiB CGO alloc/total, 116029.3 CGO/sec, 220.0/34.5 %(u/s)time, 0.0 %gc (0x), 1.8 MiB/2.8 MiB (r/w)net
I200414 03:55:49.067216 178 server/status/runtime.go:498  [n2] runtime stats: 13 GiB RSS, 242 goroutines, 8.3 GiB/61 MiB/8.9 GiB GO alloc/idle/total, 3.7 GiB/4.8 GiB CGO alloc/total, 234436.5 CGO/sec, 234.5/32.5 %(u/s)time, 0.0 %gc (0x), 1.0 MiB/2.0 MiB (r/w)net
I200414 03:55:51.067325 178 server/status/runtime.go:498  [n2] runtime stats: 13 GiB RSS, 242 goroutines, 8.6 GiB/41 MiB/9.1 GiB GO alloc/idle/total, 3.7 GiB/4.8 GiB CGO alloc/total, 757796.1 CGO/sec, 236.5/26.5 %(u/s)time, 0.0 %gc (0x), 805 KiB/1.8 MiB (r/w)net
I200414 03:55:52.087247 3214 sql/rowexec/aggregator.go:706  [n2] hash aggregator is done
I200414 03:55:53.067311 178 server/status/runtime.go:498  [n2] runtime stats: 13 GiB RSS, 235 goroutines, 8.8 GiB/26 MiB/9.3 GiB GO alloc/idle/total, 3.8 GiB/4.8 GiB CGO alloc/total, 177088.1 CGO/sec, 187.0/29.5 %(u/s)time, 0.0 %gc (0x), 1.8 MiB/2.1 MiB (r/w)net
I200414 03:55:55.067029 178 server/status/runtime.go:498  [n2] runtime stats: 14 GiB RSS, 230 goroutines, 8.8 GiB/56 MiB/9.4 GiB GO alloc/idle/total, 3.7 GiB/4.8 GiB CGO alloc/total, 2639.4 CGO/sec, 155.5/18.5 %(u/s)time, 0.0 %gc (0x), 2.1 MiB/1.5 MiB (r/w)net
I200414 03:55:57.067169 178 server/status/runtime.go:498  [n2] runtime stats: 14 GiB RSS, 228 goroutines, 8.9 GiB/6.8 MiB/9.4 GiB GO alloc/idle/total, 3.7 GiB/4.8 GiB CGO alloc/total, 2455.3 CGO/sec, 70.5/10.0 %(u/s)time, 0.0 %gc (0x), 1.4 MiB/391 KiB (r/w)net
I200414 03:55:59.067183 178 server/status/runtime.go:498  [n2] runtime stats: 14 GiB RSS, 228 goroutines, 8.9 GiB/21 MiB/9.5 GiB GO alloc/idle/total, 3.7 GiB/4.8 GiB CGO alloc/total, 2459.5 CGO/sec, 31.5/11.5 %(u/s)time, 0.0 %gc (0x), 1.4 MiB/391 KiB (r/w)net
I200414 03:56:01.067613 178 server/status/runtime.go:498  [n2] runtime stats: 14 GiB RSS, 228 goroutines, 9.0 GiB/34 MiB/9.5 GiB GO alloc/idle/total, 3.7 GiB/4.8 GiB CGO alloc/total, 2620.0 CGO/sec, 35.5/10.5 %(u/s)time, 0.0 %gc (0x), 1.4 MiB/427 KiB (r/w)net
I200414 03:56:01.130543 3215 sql/rowexec/hashjoiner.go:406  [n2] hash joiner consumed stored side
I200414 03:56:03.067551 178 server/status/runtime.go:498  [n2] runtime stats: 14 GiB RSS, 226 goroutines, 9.3 GiB/36 MiB/9.8 GiB GO alloc/idle/total, 3.7 GiB/4.8 GiB CGO alloc/total, 388438.8 CGO/sec, 95.0/13.5 %(u/s)time, 0.0 %gc (0x), 373 KiB/279 KiB (r/w)net
fatal error: runtime: out of memory

The logs seem to suggest that there is a memory leak introduced by those two commits (the hash aggregator doesn't seem to release some of its memory), but I don't see what has changed other than us calling datum.Size().

yuzefovich commented 4 years ago

I think I figured it out. We need this:

diff --git a/pkg/sql/rowexec/aggregator.go b/pkg/sql/rowexec/aggregator.go
index 7cca780f4f..40cb689a24 100644
--- a/pkg/sql/rowexec/aggregator.go
+++ b/pkg/sql/rowexec/aggregator.go
@@ -636,7 +636,9 @@ func (ag *hashAggregator) emitRow() (
        bucket := ag.bucketsIter[0]
        ag.bucketsIter = ag.bucketsIter[1:]

-       return ag.getAggResults(ag.buckets[bucket])
+       state, row, meta := ag.getAggResults(ag.buckets[bucket])
+       delete(ag.buckets, bucket)
+       return state, row, meta
 }

The thing is that we increased the size of the anyNotNullAggregate struct by embedding another struct that tracks memory. However, we keep a map of buckets containing roughly 500k of those structs, and we do not release references to them until the flow is shut down. Without the memory accounting, the query succeeds, but it runs very close to the crashing point, and the addition of those extra structs was just enough to tip the fragile balance over.