kv: many large, concurrent INSERTs can overload intent resolution and lead to OOM

Describe the problem

Please describe the issue you observed, and any steps we can take to reproduce it:

There are many issues found when we were doing the tpcc data preparation with CockroachDB. This post only focuses on memory issues.

One of the CockroachDB pod was evicted by Kubernetes for exceeding the pod memory limit during the data preparation, therefore the pod get restarted. We assumed it was caused by the Pod QoS class, because we set pod’s memory request (16GB) far lower than the memory limit (26GB). When a node runs out of memory, Kubernetes is more likely to evict the Burstable pod. So we tried setting request=limit (26GB), but the container was OOMKilled after running the data preparation for a while.

The Go heap dump showed that CockroachDB only allocated 4~6GB (varies by run), plus the 26 GB * 25% = 6.5 GB we set aside for RocksDB/pebble cache. There is still around 13 GB headroom before witnessing OOM. We suspected that many unreachable garbage objects were undetected by heap profiling. So we tried lowering the GOGC parameter (which is set to 100 by default) to make CockroachDB more aggressive in cleaning up the garbage, but it didn’t work.

One of the other runs showed that the Go heap usage can reach 11GB, while the RocksDB cache (precisely CGo native) part can reach 15GB despite the 26 GB * 25% = 6.5 GB setting.

A similar issue was reported (*: node crash with OOM despite memory limits set) in 2017, but it was left unresolved.

To Reproduce

What did you do? Describe in your own words.

If possible, provide steps to reproduce the behavior:

Start a 3 nodes cockroach cluster on top of Kubernetes
create database sbtest

Run Percona's sysbench-tpcc benchmark like below:

kubectl --context ${clusterName} run tpcc -it --image=mzhou/sysbench-tpcc --restart=Never --rm /bin/bash
# ./tpcc.lua --pgsql-user=root --pgsql-db=sbtest --time=7200 --threads=128 --report-interval=1 --tables=128 --scale=256 --db-driver=pgsql --pgsql-port=26257 --pgsql-host=cockroachdb --use-fk=0 prepare

Monitor the memory usage from the Cockroach dashboard. The container should be killed within 10 minutes.

We chose to use sysbench-tpcc because we believe third-party testing tools will be more impartial.

Environment: Hardware

Stanard_fs16_v2 * 3 nodes in a single Azure region
Each node:
- 16 vCPU
- 32 GB RAM
- 512GB UltraSSD, IOPS: 5400, b/w: 300 MBps

Software

Cockroach version: 21.2.14, w/o security
Azure managed kubernetes
Each cockroach pod:
- Request=limit 14 CPU
- Request=limit 26 GB memory

Expected behavior

Cockroachdb process shouldn't crash
The heap usage shouldn't exceed 26 * 25% = 6.5 GB
The native memory usage shouldn't exceed 26 * 25% = 6.5 GB
The rest of the memory usage shouldn't cost more than 500MB

Jira issue: CRDB-13059

Hello, I am Blathers. I am here to help you get the issue triaged.

Hoot - a bug! Though bugs are the bane of my existence, rest assured the wretched thing will get the best of care here.

I have CC'd a few people who may be able to assist you:

@jordanlewis (commented on cockroachdb/cockroach#20609)
@knz (commented on cockroachdb/cockroach#20609)
@cockroachdb/storage (found keywords: RocksDB,pebble)

If we have not gotten back to your issue within a few business days, you can try the following:

Join our community slack channel and ask on #cockroachdb.
Try find someone from here if you know they worked closely on the area and CC them.

_{:owl: Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.}

This doc says that the memory usage of Cockroach can be divided into the following three parts.

Go allocated memory: Memory allocated by the Go runtime to support query processing and various caches maintained in Go by CockroachDB.
CGo allocated memory: Memory allocated by the C/C++ libraries linked into CockroachDB and primarily concerns the block caches for the Pebble storage engine). This is the allocation specified with --cache. The size of CGo allocated memory is usually very close to the configured --cache size.
Overhead: The RSS (resident set size) minus Go/CGo allocated memory.

The following is the memory usage dashboard of a crash node. Right before the crashing, the Go allocated part took 10.85GB (varies by run, the range is 6 ~11GB), which includes both live objects and unclaimed garbage. The CGo part took 6.77GB. There are still 24.55 - 10.85 - 6.77 = 6.93 GB gap taken by the overhead for this particular run.

After reproducing the crash many times, the following clues were found in the logs of oom-killer:

cockroach's RSS exceeds the set 26GB limit
oom often appears when allocating page cache

kernel: cockroach invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=-997
kernel: CPU: 10 PID: 4771 Comm: cockroach Not tainted 5.4.0-1067-azure #70~18.04.1-Ubuntu
kernel: Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS Hyper-V UEFI Release v4.1 10/27/2020
kernel: Call Trace:
kernel:  dump_stack+0x57/0x6d
kernel:  dump_header+0x4f/0x200
kernel:  oom_kill_process+0xe6/0x120
kernel:  out_of_memory+0x117/0x540
kernel:  mem_cgroup_out_of_memory+0xbb/0xd0
kernel:  try_charge+0x6bf/0x7c0
kernel:  ? __alloc_pages_nodemask+0x153/0x320
kernel:  mem_cgroup_try_charge+0x75/0x190
kernel:  mem_cgroup_try_charge_delay+0x22/0x50
kernel:  __handle_mm_fault+0x943/0x1330
kernel:  ? schedule+0x33/0xa0
kernel:  handle_mm_fault+0xb7/0x200
kernel:  __do_page_fault+0x29c/0x4c0
kernel:  ? __switch_to_asm+0x34/0x70
kernel:  do_page_fault+0x35/0x110
kernel:  page_fault+0x39/0x40
kernel: RIP: 0010:copy_user_enhanced_fast_string+0xe/0x30
kernel: Code: 89 d1 c1 e9 03 83 e2 07 f3 48 a5 89 d1 f3 a4 31 c0 0f 01 ca c3 0f 1f 80 00 00 00 00 0f 01 cb 83 fa 40 0f 82 70 ff ff ff 89 d1 <f3> a4 31 c0 0f 01 ca c3 66 2e 0f 1f 84 00 00 00 00 00 89 d1
kernel: RSP: 0018:ffff9ad704697c80 EFLAGS: 00050206
kernel: RAX: 00007f328f69a576 RBX: ffff8b157f19a000 RCX: 0000000000000576
kernel: RDX: 0000000000001000 RSI: ffff8b157f19aa8a RDI: 00007f328f69a000
kernel: RBP: ffff9ad704697c88 R08: ffff9ad704697e10 R09: 0000000000000001
kernel: R10: ffff9ad70e317b98 R11: 0000000000000003 R12: 0000000000001000
kernel: R13: ffff9ad704697e20 R14: 0000000000001000 R15: 000000000000aa96
kernel:  ? copyout+0x26/0x30
kernel:  copy_page_to_iter+0x179/0x310
kernel:  generic_file_read_iter+0x4bf/0xd30
kernel:  ? file_fdatawait_range+0x30/0x30
kernel:  ext4_file_read_iter+0x56/0xf0
kernel:  new_sync_read+0x122/0x1b0
kernel:  __vfs_read+0x29/0x40
kernel:  vfs_read+0x8e/0x130
kernel:  ksys_pread64+0x76/0x90
kernel:  __x64_sys_pread64+0x1e/0x20
kernel:  do_syscall_64+0x5e/0x200
kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
kernel: RIP: 0033:0x521caa
kernel: Code: e8 9b ec f9 ff 48 8b 7c 24 10 48 8b 74 24 18 48 8b 54 24 20 4c 8b 54 24 28 4c 8b 44 24 30 4c 8b 4c 24 38 48 8b 44 24 08 0f 05 <48> 3d 01 f0 ff ff 76 20 48 c7 44 24 40 ff ff ff ff 48 c7 44
kernel: RSP: 002b:000000c175088010 EFLAGS: 00000202 ORIG_RAX: 0000000000000011
kernel: RAX: ffffffffffffffda RBX: 000000c00008c000 RCX: 0000000000521caa
kernel: RDX: 0000000000077fca RSI: 00007f328f68eae0 RDI: 00000000000013a8
kernel: RBP: 000000c175088078 R08: 0000000000000000 R09: 0000000000000000
kernel: R10: 00000000007e556a R11: 0000000000000202 R12: 000000000000002c
kernel: R13: 000000000000002d R14: 000000000000005e R15: 00007f3290b98c1a
kernel: memory: usage 26214400kB, limit 26214400kB, failcnt 5967298
kernel: memory+swap: usage 0kB, limit 9007199254740988kB, failcnt 0
kernel: kmem: usage 164168kB, limit 9007199254740988kB, failcnt 0
kernel: Memory cgroup stats for /kubepods/pod88925984-2ff3-4d7a-8cf5-643d382a58b0:
kernel: anon 26668752896
        file 5009408
        kernel_stack 4939776
        slab 69935104
        sock 0
        shmem 0
        file_mapped 540672
        file_dirty 0
        file_writeback 0
        anon_thp 5022679040
        inactive_anon 0
        active_anon 26669301760
        inactive_file 3158016
        active_file 0
        unevictable 0
        slab_reclaimable 59940864
        slab_unreclaimable 9994240
        pgfault 251812214
        pgmajfault 0
        workingset_refault 3215933
        workingset_activate 524174
        workingset_nodereclaim 33
        pgrefill 2207339
        pgscan 22319161
        pgsteal 11889145
        pgactivate 139085
        pgdeactivate 2207173
        pglazyfree 0
        pglazyfreed 0
        thp_fault_alloc 127484
        thp_collapse_alloc 670
kernel: Tasks state (memory values in pages):
kernel: [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
kernel: [  13635] 65535 13635      241        1    28672        0          -998 pause
kernel: [  25226]     0 25226 12836589  6535336 94203904        0          -997 cockroach
kernel: oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=d99293475009fd398e2ea74fb6006aa41420356570599c59536a9c3832e29495,mems_allowed=0,oom_memcg=/kubepods/pod88925984-2ff3-4d7a-8cf5-643d382
kernel: Memory cgroup out of memory: Killed process 25226 (cockroach) total-vm:51346356kB, anon-rss:26028716kB, file-rss:112628kB, shmem-rss:0kB, UID:0 pgtables:91996kB oom_score_adj:-997
kernel: oom_reaper: reaped process 25226 (cockroach), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

According to the Linux kernel's cgroups documentation, when a page of the page cache is loaded into physical memory, it is accounted for in the RSS of the user process. In order to verify the hypotheis that page cache occupied the rest memory, the following experiment was done: run a task on the cockroach pod that periodically drops the page cache from the system. It turned out that oom-killing was mitigated.

Page cache

In order to figure out which part of the cockroach logic loads a lot of page cache, a perf profiling was done. The results showed that pebble always load sstable files into page cache. In our particular use case, there were heavy page cache loading when pebble was doing compactions. Cockroach read a lot of small sstables, did merge, then wrote them into large sstables. The whole process does not use direct I/O. Also, cockroach has no way to enable use_direct_io_for_flush_and_compaction when using the RocksDB engine. Below is the stack trace of one page load.

ffffffff9401be21 ondemand_readahead+0x1 ([kernel.kallsyms])
ffffffff940122fe generic_file_read_iter+0x85e ([kernel.kallsyms])
ffffffff94186d36 ext4_file_read_iter+0x56 ([kernel.kallsyms])
ffffffff940c57e2 new_sync_read+0x122 ([kernel.kallsyms])
ffffffff940c8999 __vfs_read+0x29 ([kernel.kallsyms])
ffffffff940c8a3e vfs_read+0x8e ([kernel.kallsyms])
ffffffff940c8de6 ksys_pread64+0x76 ([kernel.kallsyms])
ffffffff940c8e1e __x64_sys_pread64+0x1e ([kernel.kallsyms])
ffffffff93e0272e do_syscall_64+0x5e ([kernel.kallsyms])
ffffffff94a0008c entry_SYSCALL_64_after_hwframe+0x44 ([kernel.kallsyms])
521caa           syscall.Syscall6+0x2a (/cockroach/cockroach)
54165f           internal/poll.(*FD).Pread+0xdf (/cockroach/cockroach)
54ad74           os.(*File).ReadAt+0x114 (/cockroach/cockroach)
169c852          github.com/cockroachdb/pebble/vfs.(*enospcFile).ReadAt+0x72 (/cockroach/cockroach)
169b6a8          github.com/cockroachdb/pebble/vfs.(*fdFileWrapper).ReadAt+0x68 (/cockroach/cockroach)
16eb2ab          github.com/cockroachdb/pebble/sstable.(*Reader).readBlock+0x1cb (/cockroach/cockroach)
16e4009          github.com/cockroachdb/pebble/sstable.(*singleLevelIterator).loadBlock+0x169 (/cockroach/cockroach)
16e7688          github.com/cockroachdb/pebble/sstable.(*compactionIterator).skipForward+0x148 (/cockroach/cockroach)
16e7497          github.com/cockroachdb/pebble/sstable.(*compactionIterator).Next+0x77 (/cockroach/cockroach)
1748f91          github.com/cockroachdb/pebble.(*levelIter).Next+0x1f1 (/cockroach/cockroach)
174cf3c          github.com/cockroachdb/pebble.(*mergingIter).nextEntry+0x7c (/cockroach/cockroach)
174f935          github.com/cockroachdb/pebble.(*mergingIter).Next+0x75 (/cockroach/cockroach)
171aa38          github.com/cockroachdb/pebble.(*compactionIter).iterNext+0x38 (/cockroach/cockroach)
171ab2f          github.com/cockroachdb/pebble.(*compactionIter).nextInStripe+0x2f (/cockroach/cockroach)
171af2e          github.com/cockroachdb/pebble.(*compactionIter).setNext+0x1ee (/cockroach/cockroach)
171a088          github.com/cockroachdb/pebble.(*compactionIter).Next+0x4a8 (/cockroach/cockroach)
1715196          github.com/cockroachdb/pebble.(*DB).runCompaction+0xb76 (/cockroach/cockroach)
1713d05          github.com/cockroachdb/pebble.(*DB).compact1+0x1c5 (/cockroach/cockroach)
176b392          github.com/cockroachdb/pebble.(*DB).compact.func1+0x92 (/cockroach/cockroach)
c4172f           runtime/pprof.Do+0xcf (/cockroach/cockroach)
1713b25          github.com/cockroachdb/pebble.(*DB).compact+0xa5 (/cockroach/cockroach)
4c5cc1           runtime.goexit+0x1 (/cockroach/cockroach)

Go allocated memory

Meanwhile, the memory usage of Go allocated part is much larger than expected. The following CockroachDB data structures occupy the majority of the heap, according to heap dumps:

LocksAsLockUpdates (*Transaction) returns a large number of LockUpdate objects.
The Pebble engine's (*Batch) Apply stores a vast amount of byte data.
Pebble allocates a lot of memory to Skiplist
In mvcc.go, (b *putBuffer) marshalMetabyte returns huge byte[]

See this image for detailed information.

Triage:

Sysbench-tpcc insert large bulks of data into stock tables. A single insertion can include up to 100K entries, or around 36MB. The load of these data requires around 4.6GB of RAM if 128 threads are all active. Cockroach consumes these bulks by allocating a large amount of heap space rather than operating them one-by-one like consuming a stream.

Cgo allocated memory

Although Pebble added a cache implementation to limit the size of Cgo allocated memory, we still noticed cases where the Cgo allocated part exceeded the limit. Below are the Cgo memory stats generated by CockroachDB.

memstats.2022-02-06T01_06_37.010.26758971392.txt:  "CGoAllocatedBytes": 13427655920,
memstats.2022-02-06T01_06_26.963.26393677824.txt:  "CGoAllocatedBytes": 12980342080,
memstats.2022-02-05T08_37_43.929.27757682688.txt:  "CGoAllocatedBytes": 12520257920,
memstats.2022-02-06T01_06_16.968.25532248064.txt:  "CGoAllocatedBytes": 12285956608,
memstats.2022-02-05T08_37_33.941.27373121536.txt:  "CGoAllocatedBytes": 12251580896,

Jemalloc profile only gives us the leaf function who allocated the memory:

Total: 7443.8 MB  
7443.3 100.0% 100.0%   7443.3 100.0% _cgo_82be0c43bdb2_Cfunc_calloc
     0.5   0.0% 100.0%      0.5   0.0% _GLOBAL__sub_I_eh_alloc.cc
     0.0   0.0% 100.0%      0.5   0.0% 00007ffa5b84441d
     0.0   0.0% 100.0%      0.5   0.0% __libc_csu_init
     0.0   0.0% 100.0%      0.5   0.0% _start
     0.0   0.0% 100.0%   7443.3 100.0% runtime.asmcgocall

Perf profiling shows that majority of the native memory allocation come from sstable reading:

100.00%   100.00%  (45858c0)
            |
            ---runtime.goexit
               |
               |--67.01%--github.com/cockroachdb/pebble.(*DB).compact
               |          runtime/pprof.Do
               |          github.com/cockroachdb/pebble.(*DB).compact.func1
               |          github.com/cockroachdb/pebble.(*DB).compact1
               |          github.com/cockroachdb/pebble.(*DB).runCompaction
               |          |
               |          |--66.05%--github.com/cockroachdb/pebble.(*compactionIter).Next
               |          |          |
               |          |          |--63.48%--github.com/cockroachdb/pebble.(*compactionIter).setNext
               |          |          |          github.com/cockroachdb/pebble.(*compactionIter).nextInStripe
               |          |          |          github.com/cockroachdb/pebble.(*compactionIter).iterNext
               |          |          |          github.com/cockroachdb/pebble.(*mergingIter).Next
               |          |          |          github.com/cockroachdb/pebble.(*mergingIter).nextEntry
               |          |          |          |
               |          |          |          |--39.49%--github.com/cockroachdb/pebble.(*levelIter).Next
               |          |          |          |          |
               |          |          |          |           --39.31%--github.com/cockroachdb/pebble/sstable.(*compactionIterator).Next
               |          |          |          |                     github.com/cockroachdb/pebble/sstable.(*compactionIterator).skipForward
               |          |          |          |                     github.com/cockroachdb/pebble/sstable.(*singleLevelIterator).loadBlock
               |          |          |          |                     github.com/cockroachdb/pebble/sstable.(*Reader).readBlock
               |          |          |          |                     |
               |          |          |          |                     |--17.94%--github.com/cockroachdb/pebble/sstable.decompressBlock
               |          |          |          |                     |          github.com/cockroachdb/pebble/internal/cache.newValue
               |          |          |          |                     |          github.com/cockroachdb/pebble/internal/manual.New
               |          |          |          |                     |          github.com/cockroachdb/pebble/internal/manual._Cfunc_calloc
               |          |          |          |                     |          _cgo_82be0c43bdb2_Cfunc_calloc
               |          |          |          |                     |
               |          |          |          |                     |--17.94%--github.com/cockroachdb/pebble/internal/cache.newValue
               |          |          |          |                     |          github.com/cockroachdb/pebble/internal/manual.New
               |          |          |          |                     |          github.com/cockroachdb/pebble/internal/manual._Cfunc_calloc
               |          |          |          |                     |          _cgo_82be0c43bdb2_Cfunc_calloc
               |          |          |          |                     |
               |          |          |          |                      --3.42%--github.com/cockroachdb/pebble/internal/cache.(*Cache).Set
               |          |          |          |                                github.com/cockroachdb/pebble/internal/cache.(*shard).Set
               |          |          |          |                                github.com/cockroachdb/pebble/internal/cache.newEntry
               |          |          |          |                                github.com/cockroachdb/pebble/internal/cache.entryAllocNew
               |          |          |          |                                github.com/cockroachdb/pebble/internal/cache.(*entryAllocCache).alloc
               |          |          |          |                                github.com/cockroachdb/pebble/internal/manual.New
               |          |          |          |                                github.com/cockroachdb/pebble/internal/manual._Cfunc_calloc
               |          |          |          |                                _cgo_82be0c43bdb2_Cfunc_calloc
               |          |          |          |
               |          |          |           --23.98%--github.com/cockroachdb/pebble/sstable.(*compactionIterator).Next
               |          |          |                     github.com/cockroachdb/pebble/sstable.(*compactionIterator).skipForward
               |          |          |                     github.com/cockroachdb/pebble/sstable.(*singleLevelIterator).loadBlock
               |          |          |                     github.com/cockroachdb/pebble/sstable.(*Reader).readBlock
               |          |          |                     |
               |          |          |                     |--11.31%--github.com/cockroachdb/pebble/sstable.decompressBlock
               |          |          |                     |          github.com/cockroachdb/pebble/internal/cache.newValue
               |          |          |                     |          github.com/cockroachdb/pebble/internal/manual.New
               |          |          |                     |          github.com/cockroachdb/pebble/internal/manual._Cfunc_calloc
               |          |          |                     |          _cgo_82be0c43bdb2_Cfunc_calloc
               |          |          |                     |
               |          |          |                     |--11.31%--github.com/cockroachdb/pebble/internal/cache.newValue
               |          |          |                     |          github.com/cockroachdb/pebble/internal/manual.New
               |          |          |                     |          github.com/cockroachdb/pebble/internal/manual._Cfunc_calloc
               |          |          |                     |          _cgo_82be0c43bdb2_Cfunc_calloc
               |          |          |                     |
               |          |          |                      --1.36%--github.com/cockroachdb/pebble/internal/cache.(*Cache).Set
               |          |          |                                github.com/cockroachdb/pebble/internal/cache.(*shard).Set
               |          |          |                                github.com/cockroachdb/pebble/internal/cache.newEntry
               |          |          |                                github.com/cockroachdb/pebble/internal/cache.entryAllocNew
               |          |          |                                github.com/cockroachdb/pebble/internal/cache.(*entryAllocCache).alloc
               |          |          |                                github.com/cockroachdb/pebble/internal/manual.New
               |          |          |                                github.com/cockroachdb/pebble/internal/manual._Cfunc_calloc
               |          |          |                                _cgo_82be0c43bdb2_Cfunc_calloc
        |
               |--24.29%--github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx.func2
               |          |
               |          |--19.29%--github.com/cockroachdb/cockroach/pkg/internal/client/requestbatcher.(*RequestBatcher).sendBatch.func1
               |          |          |
               |          |          |--17.42%--github.com/cockroachdb/cockroach/pkg/internal/client/requestbatcher.(*RequestBatcher).sendBatch.func1.1
               |          |          |          github.com/cockroachdb/cockroach/pkg/kv.(*CrossRangeTxnWrapperSender).Send
               |          |          |          github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*DistSender).Send
               |          |          |          github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*DistSender).divideAndSendBatchToRanges
               |          |          |          github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*DistSender).sendPartialBatch
               |          |          |          github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*DistSender).sendToReplicas
               |          |          |          github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*grpcTransport).SendNext
               |          |          |          github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*grpcTransport).sendBatch
               |          |          |          github.com/cockroachdb/cockroach/pkg/rpc.(*internalClientAdapter).Batch
               |          |          |          github.com/cockroachdb/cockroach/pkg/server.(*Node).Batch
               |          |          |          github.com/cockroachdb/cockroach/pkg/server.(*Node).batchInternal
               |          |          |          github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunTaskWithErr
               |          |          |          github.com/cockroachdb/cockroach/pkg/server.(*Node).batchInternal.func1
               |          |          |          github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Stores).Send
               |          |          |          github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).Send
               |          |          |          github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).sendWithRangeID
               |          |          |          github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).executeBatchWithConcurrencyRetries
               |          |          |          github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).executeWriteBatch
               |          |          |          github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).evalAndPropose
               |          |          |          github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).requestToProposal
               |          |          |          github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).evaluateProposal
               |          |          |          github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).evaluateWriteBatch
               |          |          |          github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).evaluateWriteBatchWithServersideRefreshes
               |          |          |          github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).evaluateWriteBatchWrapper
               |          |          |          github.com/cockroachdb/cockroach/pkg/kv/kvserver.evaluateBatch
               |          |          |          github.com/cockroachdb/cockroach/pkg/kv/kvserver.evaluateCommand
               |          |          |          github.com/cockroachdb/cockroach/pkg/kv/kvserver/batcheval.ResolveIntent
               |          |          |          github.com/cockroachdb/cockroach/pkg/storage.MVCCResolveWriteIntent
               |          |          |          |
               |          |          |           --17.41%--github.com/cockroachdb/cockroach/pkg/storage.(*intentInterleavingIter).SeekIntentGE
               |          |          |                     |
               |          |          |                     |--9.71%--github.com/cockroachdb/cockroach/pkg/storage.(*pebbleIterator).SeekEngineKeyGEWithLimit
               |          |          |                     |          github.com/cockroachdb/pebble.(*Iterator).SeekPrefixGE
               |          |          |                     |          github.com/cockroachdb/pebble.(*mergingIter).SeekPrefixGE
               |          |          |                     |          github.com/cockroachdb/pebble.(*mergingIter).seekGE
               |          |          |                     |          github.com/cockroachdb/pebble.(*levelIter).SeekPrefixGE
               |          |          |                     |          |
               |          |          |                     |           --9.57%--github.com/cockroachdb/pebble/sstable.(*singleLevelIterator).SeekPrefixGE
               |          |          |                     |                     github.com/cockroachdb/pebble/sstable.(*singleLevelIterator).seekPrefixGE

Summary

Cockroach needs a sophisticated heap management system to limit the footprint of every single query it serves.
Cockroach should avoid memory leak in its native memory
Cockroach’s usage of the system page cache is unbounded. Cockroach should find a way to leverage direct I/O for flushing and compactions because those loaded pages are usually discarded soon. Also, the frequent page cache loading and unloading (if it's not oom killed) would significantly decrease the page cache hit ratio thus hurting the performance of regular queries, and sometimes even causing system thrashing.

The thrust of this issue is absolutely correct: cockroach should not OOM and should manage its heap usage well is completely legitimate. We've been fighting that fight for a long time (https://github.com/cockroachdb/cockroach/issues/10320) and will continue to fight that fight and improve out memory accounting. The heap profile you're providing here is very valuable. The go heap usage you see there related to intent resolution is a good find and the reproduction steps are much appreciated.

Clearly there's an issue here related to memory usage underneath the asynchronous intent resolver. Not excusing the OOM, it should never happen, I suspect that sysbench-tpcc may be running into some pathological behavior related to contention and concurrency control. Analyzing that workload on its own is likely to yield to some interesting discoveries. All that being said, I think there's definitely some things to do here both to bound the memory usage in the intent resolver and account for that in our memory monitoring framework.

Regarding the performance implications of thrashing the page cache, that's out of my area of expertise. It seems that there was some exploration at some point (https://github.com/cockroachdb/cockroach/issues/13620).

The huge amount of memory allocated due to LockAsLockUpdates is startling. We generally don't expect Transaction messages to be huge. One immediate thing I'm seeing is that we are not properly configuring the IntentResolver to provide backpressure based on memory usage. Given the scale of the heap usage we're seeing here, I have a feeling that fixing that alone might head off this OOM.

Here we configure the IntentResolver: https://github.com/cockroachdb/cockroach/blob/7b7da490b9763e08e3f8f915f601ce69847f4549/pkg/kv/kvserver/intentresolver/intent_resolver.go#L228-L236

However, we neglect to set MaxSizePerBatch and we leave the InflightBackpressureLimit at its default, meaning we can have 1000 arbitrarily large batches in flight. We happen to know from the heap profile images that some of these requests are potentially individually very large.

https://github.com/cockroachdb/cockroach/blob/7b7da490b9763e08e3f8f915f601ce69847f4549/pkg/internal/client/requestbatcher/batcher.go#L107-L109

https://github.com/cockroachdb/cockroach/blob/7b7da490b9763e08e3f8f915f601ce69847f4549/pkg/internal/client/requestbatcher/batcher.go#L148-L151

Ideally we'd do some proper memory monitoring which could dynamically tune the concurrency and track the usage. We're in only the earliest days of that sort of heap management. Most of our higher level heap management revolves around failing queries which exceed the available budget and bounding the working set size for everything else. Here we fail to do either.

I suspect we can come up with something for backport to avoid the more pathological cases we're seeing here. I want to more deeply understand what it means for the allocated block sizes in that heap profile to be so big.

Thank you for opening this great issue, @coderplay!

Okay, we've taken a slightly deeper look. We see this is the bulk loading phase of the workload where there's 100000 inserts in a batch.

Can I ask you to run a quick experiment if it's not too much hassle: Try setting SET CLUSTER SETTING kv.transaction.parallel_commits_enabled = false;. If the problem goes away, great, we have a clear view of where our heuristics on running asynchronous consensus are failing us. If not, that's a wonderful data point for us to dig deeper.

Currently we're thinking there's a confluence of things here whereby optimizations to go fully async in replication for certain operations remained in effect when they shouldn't. Given that, decision, the system commits to doing a lot of asynchronous work and not providing any backpressure to the client. It's certainly not the intention of the code to be hitting these asynchronous code paths for bulk writing operations.

The detail in this issue is great. Thanks especially for the heap profiles and workload details. They made the process of unpacking this issue much easier.

+1 to what Andrew said. It would be useful to check whether SET CLUSTER SETTING kv.transaction.parallel_commits_enabled = false; helps avoid the OOM. We suspect that there is an asynchronous code path that is not sufficiently pushing back on the client when async work (async intent resolution) builds.

I've taken a look at the workload in an attempt to reproduce myself. It's worth pointing out what makes this workload unusual. sysbench-tpcc does two moderately unusual things. First, it imports entire million-row tables for a TPC-C dataset with 256 warehouses (--scale=256) in single INSERT statements. Second, it imports 128 copies of this dataset (--tables=128) back-to-back. So it's rapidly performing many bulk insert transactions. This is a good stress test because of the mix of the scale and concurrency of these transactions.

We suspect that there is an asynchronous code path that is not sufficiently pushing back on the client when async work (async intent resolution) builds.

Specifically, we suspect that the ingestion is running into issues because of: https://github.com/cockroachdb/cockroach/blob/3ac48dbccf121bdf7921de97d8b9619ba7693109/pkg/kv/kvclient/kvcoord/txn_interceptor_committer.go#L472-L474

Here's some evidence that this is the case. When running the test on my local machine, I see a constant 129 open SQL connections:

However, we see a gradual increase in goroutines as the ingestion runs, indicating a buildup of async work:

A goroutine dump shows the following:

We see about 1000 (intentresolver.defaultTaskLimit) goroutines in the asyncronous path of IntentResolver.CleanupTxnIntentsAsync, with another ~900 in the synchronous path. This synchronous path is meant to push back on the client. However, it is run from the asynchronous path (currently the only path) of txnCommitter.makeTxnCommitExplicitAsync, so there's no pushback and the async cleanup work can't keep up with the ingestion load.

Clearly, something needs to give here. The first would be to address the TODO and limit the amount of async work we can run concurrently from makeTxnCommitExplicitAsync. Beyond that, the current concurrency limit in the IntentResolver feels lacking. It doesn't directly account for the cost to cleanup a transaction, beyond the fact that a more expensive cleanup will hold on to the semaphore for longer. This is not sufficient to bound memory usage in the IntentResolver.

@nvanbenschoten @ajwerner sysbench client crashed after setting SET CLUSTER SETTING kv.transaction.parallel_commits_enabled = false;

I guess you didn't see the oom because client side was down

sysbench client crashed after setting SET CLUSTER SETTING kv.transaction.parallel_commits_enabled = false;

Why did the client crash?

sysbench client crashed after setting SET CLUSTER SETTING kv.transaction.parallel_commits_enabled = false;

Why did the client crash?

Still figuring it out, sysbench faced a fatal error from cockroachdb

(gdb) bt
#0  __strlen_avx2 () at ../sysdeps/x86_64/multiarch/strlen-avx2.S:65
#1  0x00007f750b7b0e3f in __GI___strdup (s=0x0) at strdup.c:41
#2  0x000055ae2183d418 in pgsql_check_status (con=con@entry=0x7f7358000dd0, pgres=pgres@entry=0x7f735800a690, funcname=funcname@entry=0x55ae21847bca "PQexec",
    query=query@entry=0x7f73580169d0 "INSERT INTO stock1 (s_i_id, s_w_id, s_quantity, s_dist_01, s_dist_02, s_dist_03, s_dist_04, s_dist_05, s_dist_06,  s_dist_07, s_dist_08, s_dist_09, s_dist_10, s_ytd, s_order_cnt, s_remote_cnt, s_data)"..., rs=rs@entry=0x7f7358000df8) at drv_pgsql.c:530
#3  0x000055ae2183d56e in pgsql_drv_query (sb_conn=0x7f7358000dd0,
    query=0x7f73580169d0 "INSERT INTO stock1 (s_i_id, s_w_id, s_quantity, s_dist_01, s_dist_02, s_dist_03, s_dist_04, s_dist_05, s_dist_06,  s_dist_07, s_dist_08, s_dist_09, s_dist_10, s_ytd, s_order_cnt, s_remote_cnt, s_data)"..., len=<optimized out>, rs=0x7f7358000df8) at drv_pgsql.c:690
#4  0x000055ae2182f39f in db_query (con=con@entry=0x7f7358000dd0, query=<optimized out>, len=<optimized out>) at db_driver.c:587
#5  0x000055ae2182f487 in db_bulk_do_insert (con=0x7f7358000dd0, is_last=0) at db_driver.c:940
#6  0x000055ae2182fae3 in db_bulk_insert_next (con=0x7f7358000dd0,
    query=0x45bcb5d0 "(62776, 119, 46,'", 'e' <repeats 24 times>, "','", 'h' <repeats 24 times>, "','", 'x' <repeats 24 times>, "','", 'l' <repeats 24 times>, "','", 'r' <repeats 24 times>, "','", 'n' <repeats 24 times>, "','", 'f' <repeats 21 times>..., query_len=333) at db_driver.c:915
#7  0x00007f750ba6158f in lj_vm_ffi_call () at buildvm_x86.dasc:2576
#8  0x00007f750ba9f16f in lj_ccall_func (L=<optimized out>, cd=<optimized out>) at lj_ccall.c:1314
#9  0x00007f750baa16ad in lj_cf_ffi_meta___call (L=0x41488378) at lib_ffi.c:230
#10 0x00007f750ba5f5e7 in lj_BC_FUNCC () at buildvm_x86.dasc:809
#11 0x00007f750bab378f in lua_pcall (L=L@entry=0x41488378, nargs=nargs@entry=1, nresults=nresults@entry=1, errfunc=errfunc@entry=0) at lj_api.c:1158
#12 0x000055ae21834a2b in call_custom_command (L=0x41488378) at sb_lua.c:1495
#13 call_custom_command (L=0x41488378) at sb_lua.c:1475
#14 0x000055ae21834ab8 in cmd_worker_thread (arg=<optimized out>) at sb_lua.c:1527
#15 0x00007f750b8f2ea7 in start_thread (arg=<optimized out>) at pthread_create.c:477
#16 0x00007f750b820def in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

(gdb) print *pgres
$1 = {ntups = 0, numAttributes = 0, attDescs = 0x0, tuples = 0x0, tupArrSize = 0, numParameters = 0, paramDescs = 0x0, resultStatus = PGRES_FATAL_ERROR,
  cmdStatus = "\000NSERT 0 1567\000\000Xs\177\000\000\a\000\000\000l_i_\200\245\000Xs\177\000\000\a\000\000\000s\177\000\000\000Z\000Xs\177\000\000\a\000\000\000e\000po\360\234\000X", binary = 0, noticeHooks = {noticeRec = 0x7f750bae6510 <defaultNoticeReceiver>, noticeRecArg = 0x0,
    noticeProc = 0x55ae2183d1f0 <empty_notice_processor>, noticeProcArg = 0x0}, events = 0x0, nEvents = 0, client_encoding = 6,
  errMsg = 0x7f73580068b8 "server closed the connection unexpectedly\n\tThis probably means the server terminated abnormally\n\tbefore or while processing the request.\n", errFields = 0x0, errQuery = 0x0, null_field = "", curBlock = 0x7f73580068b0, curOffset = 146, spaceLeft = 1902, memorySize = 2264}

Looks like cockroach closed the connection.

This is all surprising and smells of some sort of client misconfiguration. You can set that cluster setting independently of the test; cluster settings are global and shared. Consider setting it once on your cluster and then running the unmodified workload.

Or, perhaps you're saying you did that. In which case, is there anything interesting going on in the cockroach logs?

Or, perhaps you're saying you did that. In which case, is there anything interesting going on in the cockroach logs?

Yep, I set the cluster setting ahead of time in another session. There are so many warns and errors in the logs, I was lost :)

I was able to reproduce the issue on similar hardware with 128 client threads. Then dropped the client concurrency down to 32 threads and still saw the overload + OOMs. During the tests, I saw a few interesting results:

The test is configured to only use a single SQL gateway, so this is the one that always OOMs.
I saw sysbench segfault at least once when the CRDB gateway was OOM killed. I suspect that's what @coderplay saw above ("cockroach closed the connection").
I saw CRDB OOM even before the IntentResolver reached its 1000 async task limit and begin pushing back on EndTxn requests. This checks out, as we bound intent memory per txn at 4MB (kv.transaction.max_intents_bytes, recently increased from 256KB in defd0de), so even 1000 async tasks could account for 4GB of heap-allocated memory.
There is some memory amplification here as intents are resolved. For instance, making LockUpdate.Txn nullable would halve the footprint of the heap allocations we see in LocksAsLockUpdates. Or better yet, we could rework the contract of IntentResolver.ResolveIntents to avoid the eager *Transaction -> []LockUpdate translation entirely.
The fact that we see OOMs without any IntentResolver pushback indicates that pushing back on async parallel commit work is insufficient to avoid the OOM, which is what @coderplay saw above when he disabled parallel commits. We'll probably want to make IntentResolver's async concurrency a function of each task's memory footprint so that it starts pushing back earlier when each individual task is expensive.

I then switched from testing on v21.2.5 to testing on master. On master, things behaved better and I did not see an OOM with 32 client threads. This appears to be due to admission control, which is disabled by default on v21.2 but enabled by default on master (and in the upcoming v22.1). As the number of async tasks grows, admission control detects the overload through Go scheduler run queue lengths ("Runnable Goroutines per CPU") and begins pushing back on foreground traffic.

On v21.2.5, admission control can be enabled with the following three statements:

SET CLUSTER SETTING admission.kv.enabled = true;
SET CLUSTER SETTING admission.sql_kv_response.enabled = true;
SET CLUSTER SETTING admission.sql_sql_response.enabled = true;

However, when I jumped back up to 128 client threads, I did eventually still see an OOM. So while admission control helps, it's not a full replacement for individual components protecting themselves by enforcing memory limits.

Here are some takeaways for how we can improve the situation:

limit IntentResolver's aggregate memory usage across all async tasks
avoid the *Transaction -> []LockUpdate translation when calling IntentResolver.ResolveIntents using some form of polymorphism
limit parallel commit async work. The way I have this prototyped right now is to limit the number of concurrent parallel commits. So instead of initiating the parallel commit and then running its STAGING -> COMMITTED transition synchronously, fall back to standard 2-phase commit even earlier. This makes more sense, as there's little reason to initiate a parallel commit that won't see the corresponding latency benefit

I'll pull these into separate issues.

@coderplay what are your intentions for this issue? You've demonstrated a data ingestion workload that currently overloads CRDB which we can work to resolve, but I want to make sure you're not stuck waiting for that. If this is blocking you, you can drop the client concurrency during sysbench's prepare phase (e.g. --threads=16). It would also help during the prepare and run phase to use all three CRDB nodes SQL gateways by sticking HAProxy between sysbench and Cockroach.

@nvanbenschoten

The test is configured to only use a single SQL gateway, so this is the one that always OOMs.

That leads to another issue we noticed. We actually created a LoadBalancer based service for our CockroachDB cluster, and benchmarked with --pgsql-host=$loadbalancer_ip instead. We expected the overhead to be distributed evenly, but we could always see one node's # leaseholders much larger than those of the rest nodes. The OOM still happened first on the node with the largest # leaseholders.

I saw sysbench segfault at least once when the CRDB gateway was OOM killed. I suspect that's what @coderplay saw above ("cockroach closed the connection").

Did you reproduce the sysbench segfault after setting SET CLUSTER SETTING kv.transaction.parallel_commits_enabled = false; ?

what are your intentions for this issue?

We might be able to workaround for this particular case. What worries me more is that Cockroach lacks a systematic mechanism to bound its memory usage (Go heap + Cgo + system). As a company-wide DBMS provider, we never know when a bad query from any one of our internal users will turn down our cluster.

Any updates on this?

Hey @coderplay, thanks again for filing the issue and sorry for the delay in response here. We're busy trying to wrap up work for our upcoming 22.1 release.

Improving CockroachDB's memory management is an important goal for us, and one that we're continually working on. We're in agreement with you that the behavior that you pointed out is problematic.

Just to respond to your last point here:

What worries me more is that Cockroach lacks a systematic mechanism to bound its memory usage (Go heap + Cgo + system).

We've put a lot of effort into CockroachDB's memory monitoring system over the years, and it's not true that we lack a systematic mechanism to bound memory usage. If you're interested to learn about how it works, have a look at this file and the places that it's used throughout the codebase: https://github.com/cockroachdb/cockroach/blob/master/pkg/util/mon/bytes_usage.go

The system allows components of the database to register the memory that they use with a hierarchy of monitors that begin at the SQL operator level and ladder up through queries, sessions, and ultimately to the limit that's governed by --max-sql-memory. This system is also how CockroachDB accomplishes disk spilling for queries that cannot fit into memory.

While simple, this system is powerful enough to prevent the database from running out of memory in most workloads that we test with. The main downside to the system is that it's cooperative: systems within the database must know to register memory, especially memory that's allocated proportionally to any user-input value.

As you've correctly pointed out, the database fails to manage memory properly under the workload that you've been testing with: a very highly-concurrent implementation of a bulk ingest of TPCC data. We've filed issues that will resolve this problem when closed.

However, I do want to point out that the case you're testing is somewhat extreme. Of course, we aspire to make CockroachDB work in every scenario, and the fact that it doesn't in this case is certainly a problem. But it's important to note that this somewhat extreme use case is pretty far from workloads that we see in the wild.

One last note: I want to re-express gratitude for your hard work in filing this detailed, helpful issue. We really appreciate it!

We have marked this issue as stale because it has been inactive for 18 months. If this issue is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 10 days to keep the issue queue tidy. Thank you for your contribution to CockroachDB!

cockroachdb / cockroach