cockroachdb / cockroach


kv: many large, concurrent INSERTs can overload intent resolution and lead to OOM #76282

Open coderplay opened 2 years ago

coderplay commented 2 years ago

Describe the problem

Please describe the issue you observed, and any steps we can take to reproduce it:

We found many issues while doing TPC-C data preparation with CockroachDB. This post focuses only on memory issues.

One of the CockroachDB pods was evicted by Kubernetes for exceeding the pod memory limit during data preparation, so the pod got restarted. We assumed this was caused by the Pod QoS class, because we had set the pod's memory request (16GB) far lower than its memory limit (26GB). When a node runs out of memory, Kubernetes is more likely to evict a Burstable pod. So we tried setting request=limit (26GB), but the container was still OOMKilled after running the data preparation for a while.

The Go heap dump showed that CockroachDB had allocated only 4~6GB (varies by run), plus the 26 GB * 25% = 6.5 GB we set aside for the RocksDB/Pebble cache. That still leaves around 13 GB of headroom before we should witness an OOM. We suspected that many unreachable garbage objects went undetected by heap profiling, so we tried lowering the GOGC parameter (100 by default) to make CockroachDB clean up garbage more aggressively, but it didn't help.
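For context, lowering GOGC only changes how much the heap may grow past the live set before a collection is triggered. A minimal sketch of the runtime equivalent (cockroach itself was simply started with the GOGC environment variable; the value 50 below is just an example):

package main

import (
	"fmt"
	"runtime/debug"
)

func main() {
	// Equivalent of starting the process with GOGC=50: a collection is
	// triggered once the heap grows 50% past the live set, instead of the
	// default 100%.
	prev := debug.SetGCPercent(50)
	fmt.Printf("GOGC lowered from %d to 50\n", prev)
}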

Another run showed that Go heap usage can reach 11GB, while the RocksDB cache part (more precisely, CGo native memory) can reach 15GB despite the 26 GB * 25% = 6.5 GB setting.

A similar issue was reported (*: node crash with OOM despite memory limits set) in 2017, but it was left unresolved.

To Reproduce

What did you do? Describe in your own words.

If possible, provide steps to reproduce the behavior:

  1. Start a 3-node cockroach cluster on top of Kubernetes
  2. Create the sbtest database (CREATE DATABASE sbtest;)
  3. Run Percona's sysbench-tpcc benchmark like below:
    kubectl --context ${clusterName} run tpcc -it --image=mzhou/sysbench-tpcc --restart=Never --rm /bin/bash
    # ./tpcc.lua --pgsql-user=root --pgsql-db=sbtest --time=7200 --threads=128 --report-interval=1 --tables=128 --scale=256 --db-driver=pgsql --pgsql-port=26257 --pgsql-host=cockroachdb --use-fk=0 prepare
  4. Monitor the memory usage from the Cockroach dashboard. The container should be killed within 10 minutes.

We chose sysbench-tpcc because we believe a third-party testing tool is more impartial.

Environment: Hardware

Software

Expected behavior

Jira issue: CRDB-13059

blathers-crl[bot] commented 2 years ago

Hello, I am Blathers. I am here to help you get the issue triaged.

Hoot - a bug! Though bugs are the bane of my existence, rest assured the wretched thing will get the best of care here.

I have CC'd a few people who may be able to assist you:

If we have not gotten back to your issue within a few business days, you can try the following:

:owl: Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

coderplay commented 2 years ago

This doc says that the memory usage of Cockroach can be divided into the following three parts.

The following is the memory usage dashboard of a crashed node. Right before the crash, the Go allocated part took 10.85GB (varies by run, within a 6~11GB range), which includes both live objects and uncollected garbage. The CGo part took 6.77GB. That still leaves a 24.55 - 10.85 - 6.77 = 6.93 GB gap taken up by overhead for this particular run.

Screen Shot 2022-02-09 at 10 57 41 AM

After reproducing the crash many times, the following clues were found in the oom-killer logs:

kernel: cockroach invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=-997
kernel: CPU: 10 PID: 4771 Comm: cockroach Not tainted 5.4.0-1067-azure #70~18.04.1-Ubuntu
kernel: Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS Hyper-V UEFI Release v4.1 10/27/2020
kernel: Call Trace:
kernel:  dump_stack+0x57/0x6d
kernel:  dump_header+0x4f/0x200
kernel:  oom_kill_process+0xe6/0x120
kernel:  out_of_memory+0x117/0x540
kernel:  mem_cgroup_out_of_memory+0xbb/0xd0
kernel:  try_charge+0x6bf/0x7c0
kernel:  ? __alloc_pages_nodemask+0x153/0x320
kernel:  mem_cgroup_try_charge+0x75/0x190
kernel:  mem_cgroup_try_charge_delay+0x22/0x50
kernel:  __handle_mm_fault+0x943/0x1330
kernel:  ? schedule+0x33/0xa0
kernel:  handle_mm_fault+0xb7/0x200
kernel:  __do_page_fault+0x29c/0x4c0
kernel:  ? __switch_to_asm+0x34/0x70
kernel:  do_page_fault+0x35/0x110
kernel:  page_fault+0x39/0x40
kernel: RIP: 0010:copy_user_enhanced_fast_string+0xe/0x30
kernel: Code: 89 d1 c1 e9 03 83 e2 07 f3 48 a5 89 d1 f3 a4 31 c0 0f 01 ca c3 0f 1f 80 00 00 00 00 0f 01 cb 83 fa 40 0f 82 70 ff ff ff 89 d1 <f3> a4 31 c0 0f 01 ca c3 66 2e 0f 1f 84 00 00 00 00 00 89 d1
kernel: RSP: 0018:ffff9ad704697c80 EFLAGS: 00050206
kernel: RAX: 00007f328f69a576 RBX: ffff8b157f19a000 RCX: 0000000000000576
kernel: RDX: 0000000000001000 RSI: ffff8b157f19aa8a RDI: 00007f328f69a000
kernel: RBP: ffff9ad704697c88 R08: ffff9ad704697e10 R09: 0000000000000001
kernel: R10: ffff9ad70e317b98 R11: 0000000000000003 R12: 0000000000001000
kernel: R13: ffff9ad704697e20 R14: 0000000000001000 R15: 000000000000aa96
kernel:  ? copyout+0x26/0x30
kernel:  copy_page_to_iter+0x179/0x310
kernel:  generic_file_read_iter+0x4bf/0xd30
kernel:  ? file_fdatawait_range+0x30/0x30
kernel:  ext4_file_read_iter+0x56/0xf0
kernel:  new_sync_read+0x122/0x1b0
kernel:  __vfs_read+0x29/0x40
kernel:  vfs_read+0x8e/0x130
kernel:  ksys_pread64+0x76/0x90
kernel:  __x64_sys_pread64+0x1e/0x20
kernel:  do_syscall_64+0x5e/0x200
kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
kernel: RIP: 0033:0x521caa
kernel: Code: e8 9b ec f9 ff 48 8b 7c 24 10 48 8b 74 24 18 48 8b 54 24 20 4c 8b 54 24 28 4c 8b 44 24 30 4c 8b 4c 24 38 48 8b 44 24 08 0f 05 <48> 3d 01 f0 ff ff 76 20 48 c7 44 24 40 ff ff ff ff 48 c7 44
kernel: RSP: 002b:000000c175088010 EFLAGS: 00000202 ORIG_RAX: 0000000000000011
kernel: RAX: ffffffffffffffda RBX: 000000c00008c000 RCX: 0000000000521caa
kernel: RDX: 0000000000077fca RSI: 00007f328f68eae0 RDI: 00000000000013a8
kernel: RBP: 000000c175088078 R08: 0000000000000000 R09: 0000000000000000
kernel: R10: 00000000007e556a R11: 0000000000000202 R12: 000000000000002c
kernel: R13: 000000000000002d R14: 000000000000005e R15: 00007f3290b98c1a
kernel: memory: usage 26214400kB, limit 26214400kB, failcnt 5967298
kernel: memory+swap: usage 0kB, limit 9007199254740988kB, failcnt 0
kernel: kmem: usage 164168kB, limit 9007199254740988kB, failcnt 0
kernel: Memory cgroup stats for /kubepods/pod88925984-2ff3-4d7a-8cf5-643d382a58b0:
kernel: anon 26668752896
        file 5009408
        kernel_stack 4939776
        slab 69935104
        sock 0
        shmem 0
        file_mapped 540672
        file_dirty 0
        file_writeback 0
        anon_thp 5022679040
        inactive_anon 0
        active_anon 26669301760
        inactive_file 3158016
        active_file 0
        unevictable 0
        slab_reclaimable 59940864
        slab_unreclaimable 9994240
        pgfault 251812214
        pgmajfault 0
        workingset_refault 3215933
        workingset_activate 524174
        workingset_nodereclaim 33
        pgrefill 2207339
        pgscan 22319161
        pgsteal 11889145
        pgactivate 139085
        pgdeactivate 2207173
        pglazyfree 0
        pglazyfreed 0
        thp_fault_alloc 127484
        thp_collapse_alloc 670
kernel: Tasks state (memory values in pages):
kernel: [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
kernel: [  13635] 65535 13635      241        1    28672        0          -998 pause
kernel: [  25226]     0 25226 12836589  6535336 94203904        0          -997 cockroach
kernel: oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=d99293475009fd398e2ea74fb6006aa41420356570599c59536a9c3832e29495,mems_allowed=0,oom_memcg=/kubepods/pod88925984-2ff3-4d7a-8cf5-643d382
kernel: Memory cgroup out of memory: Killed process 25226 (cockroach) total-vm:51346356kB, anon-rss:26028716kB, file-rss:112628kB, shmem-rss:0kB, UID:0 pgtables:91996kB oom_score_adj:-997
kernel: oom_reaper: reaped process 25226 (cockroach), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

According to the Linux kernel's cgroups documentation, when a page of the page cache is loaded into physical memory, it is charged against the memory usage of the cgroup that faulted it in. To verify the hypothesis that the page cache occupied the remaining memory, we ran the following experiment: a task on the cockroach pod periodically drops the page cache from the system. It turned out that the oom-killing was mitigated.
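The task was essentially the following loop (a minimal sketch; it assumes the container is privileged enough to write /proc/sys/vm/drop_caches, and the 30-second interval is arbitrary):

package main

import (
	"log"
	"os"
	"time"
)

// dropPageCache asks the kernel to drop clean page cache pages.
// Writing "1" to /proc/sys/vm/drop_caches requires root.
func dropPageCache() error {
	return os.WriteFile("/proc/sys/vm/drop_caches", []byte("1\n"), 0644)
}

func main() {
	for range time.Tick(30 * time.Second) {
		if err := dropPageCache(); err != nil {
			log.Printf("dropping page cache: %v", err)
		}
	}
}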

Page cache

To figure out which part of the cockroach logic loads so much page cache, we did some perf profiling. The results showed that pebble always loads sstable files through the page cache. In our particular use case, there was heavy page cache loading while pebble was doing compactions: cockroach read a lot of small sstables, merged them, then wrote them out as large sstables. The whole process does not use direct I/O. Also, cockroach has no way to enable use_direct_io_for_flush_and_compaction when using the RocksDB engine. Below is the stack trace of one page load.

ffffffff9401be21 ondemand_readahead+0x1 ([kernel.kallsyms])
ffffffff940122fe generic_file_read_iter+0x85e ([kernel.kallsyms])
ffffffff94186d36 ext4_file_read_iter+0x56 ([kernel.kallsyms])
ffffffff940c57e2 new_sync_read+0x122 ([kernel.kallsyms])
ffffffff940c8999 __vfs_read+0x29 ([kernel.kallsyms])
ffffffff940c8a3e vfs_read+0x8e ([kernel.kallsyms])
ffffffff940c8de6 ksys_pread64+0x76 ([kernel.kallsyms])
ffffffff940c8e1e __x64_sys_pread64+0x1e ([kernel.kallsyms])
ffffffff93e0272e do_syscall_64+0x5e ([kernel.kallsyms])
ffffffff94a0008c entry_SYSCALL_64_after_hwframe+0x44 ([kernel.kallsyms])
521caa           syscall.Syscall6+0x2a (/cockroach/cockroach)
54165f           internal/poll.(*FD).Pread+0xdf (/cockroach/cockroach)
54ad74           os.(*File).ReadAt+0x114 (/cockroach/cockroach)
169c852          github.com/cockroachdb/pebble/vfs.(*enospcFile).ReadAt+0x72 (/cockroach/cockroach)
169b6a8          github.com/cockroachdb/pebble/vfs.(*fdFileWrapper).ReadAt+0x68 (/cockroach/cockroach)
16eb2ab          github.com/cockroachdb/pebble/sstable.(*Reader).readBlock+0x1cb (/cockroach/cockroach)
16e4009          github.com/cockroachdb/pebble/sstable.(*singleLevelIterator).loadBlock+0x169 (/cockroach/cockroach)
16e7688          github.com/cockroachdb/pebble/sstable.(*compactionIterator).skipForward+0x148 (/cockroach/cockroach)
16e7497          github.com/cockroachdb/pebble/sstable.(*compactionIterator).Next+0x77 (/cockroach/cockroach)
1748f91          github.com/cockroachdb/pebble.(*levelIter).Next+0x1f1 (/cockroach/cockroach)
174cf3c          github.com/cockroachdb/pebble.(*mergingIter).nextEntry+0x7c (/cockroach/cockroach)
174f935          github.com/cockroachdb/pebble.(*mergingIter).Next+0x75 (/cockroach/cockroach)
171aa38          github.com/cockroachdb/pebble.(*compactionIter).iterNext+0x38 (/cockroach/cockroach)
171ab2f          github.com/cockroachdb/pebble.(*compactionIter).nextInStripe+0x2f (/cockroach/cockroach)
171af2e          github.com/cockroachdb/pebble.(*compactionIter).setNext+0x1ee (/cockroach/cockroach)
171a088          github.com/cockroachdb/pebble.(*compactionIter).Next+0x4a8 (/cockroach/cockroach)
1715196          github.com/cockroachdb/pebble.(*DB).runCompaction+0xb76 (/cockroach/cockroach)
1713d05          github.com/cockroachdb/pebble.(*DB).compact1+0x1c5 (/cockroach/cockroach)
176b392          github.com/cockroachdb/pebble.(*DB).compact.func1+0x92 (/cockroach/cockroach)
c4172f           runtime/pprof.Do+0xcf (/cockroach/cockroach)
1713b25          github.com/cockroachdb/pebble.(*DB).compact+0xa5 (/cockroach/cockroach)
4c5cc1           runtime.goexit+0x1 (/cockroach/cockroach)
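The entire read path above goes through the page cache. For comparison, bypassing it on Linux requires opening files with O_DIRECT and reading into block-aligned buffers. A minimal sketch of such a read (the file name is made up, and this is not something cockroach or pebble does today):

package main

import (
	"fmt"
	"os"
	"syscall"
	"unsafe"
)

// alignedBuf returns a buffer aligned to align bytes; O_DIRECT reads
// typically require alignment to the logical block size (often 512 or 4096).
func alignedBuf(size, align int) []byte {
	buf := make([]byte, size+align)
	off := align - int(uintptr(unsafe.Pointer(&buf[0]))%uintptr(align))
	if off == align {
		off = 0
	}
	return buf[off : off+size]
}

func main() {
	// Hypothetical sstable file name, for illustration only.
	f, err := os.OpenFile("000123.sst", os.O_RDONLY|syscall.O_DIRECT, 0)
	if err != nil {
		panic(err)
	}
	defer f.Close()

	buf := alignedBuf(1<<20, 4096)
	n, err := f.ReadAt(buf, 0) // bypasses the page cache entirely
	fmt.Println(n, err)
}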

Go allocated memory

Meanwhile, the memory usage of the Go allocated part is much larger than expected. The following CockroachDB data structures occupy the majority of the heap, according to heap dumps:

See this image for detailed information.

Triage:

Sysbench-tpcc inserts large bulks of data into the stock tables. A single INSERT can include up to 100K rows, or around 36MB. Loading this data requires around 4.6GB of RAM when all 128 threads are active. Cockroach consumes these bulk inserts by allocating a large amount of heap space rather than processing the rows one by one like a stream.
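A quick back-of-the-envelope check of that estimate (the ~360 bytes/row figure is simply 36MB divided by 100K rows, not a measured value):

package main

import "fmt"

func main() {
	const (
		rowsPerInsert = 100_000 // one sysbench-tpcc bulk INSERT into a stock table
		bytesPerRow   = 360     // rough encoded size of a stock row (derived, not measured)
		threads       = 128     // sysbench --threads=128
	)
	perStatement := rowsPerInsert * bytesPerRow // bytes held for a single statement
	allThreads := perStatement * threads        // if every thread is mid-INSERT
	fmt.Printf("per statement: %.1f MB\n", float64(perStatement)/1e6) // ~36 MB
	fmt.Printf("all threads:   %.1f GB\n", float64(allThreads)/1e9)   // ~4.6 GB
}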

Cgo allocated memory

Although Pebble added a cache implementation to limit the size of Cgo allocated memory, we still noticed cases where the Cgo allocated part exceeded the limit. Below are the Cgo memory stats generated by CockroachDB.

memstats.2022-02-06T01_06_37.010.26758971392.txt:  "CGoAllocatedBytes": 13427655920,
memstats.2022-02-06T01_06_26.963.26393677824.txt:  "CGoAllocatedBytes": 12980342080,
memstats.2022-02-05T08_37_43.929.27757682688.txt:  "CGoAllocatedBytes": 12520257920,
memstats.2022-02-06T01_06_16.968.25532248064.txt:  "CGoAllocatedBytes": 12285956608,
memstats.2022-02-05T08_37_33.941.27373121536.txt:  "CGoAllocatedBytes": 12251580896,

The jemalloc profile only gives us the leaf function that allocated the memory:

Total: 7443.8 MB  
7443.3 100.0% 100.0%   7443.3 100.0% _cgo_82be0c43bdb2_Cfunc_calloc
     0.5   0.0% 100.0%      0.5   0.0% _GLOBAL__sub_I_eh_alloc.cc
     0.0   0.0% 100.0%      0.5   0.0% 00007ffa5b84441d
     0.0   0.0% 100.0%      0.5   0.0% __libc_csu_init
     0.0   0.0% 100.0%      0.5   0.0% _start
     0.0   0.0% 100.0%   7443.3 100.0% runtime.asmcgocall

Perf profiling shows that the majority of the native memory allocations come from sstable reads:

100.00%   100.00%  (45858c0)
            |
            ---runtime.goexit
               |
               |--67.01%--github.com/cockroachdb/pebble.(*DB).compact
               |          runtime/pprof.Do
               |          github.com/cockroachdb/pebble.(*DB).compact.func1
               |          github.com/cockroachdb/pebble.(*DB).compact1
               |          github.com/cockroachdb/pebble.(*DB).runCompaction
               |          |
               |          |--66.05%--github.com/cockroachdb/pebble.(*compactionIter).Next
               |          |          |
               |          |          |--63.48%--github.com/cockroachdb/pebble.(*compactionIter).setNext
               |          |          |          github.com/cockroachdb/pebble.(*compactionIter).nextInStripe
               |          |          |          github.com/cockroachdb/pebble.(*compactionIter).iterNext
               |          |          |          github.com/cockroachdb/pebble.(*mergingIter).Next
               |          |          |          github.com/cockroachdb/pebble.(*mergingIter).nextEntry
               |          |          |          |
               |          |          |          |--39.49%--github.com/cockroachdb/pebble.(*levelIter).Next
               |          |          |          |          |
               |          |          |          |           --39.31%--github.com/cockroachdb/pebble/sstable.(*compactionIterator).Next
               |          |          |          |                     github.com/cockroachdb/pebble/sstable.(*compactionIterator).skipForward
               |          |          |          |                     github.com/cockroachdb/pebble/sstable.(*singleLevelIterator).loadBlock
               |          |          |          |                     github.com/cockroachdb/pebble/sstable.(*Reader).readBlock
               |          |          |          |                     |
               |          |          |          |                     |--17.94%--github.com/cockroachdb/pebble/sstable.decompressBlock
               |          |          |          |                     |          github.com/cockroachdb/pebble/internal/cache.newValue
               |          |          |          |                     |          github.com/cockroachdb/pebble/internal/manual.New
               |          |          |          |                     |          github.com/cockroachdb/pebble/internal/manual._Cfunc_calloc
               |          |          |          |                     |          _cgo_82be0c43bdb2_Cfunc_calloc
               |          |          |          |                     |
               |          |          |          |                     |--17.94%--github.com/cockroachdb/pebble/internal/cache.newValue
               |          |          |          |                     |          github.com/cockroachdb/pebble/internal/manual.New
               |          |          |          |                     |          github.com/cockroachdb/pebble/internal/manual._Cfunc_calloc
               |          |          |          |                     |          _cgo_82be0c43bdb2_Cfunc_calloc
               |          |          |          |                     |
               |          |          |          |                      --3.42%--github.com/cockroachdb/pebble/internal/cache.(*Cache).Set
               |          |          |          |                                github.com/cockroachdb/pebble/internal/cache.(*shard).Set
               |          |          |          |                                github.com/cockroachdb/pebble/internal/cache.newEntry
               |          |          |          |                                github.com/cockroachdb/pebble/internal/cache.entryAllocNew
               |          |          |          |                                github.com/cockroachdb/pebble/internal/cache.(*entryAllocCache).alloc
               |          |          |          |                                github.com/cockroachdb/pebble/internal/manual.New
               |          |          |          |                                github.com/cockroachdb/pebble/internal/manual._Cfunc_calloc
               |          |          |          |                                _cgo_82be0c43bdb2_Cfunc_calloc
               |          |          |          |
               |          |          |           --23.98%--github.com/cockroachdb/pebble/sstable.(*compactionIterator).Next
               |          |          |                     github.com/cockroachdb/pebble/sstable.(*compactionIterator).skipForward
               |          |          |                     github.com/cockroachdb/pebble/sstable.(*singleLevelIterator).loadBlock
               |          |          |                     github.com/cockroachdb/pebble/sstable.(*Reader).readBlock
               |          |          |                     |
               |          |          |                     |--11.31%--github.com/cockroachdb/pebble/sstable.decompressBlock
               |          |          |                     |          github.com/cockroachdb/pebble/internal/cache.newValue
               |          |          |                     |          github.com/cockroachdb/pebble/internal/manual.New
               |          |          |                     |          github.com/cockroachdb/pebble/internal/manual._Cfunc_calloc
               |          |          |                     |          _cgo_82be0c43bdb2_Cfunc_calloc
               |          |          |                     |
               |          |          |                     |--11.31%--github.com/cockroachdb/pebble/internal/cache.newValue
               |          |          |                     |          github.com/cockroachdb/pebble/internal/manual.New
               |          |          |                     |          github.com/cockroachdb/pebble/internal/manual._Cfunc_calloc
               |          |          |                     |          _cgo_82be0c43bdb2_Cfunc_calloc
               |          |          |                     |
               |          |          |                      --1.36%--github.com/cockroachdb/pebble/internal/cache.(*Cache).Set
               |          |          |                                github.com/cockroachdb/pebble/internal/cache.(*shard).Set
               |          |          |                                github.com/cockroachdb/pebble/internal/cache.newEntry
               |          |          |                                github.com/cockroachdb/pebble/internal/cache.entryAllocNew
               |          |          |                                github.com/cockroachdb/pebble/internal/cache.(*entryAllocCache).alloc
               |          |          |                                github.com/cockroachdb/pebble/internal/manual.New
               |          |          |                                github.com/cockroachdb/pebble/internal/manual._Cfunc_calloc
               |          |          |                                _cgo_82be0c43bdb2_Cfunc_calloc
        |
               |--24.29%--github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx.func2
               |          |
               |          |--19.29%--github.com/cockroachdb/cockroach/pkg/internal/client/requestbatcher.(*RequestBatcher).sendBatch.func1
               |          |          |
               |          |          |--17.42%--github.com/cockroachdb/cockroach/pkg/internal/client/requestbatcher.(*RequestBatcher).sendBatch.func1.1
               |          |          |          github.com/cockroachdb/cockroach/pkg/kv.(*CrossRangeTxnWrapperSender).Send
               |          |          |          github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*DistSender).Send
               |          |          |          github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*DistSender).divideAndSendBatchToRanges
               |          |          |          github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*DistSender).sendPartialBatch
               |          |          |          github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*DistSender).sendToReplicas
               |          |          |          github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*grpcTransport).SendNext
               |          |          |          github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*grpcTransport).sendBatch
               |          |          |          github.com/cockroachdb/cockroach/pkg/rpc.(*internalClientAdapter).Batch
               |          |          |          github.com/cockroachdb/cockroach/pkg/server.(*Node).Batch
               |          |          |          github.com/cockroachdb/cockroach/pkg/server.(*Node).batchInternal
               |          |          |          github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunTaskWithErr
               |          |          |          github.com/cockroachdb/cockroach/pkg/server.(*Node).batchInternal.func1
               |          |          |          github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Stores).Send
               |          |          |          github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).Send
               |          |          |          github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).sendWithRangeID
               |          |          |          github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).executeBatchWithConcurrencyRetries
               |          |          |          github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).executeWriteBatch
               |          |          |          github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).evalAndPropose
               |          |          |          github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).requestToProposal
               |          |          |          github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).evaluateProposal
               |          |          |          github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).evaluateWriteBatch
               |          |          |          github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).evaluateWriteBatchWithServersideRefreshes
               |          |          |          github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).evaluateWriteBatchWrapper
               |          |          |          github.com/cockroachdb/cockroach/pkg/kv/kvserver.evaluateBatch
               |          |          |          github.com/cockroachdb/cockroach/pkg/kv/kvserver.evaluateCommand
               |          |          |          github.com/cockroachdb/cockroach/pkg/kv/kvserver/batcheval.ResolveIntent
               |          |          |          github.com/cockroachdb/cockroach/pkg/storage.MVCCResolveWriteIntent
               |          |          |          |
               |          |          |           --17.41%--github.com/cockroachdb/cockroach/pkg/storage.(*intentInterleavingIter).SeekIntentGE
               |          |          |                     |
               |          |          |                     |--9.71%--github.com/cockroachdb/cockroach/pkg/storage.(*pebbleIterator).SeekEngineKeyGEWithLimit
               |          |          |                     |          github.com/cockroachdb/pebble.(*Iterator).SeekPrefixGE
               |          |          |                     |          github.com/cockroachdb/pebble.(*mergingIter).SeekPrefixGE
               |          |          |                     |          github.com/cockroachdb/pebble.(*mergingIter).seekGE
               |          |          |                     |          github.com/cockroachdb/pebble.(*levelIter).SeekPrefixGE
               |          |          |                     |          |
               |          |          |                     |           --9.57%--github.com/cockroachdb/pebble/sstable.(*singleLevelIterator).SeekPrefixGE
               |          |          |                     |                     github.com/cockroachdb/pebble/sstable.(*singleLevelIterator).seekPrefixGE

Summary

ajwerner commented 2 years ago

The thrust of this issue is absolutely correct: cockroach should not OOM and should manage its heap usage well. We've been fighting that fight for a long time (https://github.com/cockroachdb/cockroach/issues/10320) and will continue to fight it and improve our memory accounting. The heap profile you're providing here is very valuable. The Go heap usage you see there related to intent resolution is a good find, and the reproduction steps are much appreciated.

Clearly there's an issue here related to memory usage underneath the asynchronous intent resolver. Not to excuse the OOM (it should never happen), but I suspect that sysbench-tpcc may be running into some pathological behavior related to contention and concurrency control. Analyzing that workload on its own is likely to yield some interesting discoveries. All that said, there are definitely things to do here, both to bound the memory usage in the intent resolver and to account for it in our memory monitoring framework.

Regarding the performance implications of thrashing the page cache, that's out of my area of expertise. It seems that there was some exploration at some point (https://github.com/cockroachdb/cockroach/issues/13620).

ajwerner commented 2 years ago

The huge amount of memory allocated due to LockAsLockUpdates is startling. We generally don't expect Transaction messages to be huge. One immediate thing I'm seeing is that we are not properly configuring the IntentResolver to provide backpressure based on memory usage. Given the scale of the heap usage we're seeing here, I have a feeling that fixing that alone might head off this OOM.

Here we configure the IntentResolver: https://github.com/cockroachdb/cockroach/blob/7b7da490b9763e08e3f8f915f601ce69847f4549/pkg/kv/kvserver/intentresolver/intent_resolver.go#L228-L236

However, we neglect to set MaxSizePerBatch and we leave the InflightBackpressureLimit at its default, meaning we can have 1000 arbitrarily large batches in flight. We happen to know from the heap profile images that some of these requests are potentially individually very large.

https://github.com/cockroachdb/cockroach/blob/7b7da490b9763e08e3f8f915f601ce69847f4549/pkg/internal/client/requestbatcher/batcher.go#L107-L109

https://github.com/cockroachdb/cockroach/blob/7b7da490b9763e08e3f8f915f601ce69847f4549/pkg/internal/client/requestbatcher/batcher.go#L148-L151

Ideally we'd do some proper memory monitoring which could dynamically tune the concurrency and track the usage. We're in only the earliest days of that sort of heap management. Most of our higher level heap management revolves around failing queries which exceed the available budget and bounding the working set size for everything else. Here we fail to do either.
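To make that concrete, here is a minimal sketch (illustrative only, not CockroachDB code) of bounding in-flight work by total bytes rather than only by batch count, using a weighted semaphore. A sender that blocks once the byte budget is exhausted provides exactly the kind of backpressure that a count-only limit over arbitrarily large batches cannot:

package main

import (
	"context"
	"fmt"

	"golang.org/x/sync/semaphore"
)

// byteLimitedSender caps the total size of batches that may be in flight at
// once. Callers block in send until earlier batches finish.
type byteLimitedSender struct {
	budget   int64
	inFlight *semaphore.Weighted
}

func newByteLimitedSender(budgetBytes int64) *byteLimitedSender {
	return &byteLimitedSender{budget: budgetBytes, inFlight: semaphore.NewWeighted(budgetBytes)}
}

// send dispatches the batch asynchronously once its size fits in the budget.
func (s *byteLimitedSender) send(ctx context.Context, batch []byte, do func([]byte) error) error {
	size := int64(len(batch))
	if err := s.inFlight.Acquire(ctx, size); err != nil {
		return err // canceled while waiting for budget
	}
	go func() {
		defer s.inFlight.Release(size)
		_ = do(batch)
	}()
	return nil
}

// drain waits for all in-flight batches by acquiring the entire budget.
func (s *byteLimitedSender) drain(ctx context.Context) error {
	if err := s.inFlight.Acquire(ctx, s.budget); err != nil {
		return err
	}
	s.inFlight.Release(s.budget)
	return nil
}

func main() {
	s := newByteLimitedSender(64 << 20) // 64 MB budget, purely illustrative
	_ = s.send(context.Background(), make([]byte, 4<<20), func(b []byte) error {
		fmt.Printf("resolved a %d-byte batch\n", len(b))
		return nil
	})
	_ = s.drain(context.Background())
}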

I suspect we can come up with something for backport to avoid the more pathological cases we're seeing here. I want to more deeply understand what it means for the allocated block sizes in that heap profile to be so big.

joshimhoff commented 2 years ago

Thank you for opening this great issue, @coderplay!

ajwerner commented 2 years ago

Okay, we've taken a slightly deeper look. We see this is the bulk loading phase of the workload, where there are 100,000 rows per INSERT batch.

Can I ask you to run a quick experiment if it's not too much hassle: Try setting SET CLUSTER SETTING kv.transaction.parallel_commits_enabled = false;. If the problem goes away, great, we have a clear view of where our heuristics on running asynchronous consensus are failing us. If not, that's a wonderful data point for us to dig deeper.

Currently we're thinking there's a confluence of things here whereby optimizations to go fully async in replication for certain operations remained in effect when they shouldn't. Given that decision, the system commits to doing a lot of asynchronous work without providing any backpressure to the client. It's certainly not the intention of the code to be hitting these asynchronous code paths for bulk writing operations.

The detail in this issue is great. Thanks especially for the heap profiles and workload details. They made the process of unpacking this issue much easier.

nvanbenschoten commented 2 years ago

+1 to what Andrew said. It would be useful to check whether SET CLUSTER SETTING kv.transaction.parallel_commits_enabled = false; helps avoid the OOM. We suspect that there is an asynchronous code path that is not sufficiently pushing back on the client when async work (async intent resolution) builds.

I've taken a look at the workload in an attempt to reproduce myself. It's worth pointing out what makes this workload unusual. sysbench-tpcc does two moderately unusual things. First, it imports entire million-row tables for a TPC-C dataset with 256 warehouses (--scale=256) in single INSERT statements. Second, it imports 128 copies of this dataset (--tables=128) back-to-back. So it's rapidly performing many bulk insert transactions. This is a good stress test because of the mix of the scale and concurrency of these transactions.

nvanbenschoten commented 2 years ago

We suspect that there is an asynchronous code path that is not sufficiently pushing back on the client when async work (async intent resolution) builds.

Specifically, we suspect that the ingestion is running into issues because of: https://github.com/cockroachdb/cockroach/blob/3ac48dbccf121bdf7921de97d8b9619ba7693109/pkg/kv/kvclient/kvcoord/txn_interceptor_committer.go#L472-L474

Here's some evidence that this is the case. When running the test on my local machine, I see a constant 129 open SQL connections:

Screen Shot 2022-02-10 at 6 17 20 PM

However, we see a gradual increase in goroutines as the ingestion runs, indicating a buildup of async work:

Screen Shot 2022-02-10 at 6 18 28 PM

A goroutine dump shows the following:

Screen Shot 2022-02-10 at 6 19 22 PM

We see about 1000 (intentresolver.defaultTaskLimit) goroutines in the asynchronous path of IntentResolver.CleanupTxnIntentsAsync, with another ~900 in the synchronous path. This synchronous path is meant to push back on the client. However, it is run from the asynchronous path (currently the only path) of txnCommitter.makeTxnCommitExplicitAsync, so there's no pushback and the async cleanup work can't keep up with the ingestion load.

Clearly, something needs to give here. The first step would be to address the TODO and limit the amount of async work we can run concurrently from makeTxnCommitExplicitAsync. Beyond that, the current concurrency limit in the IntentResolver feels lacking. It doesn't directly account for the cost of cleaning up a transaction, beyond the fact that a more expensive cleanup holds the semaphore for longer. That is not sufficient to bound memory usage in the IntentResolver.
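For illustration, the kind of pushback we're describing looks roughly like this (a sketch, not the actual IntentResolver code): run tasks asynchronously up to a limit, and beyond that run them on the caller's goroutine, so the client slows down instead of work piling up unboundedly:

package main

import (
	"fmt"
	"sync"
)

// asyncLimiter runs tasks on new goroutines until limit tasks are in flight;
// beyond that it runs them synchronously, which backpressures the caller.
type asyncLimiter struct {
	sem chan struct{}
	wg  sync.WaitGroup
}

func newAsyncLimiter(limit int) *asyncLimiter {
	return &asyncLimiter{sem: make(chan struct{}, limit)}
}

func (l *asyncLimiter) run(task func()) {
	select {
	case l.sem <- struct{}{}: // slot available: run asynchronously
		l.wg.Add(1)
		go func() {
			defer func() { <-l.sem; l.wg.Done() }()
			task()
		}()
	default: // limit reached: run on the caller's goroutine
		task()
	}
}

func (l *asyncLimiter) wait() { l.wg.Wait() }

func main() {
	l := newAsyncLimiter(1000) // cf. intentresolver.defaultTaskLimit
	for i := 0; i < 5; i++ {
		i := i
		l.run(func() { fmt.Println("cleanup task", i) })
	}
	l.wait()
}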

coderplay commented 2 years ago

@nvanbenschoten @ajwerner sysbench client crashed after setting SET CLUSTER SETTING kv.transaction.parallel_commits_enabled = false;

I guess you didn't see the OOM because the client side was down.

nvanbenschoten commented 2 years ago

sysbench client crashed after setting SET CLUSTER SETTING kv.transaction.parallel_commits_enabled = false;

Why did the client crash?

coderplay commented 2 years ago

sysbench client crashed after setting SET CLUSTER SETTING kv.transaction.parallel_commits_enabled = false;

Why did the client crash?

Still figuring it out; sysbench hit a fatal error from cockroachdb:

(gdb) bt
#0  __strlen_avx2 () at ../sysdeps/x86_64/multiarch/strlen-avx2.S:65
#1  0x00007f750b7b0e3f in __GI___strdup (s=0x0) at strdup.c:41
#2  0x000055ae2183d418 in pgsql_check_status (con=con@entry=0x7f7358000dd0, pgres=pgres@entry=0x7f735800a690, funcname=funcname@entry=0x55ae21847bca "PQexec",
    query=query@entry=0x7f73580169d0 "INSERT INTO stock1 (s_i_id, s_w_id, s_quantity, s_dist_01, s_dist_02, s_dist_03, s_dist_04, s_dist_05, s_dist_06,  s_dist_07, s_dist_08, s_dist_09, s_dist_10, s_ytd, s_order_cnt, s_remote_cnt, s_data)"..., rs=rs@entry=0x7f7358000df8) at drv_pgsql.c:530
#3  0x000055ae2183d56e in pgsql_drv_query (sb_conn=0x7f7358000dd0,
    query=0x7f73580169d0 "INSERT INTO stock1 (s_i_id, s_w_id, s_quantity, s_dist_01, s_dist_02, s_dist_03, s_dist_04, s_dist_05, s_dist_06,  s_dist_07, s_dist_08, s_dist_09, s_dist_10, s_ytd, s_order_cnt, s_remote_cnt, s_data)"..., len=<optimized out>, rs=0x7f7358000df8) at drv_pgsql.c:690
#4  0x000055ae2182f39f in db_query (con=con@entry=0x7f7358000dd0, query=<optimized out>, len=<optimized out>) at db_driver.c:587
#5  0x000055ae2182f487 in db_bulk_do_insert (con=0x7f7358000dd0, is_last=0) at db_driver.c:940
#6  0x000055ae2182fae3 in db_bulk_insert_next (con=0x7f7358000dd0,
    query=0x45bcb5d0 "(62776, 119, 46,'", 'e' <repeats 24 times>, "','", 'h' <repeats 24 times>, "','", 'x' <repeats 24 times>, "','", 'l' <repeats 24 times>, "','", 'r' <repeats 24 times>, "','", 'n' <repeats 24 times>, "','", 'f' <repeats 21 times>..., query_len=333) at db_driver.c:915
#7  0x00007f750ba6158f in lj_vm_ffi_call () at buildvm_x86.dasc:2576
#8  0x00007f750ba9f16f in lj_ccall_func (L=<optimized out>, cd=<optimized out>) at lj_ccall.c:1314
#9  0x00007f750baa16ad in lj_cf_ffi_meta___call (L=0x41488378) at lib_ffi.c:230
#10 0x00007f750ba5f5e7 in lj_BC_FUNCC () at buildvm_x86.dasc:809
#11 0x00007f750bab378f in lua_pcall (L=L@entry=0x41488378, nargs=nargs@entry=1, nresults=nresults@entry=1, errfunc=errfunc@entry=0) at lj_api.c:1158
#12 0x000055ae21834a2b in call_custom_command (L=0x41488378) at sb_lua.c:1495
#13 call_custom_command (L=0x41488378) at sb_lua.c:1475
#14 0x000055ae21834ab8 in cmd_worker_thread (arg=<optimized out>) at sb_lua.c:1527
#15 0x00007f750b8f2ea7 in start_thread (arg=<optimized out>) at pthread_create.c:477
#16 0x00007f750b820def in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

coderplay commented 2 years ago

(gdb) print *pgres
$1 = {ntups = 0, numAttributes = 0, attDescs = 0x0, tuples = 0x0, tupArrSize = 0, numParameters = 0, paramDescs = 0x0, resultStatus = PGRES_FATAL_ERROR,
  cmdStatus = "\000NSERT 0 1567\000\000Xs\177\000\000\a\000\000\000l_i_\200\245\000Xs\177\000\000\a\000\000\000s\177\000\000\000Z\000Xs\177\000\000\a\000\000\000e\000po\360\234\000X", binary = 0, noticeHooks = {noticeRec = 0x7f750bae6510 <defaultNoticeReceiver>, noticeRecArg = 0x0,
    noticeProc = 0x55ae2183d1f0 <empty_notice_processor>, noticeProcArg = 0x0}, events = 0x0, nEvents = 0, client_encoding = 6,
  errMsg = 0x7f73580068b8 "server closed the connection unexpectedly\n\tThis probably means the server terminated abnormally\n\tbefore or while processing the request.\n", errFields = 0x0, errQuery = 0x0, null_field = "", curBlock = 0x7f73580068b0, curOffset = 146, spaceLeft = 1902, memorySize = 2264}

Looks like cockroach closed the connection.

ajwerner commented 2 years ago

This is all surprising and smells of some sort of client misconfiguration. You can set that cluster setting independently of the test; cluster settings are global and shared. Consider setting it once on your cluster and then running the unmodified workload.

ajwerner commented 2 years ago

Or, perhaps you're saying you did that. In which case, is there anything interesting going on in the cockroach logs?

coderplay commented 2 years ago

Or, perhaps you're saying you did that. In which case, is there anything interesting going on in the cockroach logs?

Yep, I set the cluster setting ahead of time in another session. There are so many warnings and errors in the logs that I got lost :)

nvanbenschoten commented 2 years ago

I was able to reproduce the issue on similar hardware with 128 client threads. I then dropped the client concurrency down to 32 threads and still saw the overload + OOMs. During the tests, I saw a few interesting results:

I then switched from testing on v21.2.5 to testing on master. On master, things behaved better and I did not see an OOM with 32 client threads. This appears to be due to admission control, which is disabled by default on v21.2 but enabled by default on master (and in the upcoming v22.1). As the number of async tasks grows, admission control detects the overload through Go scheduler run queue lengths ("Runnable Goroutines per CPU") and begins pushing back on foreground traffic.

On v21.2.5, admission control can be enabled with the following three statements:

SET CLUSTER SETTING admission.kv.enabled = true;
SET CLUSTER SETTING admission.sql_kv_response.enabled = true;
SET CLUSTER SETTING admission.sql_sql_response.enabled = true;

However, when I jumped back up to 128 client threads, I did eventually still see an OOM. So while admission control helps, it's not a full replacement for individual components protecting themselves by enforcing memory limits.

Here are some takeaways for how we can improve the situation:

I'll pull these into separate issues.

@coderplay what are your intentions for this issue? You've demonstrated a data ingestion workload that currently overloads CRDB, which we can work to resolve, but I want to make sure you're not stuck waiting for that. If this is blocking you, you can drop the client concurrency during sysbench's prepare phase (e.g. --threads=16). It would also help during the prepare and run phases to spread load across all three CRDB nodes' SQL gateways by sticking HAProxy between sysbench and Cockroach.

coderplay commented 2 years ago

@nvanbenschoten

The test is configured to only use a single SQL gateway, so this is the one that always OOMs.

That leads to another issue we noticed. We actually created a LoadBalancer-based service for our CockroachDB cluster and benchmarked with --pgsql-host=$loadbalancer_ip instead. We expected the load to be distributed evenly, but we always saw one node with a much larger leaseholder count than the rest. The OOM still happened first on the node with the most leaseholders.

I saw sysbench segfault at least once when the CRDB gateway was OOM killed. I suspect that's what @coderplay saw above ("cockroach closed the connection").

Did you reproduce the sysbench segfault after setting SET CLUSTER SETTING kv.transaction.parallel_commits_enabled = false; ?

what are your intentions for this issue?

We might be able to work around this particular case. What worries me more is that Cockroach lacks a systematic mechanism to bound its memory usage (Go heap + Cgo + system). As a company-wide DBMS provider, we never know when a bad query from one of our internal users will take down our cluster.

coderplay commented 2 years ago

Any updates on this?

jordanlewis commented 2 years ago

Hey @coderplay, thanks again for filing the issue and sorry for the delay in response here. We're busy trying to wrap up work for our upcoming 22.1 release.

Improving CockroachDB's memory management is an important goal for us, and one that we're continually working on. We're in agreement with you that the behavior that you pointed out is problematic.

Just to respond to your last point here:

What worries me more is that Cockroach lacks a systematic mechanism to bound its memory usage (Go heap + Cgo + system).

We've put a lot of effort into CockroachDB's memory monitoring system over the years, and it's not true that we lack a systematic mechanism to bound memory usage. If you're interested in learning how it works, have a look at this file and the places it's used throughout the codebase: https://github.com/cockroachdb/cockroach/blob/master/pkg/util/mon/bytes_usage.go

The system allows components of the database to register the memory that they use with a hierarchy of monitors that begin at the SQL operator level and ladder up through queries, sessions, and ultimately to the limit that's governed by --max-sql-memory. This system is also how CockroachDB accomplishes disk spilling for queries that cannot fit into memory.

While simple, this system is powerful enough to prevent the database from running out of memory in most workloads that we test with. The main downside to the system is that it's cooperative: systems within the database must know to register memory, especially memory that's allocated proportionally to any user-input value.
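To give a feel for how the hierarchy composes, here is a toy model (not the actual mon package API): each component reserves bytes from its parent, and a reservation that would exceed the root budget fails, which is what lets a query return an error instead of the process OOMing.

package main

import (
	"errors"
	"fmt"
)

// monitor is a toy hierarchical byte accountant: children draw from their
// parent, and the root enforces the overall budget (conceptually,
// --max-sql-memory).
type monitor struct {
	name     string
	parent   *monitor
	limit    int64
	reserved int64
}

func newRoot(name string, limit int64) *monitor {
	return &monitor{name: name, limit: limit}
}

func (m *monitor) child(name string) *monitor {
	return &monitor{name: name, parent: m, limit: m.limit}
}

// grow reserves n bytes, failing if any level of the hierarchy would exceed
// its limit. Components must call this cooperatively for accounting to work.
func (m *monitor) grow(n int64) error {
	if m.reserved+n > m.limit {
		return errors.New(m.name + ": memory budget exceeded")
	}
	if m.parent != nil {
		if err := m.parent.grow(n); err != nil {
			return err
		}
	}
	m.reserved += n
	return nil
}

func (m *monitor) shrink(n int64) {
	m.reserved -= n
	if m.parent != nil {
		m.parent.shrink(n)
	}
}

func main() {
	root := newRoot("sql", 1<<30) // 1 GiB, a stand-in for --max-sql-memory
	session := root.child("session")
	query := session.child("query")

	fmt.Println(query.grow(512 << 20)) // first reservation fits: <nil>
	fmt.Println(query.grow(768 << 20)) // exceeds the budget: the query fails instead of the node OOMing
	query.shrink(512 << 20)
}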

As you've correctly pointed out, the database fails to manage memory properly under the workload that you've been testing with: a very highly-concurrent implementation of a bulk ingest of TPCC data. We've filed issues that will resolve this problem when closed.

However, I do want to point out that the case you're testing is somewhat extreme. Of course, we aspire to make CockroachDB work in every scenario, and the fact that it doesn't in this case is certainly a problem. But it's important to note that this somewhat extreme use case is pretty far from workloads that we see in the wild.

jordanlewis commented 2 years ago

One last note: I want to re-express gratitude for your hard work in filing this detailed, helpful issue. We really appreciate it!

github-actions[bot] commented 1 year ago

We have marked this issue as stale because it has been inactive for 18 months. If this issue is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 10 days to keep the issue queue tidy. Thank you for your contribution to CockroachDB!