cluster crashing from OOM

dianasaur323 commented 7 years ago

Copying in issue encountered on forum. Response and triage needed.

Below are the steps to reproduce crashing the cluster 100% of the time with oom 5 Machines, m4.2xlarge (31GB mem each)

startup options: cockroach start --store=path=/data,size=256GB,attrs=ssd --log-dir /data/log --insecure

Machines run at ~2GB ram utilization during inserts. Ram usage explodes when running last example which is combination of selects and inserts, crashed most or all nodes. Irrecoverable, restarting the failed node crashes with OOM again.

Other minor details:

We are using dns roundrobin to hit all 5 nodes Version we are using:

root@crdb1:/home/ubuntu# cockroach version
Build Tag: beta-20170309
Build Time: 2017/03/09 16:31:10
Distribution: CCL
Platform: linux amd64
Go Version: go1.8
C Compiler: gcc 6.3.0
Build SHA-1: dd3a574da6734faebf523d7cc2df244ab662527d
Build Type: release

==========begin oom message=============

[102694.044518] cockroach invoked oom-killer: gfp_mask=0x24280ca, order=0, oom_score_adj=0
[102694.044521] cockroach cpuset=/ mems_allowed=0
[102694.044527] CPU: 5 PID: 2800 Comm: cockroach Not tainted 4.4.0-66-generic #87-Ubuntu
[102694.044528] Hardware name: Xen HVM domU, BIOS 4.2.amazon 11/11/2016
[102694.044529] 0000000000000286 0000000007bb746f ffff8807e9c5baf0 ffffffff813f8083
[102694.044532] ffff8807e9c5bca8 ffff8807e68a4600 ffff8807e9c5bb60 ffffffff8120b0fe
[102694.044534] ffffffff81cd63bf 0000000000000000 ffffffff81e677e0 0000000000000206
[102694.044536] Call Trace:
[102694.044542] [] dump_stack+0x63/0x90
[102694.044546] [] dump_header+0x5a/0x1c5
[102694.044550] [] oom_kill_process+0x202/0x3c0
[102694.044551] [] out_of_memory+0x219/0x460
[102694.044554] [] _allocpages_slowpath.constprop.88+0x938/0xad0
[102694.044556] [] _allocpages_nodemask+0x286/0x2a0
[102694.044560] [] alloc_pages_vma+0xbe/0x240
[102694.044565] [] handle_mm_fault+0x1491/0x1820
[102694.044570] [] _dopage_fault+0x197/0x400
[102694.044571] [] do_page_fault+0x22/0x30
[102694.044575] [] page_fault+0x28/0x30
[102694.044576] Mem-Info:
[102694.044580] active_anon:8143668 inactive_anon:2096 isolated_anon:0
active_file:0 inactive_file:0 isolated_file:1
unevictable:914 dirty:35 writeback:0 unstable:0
slab_reclaimable:5159 slab_unreclaimable:7044
mapped:1726 shmem:2220 pagetables:16683 bounce:0
free:38025 free_pcp:236 free_cma:0
[102694.044583] Node 0 DMA free:15904kB min:8kB low:8kB high:12kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
[102694.044587] lowmem_reserve[]: 0 3745 32158 32158 32158
[102694.044590] Node 0 DMA32 free:116276kB min:2668kB low:3332kB high:4000kB active_anon:3700980kB inactive_anon:20kB active_file:12kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:3915776kB managed:3835112kB mlocked:0kB dirty:12kB writeback:0kB mapped:232kB shmem:84kB slab_reclaimable:2068kB slab_unreclaimable:2020kB kernel_stack:432kB pagetables:9712kB unstable:0kB bounce:0kB free_pcp:372kB local_pcp:120kB free_cma:0kB writeback_tmp:0kB pages_scanned:960 all_unreclaimable? yes
[102694.044594] lowmem_reserve[]: 0 0 28413 28413 28413
[102694.044596] Node 0 Normal free:19920kB min:20244kB low:25304kB high:30364kB active_anon:28873692kB inactive_anon:8364kB active_file:0kB inactive_file:0kB unevictable:3656kB isolated(anon):0kB isolated(file):4kB present:29753344kB managed:29095280kB mlocked:3656kB dirty:128kB writeback:0kB mapped:6672kB shmem:8796kB slab_reclaimable:18568kB slab_unreclaimable:26156kB kernel_stack:3824kB pagetables:57020kB unstable:0kB bounce:0kB free_pcp:572kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:5100 all_unreclaimable? yes
[102694.044600] lowmem_reserve[]: 0 0 0 0 0
[102694.044603] Node 0 DMA: 0*4kB 0*8kB 0*16kB 1*32kB (U) 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15904kB
[102694.044612] Node 0 DMA32: 66*4kB (UME) 61*8kB (UME) 363*16kB (UME) 525*32kB (UME) 85*64kB (UME) 19*128kB (UME) 10*256kB (UME) 1*512kB (E) 4*1024kB (UME) 2*2048kB (UE) 18*4096kB (M) = 116224kB
[102694.044621] Node 0 Normal: 344*4kB (UME) 213*8kB (UME) 214*16kB (UME) 139*32kB (UME) 82*64kB (UME) 32*128kB (UE) 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 20296kB
[102694.044630] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[102694.044631] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[102694.044632] 2775 total pagecache pages
[102694.044634] 0 pages in swap cache
[102694.044635] Swap cache stats: add 0, delete 0, find 0/0
[102694.044636] Free swap = 0kB
[102694.044636] Total swap = 0kB
[102694.044637] 8421277 pages RAM
[102694.044638] 0 pages HighMem/MovableOnly
[102694.044639] 184703 pages reserved
[102694.044640] 0 pages cma reserved
[102694.044640] 0 pages hwpoisoned
[102694.044641] [ pid ] uid tgid total_vm rss nr_ptes nr_pmds swapents oom_score_adj name
[102694.044647] [ 427] 0 427 10964 3042 24 3 0 0 systemd-journal
[102694.044649] [ 487] 0 487 23693 164 17 3 0 0 lvmetad
[102694.044651] [ 518] 0 518 10790 764 24 3 0 -1000 systemd-udevd
[102694.044653] [ 894] 100 894 25081 426 20 3 0 0 systemd-timesyn
[102694.044654] [ 1065] 0 1065 4031 505 12 3 0 0 dhclient
[102694.044656] [ 1195] 0 1195 1306 30 8 3 0 0 iscsid
[102694.044658] [ 1196] 0 1196 1431 880 8 3 0 -17 iscsid
[102694.044659] [ 1198] 104 1198 64099 631 27 3 0 0 rsyslogd
[102694.044661] [ 1200] 0 1200 1100 293 8 3 0 0 acpid
[102694.044663] [ 1202] 0 1202 6932 495 17 3 0 0 cron
[102694.044664] [ 1204] 107 1204 10726 487 25 3 0 -900 dbus-daemon
[102694.044666] [ 1215] 0 1215 58693 314 16 3 0 0 lxcfs
[102694.044668] [ 1225] 0 1225 68653 612 36 3 0 0 accounts-daemon
[102694.044669] [ 1230] 0 1230 16380 591 35 3 0 -1000 sshd
[102694.044671] [ 1232] 0 1232 7170 539 18 3 0 0 systemd-logind
[102694.044673] [ 1234] 0 1234 6511 395 18 3 0 0 atd
[102694.044674] [ 1236] 0 1236 54105 1957 30 5 0 0 snapd
[102694.044676] [ 1254] 0 1254 3344 37 11 3 0 0 mdadm
[102694.044677] [ 1264] 0 1264 69295 575 38 3 0 0 polkitd
[102694.044679] [ 1337] 0 1337 4903 388 14 3 0 0 irqbalance
[102694.044680] [ 1349] 0 1349 3619 404 12 3 0 0 agetty
[102694.044682] [ 1351] 0 1351 3665 345 12 3 0 0 agetty
[102694.044684] [ 1450] 1000 1450 11312 530 26 3 0 0 systemd
[102694.044685] [ 1454] 1000 1454 15318 499 33 3 0 0 (sd-pam)
[102694.044687] [ 2796] 0 2796 8235716 8126137 15953 37 0 0 cockroach
[102694.044689] [ 5501] 0 5501 23842 676 51 3 0 0 sshd
[102694.044690] [ 5556] 1000 5556 23842 293 49 3 0 0 sshd
[102694.044692] [ 5559] 1000 5559 5348 899 15 3 0 0 bash
[102694.044694] [ 6103] 0 6103 23842 648 49 3 0 0 sshd
[102694.044695] [ 6157] 1000 6157 23842 389 47 3 0 0 sshd
[102694.044697] [ 6158] 1000 6158 5347 884 15 3 0 0 bash
[102694.044698] [ 6172] 0 6172 13940 548 32 3 0 0 sudo
[102694.044700] [ 6173] 0 6173 12751 520 29 3 0 0 su
[102694.044701] [ 6174] 0 6174 4991 520 14 3 0 0 bash
[102694.044703] [ 6184] 0 6184 49993 8167 61 5 0 0 cockroach
[102694.044704] Out of memory: Kill process 2796 (cockroach) score 958 or sacrifice child
[102694.049713] Killed process 2796 (cockroach) total-vm:32942864kB, anon-rss:32504548kB, file-rss:0kB

==========end message=============

// how to reproduce
// set up schema
cockroach sql -e 'CREATE DATABASE test'
cockroach user set crdb
cockroach sql -e 'GRANT ALL ON DATABASE test TO crdb'
cockroach sql --database=test --user=crdb -e 'CREATE TABLE lorem (id INT PRIMARY KEY, ipsum TEXT)'
cockroach sql --database=test --user=crdb -e 'CREATE TABLE digits (id INT PRIMARY KEY, digit INT)'

// run against below script
$> time ab -c 250 -n 400000 http:///digits-rr.php

//dump a bunch of random ints into table
//digits-rr.php
<?php
error_reporting(E_ALL);

try {
$conn = new PDO("pgsql:dbname=test;host=;port=26257",
"crdb",
null,
[ PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
PDO::ATTR_EMULATE_PREPARES => true ]);

$conn->exec("insert into test.digits values(unique_rowid(), " . random_int(0, PHP_INT_MAX) . ");");
$conn->exec("insert into test.digits values(unique_rowid(), " . random_int(0, PHP_INT_MAX) . ");");
$conn->exec("insert into test.digits values(unique_rowid(), " . random_int(0, PHP_INT_MAX) . ");");
$conn->exec("insert into test.digits values(unique_rowid(), " . random_int(0, PHP_INT_MAX) . ");");
$conn->exec("insert into test.digits values(unique_rowid(), " . random_int(0, PHP_INT_MAX) . ");");
$conn->exec("insert into test.digits values(unique_rowid(), " . random_int(0, PHP_INT_MAX) . ");");
$conn->exec("insert into test.digits values(unique_rowid(), " . random_int(0, PHP_INT_MAX) . ");");
$conn->exec("insert into test.digits values(unique_rowid(), " . random_int(0, PHP_INT_MAX) . ");");
$conn->exec("insert into test.digits values(unique_rowid(), " . random_int(0, PHP_INT_MAX) . ");");
$conn->exec("insert into test.digits values(unique_rowid(), " . random_int(0, PHP_INT_MAX) . ");");
$conn->exec("insert into test.digits values(unique_rowid(), " . random_int(0, PHP_INT_MAX) . ");");
$conn->exec("insert into test.digits values(unique_rowid(), " . random_int(0, PHP_INT_MAX) . ");");
$conn->exec("insert into test.digits values(unique_rowid(), " . random_int(0, PHP_INT_MAX) . ");");
$conn->exec("insert into test.digits values(unique_rowid(), " . random_int(0, PHP_INT_MAX) . ");");
$conn->exec("insert into test.digits values(unique_rowid(), " . random_int(0, PHP_INT_MAX) . ");");
$conn->exec("insert into test.digits values(unique_rowid(), " . random_int(0, PHP_INT_MAX) . ");");
$conn->exec("insert into test.digits values(unique_rowid(), " . random_int(0, PHP_INT_MAX) . ");");
$conn->exec("insert into test.digits values(unique_rowid(), " . random_int(0, PHP_INT_MAX) . ");");

} catch (Exception $e) {
print $e->getMessage() . "\r\n";
exit(1);
}
?>

// dump lots of text into table
//test-multiple-insert.php
$> time ab -c 250 -n 400000 http:///text-multiple-insert.php

<?php
error_reporting(E_ALL);

try {
$conn = new PDO("pgsql:dbname=test;host=;port=26257",
"crdb",
null,
[ PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
PDO::ATTR_EMULATE_PREPARES => true ]);
$data = "insert into test.lorem values(unique_rowid(), '

Wants pawn term, dare worsted ladle gull hoe lift wetter murder inner ladle cordage,
honor itch offer lodge dock florist. Disk ladle gull orphan worry ladle cluck wetter
putty ladle rat hut, an fur disk raisin pimple colder Ladle Rat Rotten Hut.

Wan moaning, Rat Rotten Huts murder colder inset, Ladle Rat Rotten Hut, heresy
ladle basking winsome burden barter an shirker cockles. Tick disk ladle basking tutor
cordage offer groin-murder hoe lifts honor udder site offer florist. Shaker lake!
Dun stopper laundry wrote! An yonder nor sorghum-stenches, dun stopper torque wet strainers!

Hoe-cake, murder, resplendent Ladle Rat Rotten Hut, an tickle ladle basking an
stuttered oft. Honor wrote tutor cordage offer groin-murder, Ladle Rat Rotten Hut
mitten anomalous woof. Wail, wail, wail! set disk wicket woof, Evanescent Ladle
Rat Rotten Hut! Wares are putty ladle gull goring wizard ladle basking?

Armor goring tumor groin-murders, reprisal ladle gull. Grammars seeking bet.
Armor ticking arson burden barter an shirker cockles.

O hoe! Heifer blessing woke, setter wicket woof, butter taught tomb shelf, Oil
tickle shirt court tutor cordage offer groin-murder. Oil ketchup wetter letter, an den - O bore!

Soda wicket woof tucker shirt court, an whinney retched a cordage offer groin-murder,
picked inner widow, an sore debtor pore oil worming worse lion inner bet. Inner flesh,
disk abdominal woof lipped honor bet an at a rope. Den knee poled honor groin-murders
nut cup an gnat-gun, any curdled dope inner bet.

Inner ladle wile, Ladle Rat Rotten Hut a raft attar cordage, an ranker dough belle.
Comb ink, sweat hard, setter wicket woof, disgracing is verse. Ladle Rat Rotten
Hut entity bet rum an stud buyer groin-murders bet.

O Grammar! crater ladle gull, Wood bag icer gut! A nervous sausage bag ice!

Battered lucky chew whiff, doling, whiskered disk ratchet woof, wetter wicket small.

O Grammar, water bag noise! A nervous sore suture anomolous prognosis!

Battered small your whiff, insert a woof, ants mouse worse waddling.

O Grammar, water bag mousy gut! A nervous sore suture bag mouse!

Daze worry on-forger-nut gulls lest warts. Oil offer sodden, thoroughing offer
carvers an sprinkling otter bet, disk curl and bloat-thursday woof ceased pore Ladle Rat Rotten Hut an garbled erupt.

Mural: Yonder nor sorghum stenches shut ladle gulls stopper torque wet strainers.

');";

$conn->exec($data);
$conn->exec($data);
$conn->exec($data);
$conn->exec($data);
$conn->exec($data);

} catch (Exception $e) {
print $e->getMessage() . "\r\n";
exit(1);
}
?>

This breaks the cluster everytime - multinode failure
$> time ab -c 250 -n 400000 http:///digits-mixed.php

// digits-mixed.php
<?php
error_reporting(E_ALL);

try {
$conn = new PDO("pgsql:dbname=test;host=;port=26257",
"crdb",
null,
[ PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
PDO::ATTR_EMULATE_PREPARES => true ]);

$conn->exec("insert into test.digits values(unique_rowid(), " . random_int(0, PHP_INT_MAX) . ");");
$conn->exec("insert into test.digits values(unique_rowid(), " . random_int(0, PHP_INT_MAX) . ");");
$conn->exec("insert into test.digits values(unique_rowid(), " . random_int(0, PHP_INT_MAX) . ");");
$conn->exec("insert into test.digits values(unique_rowid(), " . random_int(0, PHP_INT_MAX) . ");");

$conn->exec("SELECT * FROM test.lorem LIMIT 10;");
$conn->exec("SELECT * FROM test.lorem LIMIT 5;");

$conn->exec("insert into test.digits values(unique_rowid(), " . random_int(0, PHP_INT_MAX) . ");");
$conn->exec("insert into test.digits values(unique_rowid(), " . random_int(0, PHP_INT_MAX) . ");");
$conn->exec("insert into test.digits values(unique_rowid(), " . random_int(0, PHP_INT_MAX) . ");");
$conn->exec("insert into test.digits values(unique_rowid(), " . random_int(0, PHP_INT_MAX) . ");");

$conn->exec("SELECT * FROM test.lorem LIMIT 5;");
$conn->exec("SELECT * FROM test.lorem LIMIT 5;");

$conn->exec("insert into test.digits values(unique_rowid(), " . random_int(0, PHP_INT_MAX) . ");");
$conn->exec("insert into test.digits values(unique_rowid(), " . random_int(0, PHP_INT_MAX) . ");");
$conn->exec("insert into test.digits values(unique_rowid(), " . random_int(0, PHP_INT_MAX) . ");");
$conn->exec("insert into test.digits values(unique_rowid(), " . random_int(0, PHP_INT_MAX) . ");");
$conn->exec("insert into test.digits values(unique_rowid(), " . random_int(0, PHP_INT_MAX) . ");");

$conn->exec("SELECT * FROM test.lorem LIMIT 5;");
$conn->exec("SELECT * FROM test.lorem LIMIT 5;");
$conn->exec("SELECT * FROM test.lorem LIMIT 5;");
$conn->exec("SELECT * FROM test.lorem LIMIT 5;");
$conn->exec("SELECT * FROM test.lorem LIMIT 5;");

$conn->exec("insert into test.digits values(unique_rowid(), " . random_int(0, PHP_INT_MAX) . ");");
$conn->exec("insert into test.digits values(unique_rowid(), " . random_int(0, PHP_INT_MAX) . ");");
$conn->exec("insert into test.digits values(unique_rowid(), " . random_int(0, PHP_INT_MAX) . ");");
$conn->exec("insert into test.digits values(unique_rowid(), " . random_int(0, PHP_INT_MAX) . ");");
$conn->exec("insert into test.digits values(unique_rowid(), " . random_int(0, PHP_INT_MAX) . ");");

$conn->exec("SELECT * FROM test.lorem LIMIT 5;");
$conn->exec("SELECT * FROM test.lorem LIMIT 5;");
$conn->exec("SELECT * FROM test.lorem LIMIT 5;");
$conn->exec("SELECT * FROM test.lorem LIMIT 5;");
$conn->exec("SELECT * FROM test.lorem LIMIT 5;");
$conn->exec("SELECT * FROM test.lorem LIMIT 5;");
$conn->exec("SELECT * FROM test.lorem LIMIT 5;");
$conn->exec("SELECT * FROM test.lorem LIMIT 5;");
$conn->exec("SELECT * FROM test.lorem LIMIT 5;");
$conn->exec("SELECT * FROM test.lorem LIMIT 5;");
$conn->exec("SELECT * FROM test.lorem LIMIT 5;");

} catch (Exception $e) {
print $e->getMessage() . "\r\n";
exit(1);
}
?>

petermattis commented 7 years ago

I'll take a look today. Does anyone know PHP well enough to extract the reproduction into a series of SQL statements?

Update: I've figured out the PHP stuff. Still trying to reproduce the memory leak.

petermattis commented 7 years ago

I was able to reproduce what looks like a memory leak when running digits-mixed.php:

Haven't done any debugging to figure out what is going, but this is reproducible.

petermattis commented 7 years ago

The above repro was on top of https://github.com/cockroachdb/cockroach/pull/14489 which exacerbates the problem, but it still seems to happen on master. The problem is in the construction of the cacheRequest inserted into the timestamp cache. Specifically https://github.com/cockroachdb/cockroach/blob/master/pkg/storage/replica.go#L1595-L1610:

            case *roachpb.ScanRequest:
                resp := br.Responses[i].GetInner().(*roachpb.ScanResponse)
                if ba.Header.MaxSpanRequestKeys != 0 &&
                    ba.Header.MaxSpanRequestKeys == int64(len(resp.Rows)) {
                    // If the scan requested a limited number of results and we hit the
                    // limit, truncate the span of keys to add to the timestamp cache
                    // to those that were returned.
                    //
                    // NB: We need to include any key read by the operation, even if
                    // not returned, in the span added to the timestamp cache in order
                    // to prevent phantom read anomalies. That means we can only
                    // perform this truncate if the scan requested a limited number of
                    // results and we hit that limit.
                    header.EndKey = resp.Rows[len(resp.Rows)-1].Key.Next()
                }
                cr.reads = append(cr.reads, header)

For a ScanRequest, if we limited the scan then the cacheRequest contains a span which holds on to the last row in the ScanResponse (assuming Next didn't make a copy of the key which is the common case). Note that the keys and the values in the ScanResponse are allocated using a ByteAllocator which reduces allocations but has the drawback that a larger chunk of memory will be pinned if any portion of the ByteAllocator buffer is pinned.

There is a relatively straightforward change to make a copy of header.EndKey, but I'd like to think about this over night before committing to that direction.

bdarnell commented 7 years ago

Making a copy here seems like the right move to me. I wonder if we could zero-fill ByteAllocator buffers when they should go out of scope to help detect cases like this. I don't think there's a good time to do that, though.

petermattis commented 7 years ago

I wonder if we could zero-fill ByteAllocator buffers when they should go out of scope to help detect cases like this. I don't think there's a good time to do that, though.

Yeah, that would be good, but I don't see where to do it either.

cockroachdb / cockroach

cluster crashing from OOM #14457