cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

Panic: non-nullable column "jobs:status" during TPC-C import on 60-node 4-CPU cluster #34878

Closed awoods187 closed 5 years ago

awoods187 commented 5 years ago

Describe the problem

Two nodes died during a TPC-C import on a 60-node 4-CPU cluster, with this failure in the CLI:

Andrews-MBP-2:cockroach andrewwoods$ roachprod run $CLUSTER:1 -- "./cockroach workload fixtures import tpcc --warehouses=10000 --db=tpcc"
Error: importing fixture: importing table stock: read tcp 127.0.0.1:42464->127.0.0.1:26257: read: connection reset by peer
Error:  exit status 1

To Reproduce

export CLUSTER=andy-60
roachprod create $CLUSTER -n 61 --clouds=aws --aws-machine-type-ssd=c5d.xlarge
roachprod run $CLUSTER -- "DEV=$(mount | grep /mnt/data1 | awk '{print $1}'); sudo umount /mnt/data1; sudo mount -o discard,defaults,nobarrier ${DEV} /mnt/data1/; mount | grep /mnt/data1"
roachprod stage $CLUSTER:1-60 cockroach
roachprod stage $CLUSTER:61 workload
roachprod start $CLUSTER:1-60 --racks=20 -e COCKROACH_ENGINE_MAX_SYNC_DURATION=24h
roachprod run $CLUSTER:1 -- "./cockroach workload fixtures import tpcc --warehouses=10000 --db=tpcc"

Expected behavior

Import to complete

Additional data / screenshots

The same panic happened on Node 1 and Node 8:

panic: Non-nullable column "jobs:status" with no value! Index scanned was "primary" with the index key columns (id) and the values (426018841420595201)

goroutine 45470 [running]:
github.com/cockroachdb/cockroach/pkg/sql/row.(*Fetcher).finalizeRow(0xc01915f4a0, 0x38e8160, 0xc012e2aff0)
    /go/src/github.com/cockroachdb/cockroach/pkg/sql/row/fetcher.go:1258 +0xa36
github.com/cockroachdb/cockroach/pkg/sql/row.(*Fetcher).NextRow(0xc01915f4a0, 0x38e8160, 0xc012e2aff0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
    /go/src/github.com/cockroachdb/cockroach/pkg/sql/row/fetcher.go:994 +0x182
github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.(*rowFetcherWrapper).Next(0xc011721900, 0xc007031d40, 0x1, 0x1, 0x1)
    /go/src/github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/tablereader.go:160 +0x4a
github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.(*tableReader).Next(0xc01915f000, 0x38e80a0, 0xc012e2aff0, 0xc007031d80, 0x2c36a00)
    /go/src/github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/tablereader.go:277 +0x4c
github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.Run(0x38e80a0, 0xc007031d80, 0x38f4420, 0xc01915f000, 0x38d9860, 0xc00d64ca80)
    /go/src/github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/base.go:175 +0x35
github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.(*ProcessorBase).Run(0xc01915f000, 0x38e80a0, 0xc007031d80)
    /go/src/github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/processors.go:801 +0x92
github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.(*Flow).startInternal.func1(0xc00ae70540, 0xc00da24110, 0x0)
    /go/src/github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/flow.go:589 +0x67
created by github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.(*Flow).startInternal
    /go/src/github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/flow.go:588 +0x342
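
For context, the check that fires here is in the row fetcher's finalizeRow step: after decoding all KV pairs for one index key, it verifies that every non-nullable column actually received a value, and fails loudly if one did not. Below is a simplified, self-contained sketch of that kind of check; the types and names are illustrative only, not the actual cockroach code.

package main

import "fmt"

// col describes one column of the scanned table; Nullable mirrors the
// schema's NULL / NOT NULL constraint.
type col struct {
	Name     string
	Nullable bool
}

// row maps column names to decoded values; a nil entry means no value was
// decoded for that column from the underlying KV pairs.
type row map[string]*string

// finalizeRow mimics the sanity check in the stack trace above: a
// non-nullable column with no decoded value means the stored row is
// inconsistent, so it errors out rather than returning a bad row.
func finalizeRow(index, keyValues string, cols []col, r row) error {
	for _, c := range cols {
		if r[c.Name] == nil && !c.Nullable {
			return fmt.Errorf(
				"Non-nullable column %q with no value! Index scanned was %q with the index key columns (id) and the values (%s)",
				c.Name, index, keyValues)
		}
	}
	return nil
}

func main() {
	cols := []col{{Name: "jobs:status", Nullable: false}}
	// Simulate a jobs row whose status column was never decoded.
	fmt.Println(finalizeRow("primary", "426018841420595201", cols, row{}))
}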

cockroach.log

cockroach.log

Environment: v2.2.0-alpha.20190211-112-g72edf20

awoods187 commented 5 years ago

A third node died due to OOM 30 minutes after the first two nodes, well after the import had failed. From sudo dmesg:

[ 1946.769407] Out of memory: Kill process 5826 (cockroach) score 968 or sacrifice child
[ 1946.772013] Killed process 5826 (cockroach) total-vm:11531636kB, anon-rss:7548560kB, file-rss:1164kB

memprof.fraction_system_memory.000000007273963520_2019-02-13T18_30_15.158.zip cockroach.log

nvanbenschoten commented 5 years ago

Here are the top 6 sources of memory allocation in that profile:

(pprof) top
Showing nodes accounting for 1533.01MB, 97.51% of 1572.20MB total
Dropped 266 nodes (cum <= 7.86MB)
Showing top 10 nodes out of 101
      flat  flat%   sum%        cum   cum%
  413.35MB 26.29% 26.29%   413.35MB 26.29%  github.com/cockroachdb/cockroach/pkg/storage.encodeRaftCommand
  328.57MB 20.90% 47.19%   328.57MB 20.90%  github.com/cockroachdb/cockroach/pkg/roachpb.(*AddSSTableRequest).Unmarshal
  304.99MB 19.40% 66.59%   304.99MB 19.40%  github.com/cockroachdb/cockroach/pkg/storage/storagepb.(*ReplicatedEvalResult_AddSSTable).Unmarshal
  166.48MB 10.59% 77.18%   166.48MB 10.59%  github.com/cockroachdb/cockroach/vendor/go.etcd.io/etcd/raft/raftpb.(*Entry).Unmarshal
  117.75MB  7.49% 84.67%   117.75MB  7.49%  github.com/cockroachdb/cockroach/pkg/storage/engine.gobytes
   96.50MB  6.14% 90.81%    96.50MB  6.14%  bytes.makeSlice

That looks pretty standard for a fixture import. We shouldn't be getting into fights with the OOM killer when only ~1500 MB are allocated. The dmesg output (about 7.5 GB of anon RSS at kill time) indicates that the heap profile missed a rapid allocation of memory.
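
For what it's worth, the gap between the ~1.5 GB profile and the ~7.5 GB RSS in dmesg is consistent with how a threshold-triggered heap profiler behaves: it only samples at discrete points, so an allocation burst that grows and gets the process killed between checks never lands in a dump. A minimal sketch of such a sampler, assuming a hypothetical threshold and output path (this is not cockroach's actual memprof code):

package main

import (
	"fmt"
	"os"
	"runtime"
	"runtime/pprof"
	"time"
)

// dumpHeapProfileAbove writes a heap profile whenever in-use heap memory
// crosses thresholdBytes. Because the check only runs once per interval,
// an allocation spike that grows and gets the process OOM-killed between
// two ticks is never captured in any dump.
func dumpHeapProfileAbove(thresholdBytes uint64, interval time.Duration, path string) {
	for range time.Tick(interval) {
		var ms runtime.MemStats
		runtime.ReadMemStats(&ms)
		if ms.HeapInuse < thresholdBytes {
			continue
		}
		f, err := os.Create(path)
		if err != nil {
			fmt.Fprintln(os.Stderr, "creating profile:", err)
			continue
		}
		// Heap profiles record sampled allocation sites, not a snapshot of
		// raw memory, so even a dump that does happen can underrepresent
		// large short-lived allocations.
		if err := pprof.WriteHeapProfile(f); err != nil {
			fmt.Fprintln(os.Stderr, "writing profile:", err)
		}
		f.Close()
	}
}

func main() {
	go dumpHeapProfileAbove(1<<30, 10*time.Second, "heap.pprof")
	select {} // stand-in for the real server's main loop
}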

awoods187 commented 5 years ago

This repros super reliably; I hit it 10 minutes into the import on a new cluster, killing 4 nodes.

awoods187 commented 5 years ago

After talking it over with @tbg, I think this is in SQL Execution land. @jordanlewis

awoods187 commented 5 years ago

@dt any thoughts on the import OOM? (@jordanlewis will take a look at the Jobs table panic)

awoods187 commented 5 years ago

FYI this reproduces every time (most recently with OOMs).

jordanlewis commented 5 years ago

@yuzefovich please try to reproduce this

danhhz commented 5 years ago

I just repro'd the "Non-nullable column "jobs:status" with no value" panic too.

yuzefovich commented 5 years ago

When running cockroach-v19.1.0-beta.20190304-160-g5fdb938 on a 60-node cluster, I'm constantly hitting OOMs, in the sense that the cockroach process gets killed. I tried different numbers of warehouses (10k, 5k, 1k) and hit OOM every time; on 500 warehouses, a node once panicked itself due to OOM. After wiping the cluster and attempting to put a locally built 02/11 alpha on each node, I got scp: cockroach-v2.2.0-alpha.20190211: No space left on device on the node that had panicked on 500 warehouses.

So far I haven't seen the panic in the title of this issue. @dt I wonder whether there have been any changes between the 02/11 alpha and the 03/04 beta that could explain the increase in memory usage during imports.

Will try running 02/11 alpha next.

yuzefovich commented 5 years ago

On a locally built 02/11 alpha on a 60-node cluster (the same cluster was wiped and reused for all of the runs):

1k warehouses: cockroach process killed.
500 warehouses: import succeeded.
1k warehouses: cockroach process killed on two nodes.
1k warehouses: cockroach process killed on two nodes, and one node ran out of disk space.

(I think the later runs lasted longer, i.e. I looked at the cluster later than in the first 1k run.) I still haven't hit the panic.

knz commented 5 years ago

This may be a dup of #35040, which I've been working on recently.

yuzefovich commented 5 years ago

While looking into this with @dt, we found the reason for the two out-of-disk errors I hit last week. What happens is that / is mounted on a small block device, so by default /tmp lives on that small device as well. Whenever the OOM killer kills the cockroach process for using too much memory, a core dump is taken and stored under /tmp. That core dump can be gigabytes in size, but the small block device only has around 6.5 GB total, so depending on the size of the core dump it occasionally couldn't fit on the disk.

The solution is to mount /tmp onto a data device that has a lot more space. On an already existing cluster it can be done with:

roachprod run $CLUSTER "sudo mkdir -p /mnt/data1/tmp && sudo chmod 777 /mnt/data1/tmp && sudo mount -o bind /mnt/data1/tmp /tmp"

or it can be added to the "no barrier" command:

roachprod run $CLUSTER -- "DEV=$(mount | grep /mnt/data1 | awk '{print $1}'); sudo umount /mnt/data1; sudo mount -o discard,defaults,nobarrier ${DEV} /mnt/data1/; mount | grep /mnt/data1; sudo mkdir -p /mnt/data1/tmp && sudo chmod 777 /mnt/data1/tmp && sudo mount -o bind /mnt/data1/tmp /tmp"

danhhz commented 5 years ago

Oof. Sorry you had to find this as well. I filed this a couple days ago. https://github.com/cockroachdb/cockroach/issues/35621

yuzefovich commented 5 years ago

Finally, the import of 10k warehouses succeeded when I ran SET CLUSTER SETTING kv.import.batch_size='8MB'; on the cluster. The default value is 32MB, so I tried 16MB first but hit the OOM described in #35773. Lowering that setting slows down the import, so the whole process took about 2 hours 15 minutes with 8MB. Since I didn't encounter the panic described in this issue, I'm guessing it was fixed by Raphael's patch #35371. In my opinion, we can lower the severity of this issue and see whether we hit the panic again.
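
For anyone scripting this, the setting change can also be applied over the SQL wire protocol right before kicking off the import. A small sketch using database/sql with the lib/pq driver; the connection string is a placeholder for any node in the cluster:

package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq" // CockroachDB speaks the Postgres wire protocol.
)

func main() {
	// Placeholder connection string; point it at any node of the cluster.
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Drop the import batch size from the 32MB default; in the runs above,
	// 16MB still OOMed and 8MB let the 10k-warehouse import finish.
	if _, err := db.Exec("SET CLUSTER SETTING kv.import.batch_size = '8MB'"); err != nil {
		log.Fatal(err)
	}
	log.Println("kv.import.batch_size set to 8MB")
}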

yuzefovich commented 5 years ago

I lowered the priority since I haven't encountered the panic in the title of the issue, but I'm not sure whether it was actually fixed, so let's keep the issue open. The OOM discussion will continue in #35773.

jordanlewis commented 5 years ago

Closing as unactionable - Yahor tried and failed to reproduce the panic.

awoods187 commented 5 years ago

I just reproduced this right now on andy-60 (two screenshots attached).

jordanlewis commented 5 years ago

This is an OOM during import, without the panic you reported. That's why we had the other issue opened, which is correctly assigned to the Bulk IO team. This does not reproduce the problem that this issue was opened for, so I'm going to close again unless you have counterevidence.

jordanlewis commented 5 years ago

(#35773 was the OOM during import issue)

awoods187 commented 5 years ago

I double-checked all of the failed nodes and none of them hit the panic on Non-nullable column "jobs:status" with no value! Looks like it has been solved!

irfansharif commented 4 years ago

Looks like it has been solved!

Don't think it has. https://github.com/cockroachdb/cockroach/issues/36851#issuecomment-611474773.