Panic runtime error: index out of range when running 6 node tpc-c without range merging

awoods187 commented 5 years ago

Describe the problem

Node died while running tpc-c on a six node cluster with range merging turned off. In the shell I saw

Error: error in payment: EOF
Error:  exit status 1

To Reproduce

export CLUSTER=andy-base
roachprod create $CLUSTER -n 7 --clouds=aws  --aws-machine-type-ssd=c5d.4xlarge
roachprod run $CLUSTER:1-6 -- 'sudo umount /mnt/data1; sudo mount -o discard,defaults,nobarrier /dev/nvme1n1 /mnt/data1/; mount | grep /mnt/data1'
roachprod stage $CLUSTER:1-6 cockroach
roachprod stage $CLUSTER:7 workload
roachprod start $CLUSTER:1-6
roachprod sql $CLUSTER:1 -- -e 'SET CLUSTER SETTING kv.range_merge.queue_enabled = false'
roachprod adminurl --open $CLUSTER:1
roachprod run $CLUSTER:1 -- "./cockroach workload fixtures import tpcc --warehouses=5000 --db=tpcc"
roachprod run $CLUSTER:7 "./workload run tpcc --ramp=5m --warehouses=3400 --duration=15m --split --scatter {pgurl:1-6}"

Expected behavior Completing tpc-c

Additional data / screenshots

Environment: v2.2.0-alpha.20181217-820-g645c0c9

From log to stderr https://gist.github.com/awoods187/a44b3505f7921aa152d8c0b488afdccc

ERROR: [n5,client=172.31.42.127:50898,user=root,txn=44987ebc] a panic has occurred!
runtime error: index out of range

RaduBerinde commented 5 years ago

Just 🍿

nvanbenschoten commented 5 years ago

Response to https://github.com/cockroachdb/cockroach/issues/34241#issuecomment-471538777:

@tbg Our haphazard approach to deep vs. shallow copies of Transaction objects has always concerned me. I'm suprised it took this long to bite us. I wouldn't be opposed to immediately cleaning up the ownership model for these objects and making deep copies cheap enough to use in the few cases where we actually need a clone. It doesn't seem like there should be too many - maybe just right on the local RPC fast-path and then the few places where Store/Replica hands off a Txn to a background process (e.g. concurrency handling). I also think we could batch allocate a transaction and its ObservedTimestamps to get a deep copy down to the same number of allocations as a shallow copy. That might be worth doing regardless because of this.

It's unclear to me at the moment whether this would be too invasive of a change for this late in the release cycle. A smaller change like what you're suggesting would also do for the current release, although I think the real culprit is the omission of a copy here.

Either way, want me to take over this issue?

tbg commented 5 years ago

Either way, want me to take over this issue?

That'd be great. Happy to review.

I'm suprised it took this long to bite us.

ditto. I had forgotten how haphazard it actually was until I looked at it. I was also disappointed to find out that we won't be able to use the race detector to guard from this class of issues. You need high enough frequency of operations to trigger this reliably and under race everything becomes dirt slow. I had overestimated the efficiency of the race transport in detecting this sort of thing, though the ObservedTimestamps are also treacherous in that the issue doesn't occur unless multiple nodes are touched thanks to the oddities in UpdateObservedTimestamp. (The fact that most of SQL never runs the race tests in the first place isn't helping).

I wouldn't be opposed to immediately cleaning up the ownership model for these objects and making deep copies cheap enough to use in the few cases where we actually need a clone.

Generally sounds good, though I'd still prefer to make a first PR that targets a fix and adds high-level testing (I think @ajwerner's repro should be a nightly roachtest, and generally we'll also want the initial one that Andy ran. These large machines have been tickling a class of issues dormant for a long time, who knows what else is out there).

Once that's in I support rattling that cage a bit if your bandwidth permits.

A smaller change like what you're suggesting would also do for the current release, although I think the real culprit is the omission of a copy here.

Could you elaborate? We're taking a shallow copy of the txn in evaluateBatch (and nil out the ObservedTimestamps), which I think means anything the caller gets back is either something that isn't modified anywhere or is shallow. (I don't have great confidence in that statement, of course, and wish it were more transparent). Cloning were you suggest would add another shallow copy of the txn that ends up on the heap. Ideally, of course, reply.Txn wouldn't be a *Transaction but just a delta that's memory-efficient and automatically owned by the caller when it receives it. Either way, you're going to do the right thing and I look forward to reviewing.

ajwerner commented 5 years ago

You may be interested to know that I just hit this on a cluster composed of 4-core nodes.

andreimatei commented 5 years ago

Either way, want me to take over this issue?

That'd be great. Happy to review.

A clear contract around these things would be gold. Some previous discussions was in #30485 and then I had to revert some crap in #30773 because I introduced some mysterious races. And I never got back to investigating them and reverting the revert. If you also want to take over #30773, you'd get my peer ack.

cockroachdb / cockroach

Panic runtime error: index out of range when running 6 node tpc-c without range merging #34241