cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com
Other
29.77k stars 3.76k forks source link

storage: panic in block cache refcnt.{acquire,release} on arm64 #84971

Open cockroach-teamcity opened 2 years ago

cockroach-teamcity commented 2 years ago

This issue was autofiled by Sentry. It represents a crash or reported error on a live cluster with telemetry enabled.

Sentry link: https://sentry.io/organizations/cockroach-labs/issues/3451229717/?referrer=webhooks_plugin

Panic message:

panic.go:965: panic: × (1) attached stack trace -- stack trace: | runtime.gopanic | /usr/local/go/src/runtime/panic.go:965 | github.com/cockroachdb/pebble/internal/cache.(refcnt).release | /go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/internal/cache/refcnt_normal.go:38 | github.com/cockroachdb/pebble/internal/cache.(Value).release | /go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/internal/cache/value.go:43 | github.com/cockroachdb/pebble/internal/cache.(Cache).Free | /go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/internal/cache/clockpro.go:706 | github.com/cockroachdb/pebble/sstable.(Reader).readBlock | /go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/sstable/reader.go:1946 | github.com/cockroachdb/pebble/sstable.(singleLevelIterator).loadBlock | /go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/sstable/reader.go:292 | github.com/cockroachdb/pebble/sstable.(singleLevelIterator).skipForward | /go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/sstable/reader.go:733 | github.com/cockroachdb/pebble/sstable.(singleLevelIterator).Next | /go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/sstable/reader.go:701 | github.com/cockroachdb/pebble.(levelIter).Next | /go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/level_iter.go:542 | github.com/cockroachdb/pebble.(mergingIter).nextEntry | /go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/merging_iter.go:506 | github.com/cockroachdb/pebble.(mergingIter).Next | /go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/merging_iter.go:1016 | github.com/cockroachdb/pebble.(Iterator).nextUserKey | /go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/iterator.go:288 | github.com/cockroachdb/pebble.(Iterator).findNextEntry | /go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/iterator.go:245 | github.com/cockroachdb/pebble.(Iterator).SeekGEWithLimit | /go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/iterator.go:647 | github.com/cockroachdb/pebble.(Iterator).SeekGE | /go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/iterator.go:598 | github.com/cockroachdb/cockroach/pkg/storage.(pebbleIterator).SeekGE | /go/src/github.com/cockroachdb/cockroach/pkg/storage/pebble_iterator.go:265 | github.com/cockroachdb/cockroach/pkg/storage.ComputeStatsForRange | /go/src/github.com/cockroachdb/cockroach/pkg/storage/mvcc.go:3972 | github.com/cockroachdb/cockroach/pkg/storage.(pebbleIterator).ComputeStats | /go/src/github.com/cockroachdb/cockroach/pkg/storage/pebble_iterator.go:578 | github.com/cockroachdb/cockroach/pkg/kv/kvserver.ComputeRaftLogSize | /go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_raft.go:1934 | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(raftLogQueue).process | /go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/raft_log_queue.go:586 | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(baseQueue).processReplica.func1 | /go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/queue.go:959 | github.com/cockroachdb/cockroach/pkg/util/contextutil.RunWithTimeout | /go/src/github.com/cockroachdb/cockroach/pkg/util/contextutil/context.go:89 | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(baseQueue).processReplica | /go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/queue.go:918 | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(baseQueue).processLoop.func2.1 | /go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/queue.go:838 | github.com/cockroachdb/cockroach/pkg/util/stop.(Stopper).RunAsyncTaskEx.func2 | /go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:442 | runtime.goexit | /usr/local/go/src/runtime/asm_amd64.s:1371 Wraps: (2) panic: × Error types: (1) withstack.withStack (2) errutil.leafError -- report composition: errutil.leafError: panic: × panic.go:965: *withstack.withStack (top exception)

Stacktrace (expand for inline code snippets): /usr/local/go/src/runtime/panic.go#L964-L966 in runtime.gopanic https://github.com/cockroachdb/cockroach/blob/cad00009ffee952065c51687da6b1f71e44a7032/vendor/github.com/cockroachdb/pebble/internal/cache/refcnt_normal.go#L37-L39 in github.com/cockroachdb/pebble/internal/cache.(*refcnt).release https://github.com/cockroachdb/cockroach/blob/cad00009ffee952065c51687da6b1f71e44a7032/vendor/github.com/cockroachdb/pebble/internal/cache/value.go#L42-L44 in github.com/cockroachdb/pebble/internal/cache.(*Value).release https://github.com/cockroachdb/cockroach/blob/cad00009ffee952065c51687da6b1f71e44a7032/vendor/github.com/cockroachdb/pebble/internal/cache/clockpro.go#L705-L707 in github.com/cockroachdb/pebble/internal/cache.(*Cache).Free https://github.com/cockroachdb/cockroach/blob/cad00009ffee952065c51687da6b1f71e44a7032/vendor/github.com/cockroachdb/pebble/sstable/reader.go#L1945-L1947 in github.com/cockroachdb/pebble/sstable.(*Reader).readBlock https://github.com/cockroachdb/cockroach/blob/cad00009ffee952065c51687da6b1f71e44a7032/vendor/github.com/cockroachdb/pebble/sstable/reader.go#L291-L293 in github.com/cockroachdb/pebble/sstable.(*singleLevelIterator).loadBlock https://github.com/cockroachdb/cockroach/blob/cad00009ffee952065c51687da6b1f71e44a7032/vendor/github.com/cockroachdb/pebble/sstable/reader.go#L732-L734 in github.com/cockroachdb/pebble/sstable.(*singleLevelIterator).skipForward https://github.com/cockroachdb/cockroach/blob/cad00009ffee952065c51687da6b1f71e44a7032/vendor/github.com/cockroachdb/pebble/sstable/reader.go#L700-L702 in github.com/cockroachdb/pebble/sstable.(*singleLevelIterator).Next https://github.com/cockroachdb/cockroach/blob/cad00009ffee952065c51687da6b1f71e44a7032/vendor/github.com/cockroachdb/pebble/level_iter.go#L541-L543 in github.com/cockroachdb/pebble.(*levelIter).Next https://github.com/cockroachdb/cockroach/blob/cad00009ffee952065c51687da6b1f71e44a7032/vendor/github.com/cockroachdb/pebble/merging_iter.go#L505-L507 in github.com/cockroachdb/pebble.(*mergingIter).nextEntry https://github.com/cockroachdb/cockroach/blob/cad00009ffee952065c51687da6b1f71e44a7032/vendor/github.com/cockroachdb/pebble/merging_iter.go#L1015-L1017 in github.com/cockroachdb/pebble.(*mergingIter).Next https://github.com/cockroachdb/cockroach/blob/cad00009ffee952065c51687da6b1f71e44a7032/vendor/github.com/cockroachdb/pebble/iterator.go#L287-L289 in github.com/cockroachdb/pebble.(*Iterator).nextUserKey https://github.com/cockroachdb/cockroach/blob/cad00009ffee952065c51687da6b1f71e44a7032/vendor/github.com/cockroachdb/pebble/iterator.go#L244-L246 in github.com/cockroachdb/pebble.(*Iterator).findNextEntry https://github.com/cockroachdb/cockroach/blob/cad00009ffee952065c51687da6b1f71e44a7032/vendor/github.com/cockroachdb/pebble/iterator.go#L646-L648 in github.com/cockroachdb/pebble.(*Iterator).SeekGEWithLimit https://github.com/cockroachdb/cockroach/blob/cad00009ffee952065c51687da6b1f71e44a7032/vendor/github.com/cockroachdb/pebble/iterator.go#L597-L599 in github.com/cockroachdb/pebble.(*Iterator).SeekGE https://github.com/cockroachdb/cockroach/blob/cad00009ffee952065c51687da6b1f71e44a7032/pkg/storage/pebble_iterator.go#L264-L266 in pkg/storage.(*pebbleIterator).SeekGE https://github.com/cockroachdb/cockroach/blob/cad00009ffee952065c51687da6b1f71e44a7032/pkg/storage/mvcc.go#L3971-L3973 in pkg/storage.ComputeStatsForRange https://github.com/cockroachdb/cockroach/blob/cad00009ffee952065c51687da6b1f71e44a7032/pkg/storage/pebble_iterator.go#L577-L579 in pkg/storage.(*pebbleIterator).ComputeStats https://github.com/cockroachdb/cockroach/blob/cad00009ffee952065c51687da6b1f71e44a7032/pkg/kv/kvserver/replica_raft.go#L1933-L1935 in pkg/kv/kvserver.ComputeRaftLogSize https://github.com/cockroachdb/cockroach/blob/cad00009ffee952065c51687da6b1f71e44a7032/pkg/kv/kvserver/raft_log_queue.go#L585-L587 in pkg/kv/kvserver.(*raftLogQueue).process https://github.com/cockroachdb/cockroach/blob/cad00009ffee952065c51687da6b1f71e44a7032/pkg/kv/kvserver/queue.go#L958-L960 in pkg/kv/kvserver.(*baseQueue).processReplica.func1 https://github.com/cockroachdb/cockroach/blob/cad00009ffee952065c51687da6b1f71e44a7032/pkg/util/contextutil/context.go#L88-L90 in pkg/util/contextutil.RunWithTimeout https://github.com/cockroachdb/cockroach/blob/cad00009ffee952065c51687da6b1f71e44a7032/pkg/kv/kvserver/queue.go#L917-L919 in pkg/kv/kvserver.(*baseQueue).processReplica https://github.com/cockroachdb/cockroach/blob/cad00009ffee952065c51687da6b1f71e44a7032/pkg/kv/kvserver/queue.go#L837-L839 in pkg/kv/kvserver.(*baseQueue).processLoop.func2.1 https://github.com/cockroachdb/cockroach/blob/cad00009ffee952065c51687da6b1f71e44a7032/pkg/util/stop/stopper.go#L441-L443 in pkg/util/stop.(*Stopper).RunAsyncTaskEx.func2 /usr/local/go/src/runtime/asm_amd64.s#L1370-L1372 in runtime.goexit
/usr/local/go/src/runtime/panic.go in runtime.gopanic at line 965
vendor/github.com/cockroachdb/pebble/internal/cache/refcnt_normal.go in github.com/cockroachdb/pebble/internal/cache.(*refcnt).release at line 38
vendor/github.com/cockroachdb/pebble/internal/cache/value.go in github.com/cockroachdb/pebble/internal/cache.(*Value).release at line 43
vendor/github.com/cockroachdb/pebble/internal/cache/clockpro.go in github.com/cockroachdb/pebble/internal/cache.(*Cache).Free at line 706
vendor/github.com/cockroachdb/pebble/sstable/reader.go in github.com/cockroachdb/pebble/sstable.(*Reader).readBlock at line 1946
vendor/github.com/cockroachdb/pebble/sstable/reader.go in github.com/cockroachdb/pebble/sstable.(*singleLevelIterator).loadBlock at line 292
vendor/github.com/cockroachdb/pebble/sstable/reader.go in github.com/cockroachdb/pebble/sstable.(*singleLevelIterator).skipForward at line 733
vendor/github.com/cockroachdb/pebble/sstable/reader.go in github.com/cockroachdb/pebble/sstable.(*singleLevelIterator).Next at line 701
vendor/github.com/cockroachdb/pebble/level_iter.go in github.com/cockroachdb/pebble.(*levelIter).Next at line 542
vendor/github.com/cockroachdb/pebble/merging_iter.go in github.com/cockroachdb/pebble.(*mergingIter).nextEntry at line 506
vendor/github.com/cockroachdb/pebble/merging_iter.go in github.com/cockroachdb/pebble.(*mergingIter).Next at line 1016
vendor/github.com/cockroachdb/pebble/iterator.go in github.com/cockroachdb/pebble.(*Iterator).nextUserKey at line 288
vendor/github.com/cockroachdb/pebble/iterator.go in github.com/cockroachdb/pebble.(*Iterator).findNextEntry at line 245
vendor/github.com/cockroachdb/pebble/iterator.go in github.com/cockroachdb/pebble.(*Iterator).SeekGEWithLimit at line 647
vendor/github.com/cockroachdb/pebble/iterator.go in github.com/cockroachdb/pebble.(*Iterator).SeekGE at line 598
pkg/storage/pebble_iterator.go in pkg/storage.(*pebbleIterator).SeekGE at line 265
pkg/storage/mvcc.go in pkg/storage.ComputeStatsForRange at line 3972
pkg/storage/pebble_iterator.go in pkg/storage.(*pebbleIterator).ComputeStats at line 578
pkg/kv/kvserver/replica_raft.go in pkg/kv/kvserver.ComputeRaftLogSize at line 1934
pkg/kv/kvserver/raft_log_queue.go in pkg/kv/kvserver.(*raftLogQueue).process at line 586
pkg/kv/kvserver/queue.go in pkg/kv/kvserver.(*baseQueue).processReplica.func1 at line 959
pkg/util/contextutil/context.go in pkg/util/contextutil.RunWithTimeout at line 89
pkg/kv/kvserver/queue.go in pkg/kv/kvserver.(*baseQueue).processReplica at line 918
pkg/kv/kvserver/queue.go in pkg/kv/kvserver.(*baseQueue).processLoop.func2.1 at line 838
pkg/util/stop/stopper.go in pkg/util/stop.(*Stopper).RunAsyncTaskEx.func2 at line 442
/usr/local/go/src/runtime/asm_amd64.s in runtime.goexit at line 1371
Tag Value
Cockroach Release v21.2.8
Cockroach SHA: cad00009ffee952065c51687da6b1f71e44a7032
Platform linux amd64
Distribution CCL
Environment v21.2.8
Command server
Go Version ``
# of CPUs
# of Goroutines

Jira issue: CRDB-17979

nicktrav commented 2 years ago

The elided error message is pebble: inconsistent reference count (source). Looks like in this case the count became negative.

This is going to be tricky to track down without additional logs / artifacts.

jbowens commented 1 year ago

@sumeerbhola — mind triaging this, or closing it out if it looks like memory corruption

sumeerbhola commented 1 year ago

This does look like memory corruption.

The error is originating in a decompression failure path here https://github.com/cockroachdb/vendored/blob/29bf2554d8835ea730b3a506fd72528d0850d8b1/github.com/cockroachdb/pebble/sstable/reader.go#L1946 The v ought to have a Value.ref of 1, which is what it is created with earlier in the function. There are some preceding calls to r.opts.Cache.Free(v) but they return after that, so can't fall through to this code and decrement ref again (which would cause it to be -1 and result in the panic). There is also the if-block corresponding to this else, that decrements the ref, but that is not an issue either.

sumeerbhola commented 1 year ago

I also changed the Pebble code to trigger that error path on all our tests (which of course fail), but there is no panic.

lbguilherme commented 1 year ago

I got this error on a development single-node cluster v22.2.3. It started after simply turning it on with an empty disk and waiting a few minutes. It panics and then enters in a crash loop trying to restart (Kubernetes). I didn't connect to the cluster or wrote anything to it.

It was running in a ARM machine with Oracle Linux 8.

Panic log:

I230125 17:13:52.201673 1 util/log/file_sync_buffer.go:238 ⋮ [config]   file created at: 2023/01/25 17:13:52
I230125 17:13:52.201690 1 util/log/file_sync_buffer.go:238 ⋮ [config]   running on machine: ‹cockroachdb-0›
I230125 17:13:52.201697 1 util/log/file_sync_buffer.go:238 ⋮ [config]   binary: CockroachDB CCL v22.2.3 (aarch64-unknown-linux-gnu, built 2023/01/23 19:38:24, go1.19.1)
I230125 17:13:52.201705 1 util/log/file_sync_buffer.go:238 ⋮ [config]   arguments: [‹/cockroach/cockroach› ‹start-single-node› ‹--log=sinks: {stderr: {filter: INFO, exit-on-error: false, no-color: tr
ue, auditable: false}}› ‹--insecure› ‹--advertise-host› ‹cockroachdb-0.cockroachdb-internal.default.svc.cluster.local› ‹--http-addr› ‹0.0.0.0› ‹--cache› ‹512MiB› ‹--max-sql-memory› ‹512MiB›]
I230125 17:13:52.201770 1 util/log/file_sync_buffer.go:238 ⋮ [config]   log format (utf8=✓): crdb-v2
I230125 17:13:52.201774 1 util/log/file_sync_buffer.go:238 ⋮ [config]   line format: [IWEF]yymmdd hh:mm:ss.uuuuuu goid [chan@]file:line redactionmark \[tags\] [counter] msg
I230125 17:13:52.201371 1 util/log/flags.go:211  [-] 1  stderr capture started
panic: pebble: inconsistent reference count: -1

goroutine 67 [running]:
github.com/cockroachdb/pebble/internal/cache.(*refcnt).release(0x0?)
        github.com/cockroachdb/pebble/internal/cache/external/com_github_cockroachdb_pebble/internal/cache/refcnt_normal.go:39 +0xc0
github.com/cockroachdb/pebble/internal/cache.(*Value).release(0xfffce5c79e00)
        github.com/cockroachdb/pebble/internal/cache/external/com_github_cockroachdb_pebble/internal/cache/value.go:43 +0x2c
github.com/cockroachdb/pebble/internal/cache.(*entry).setValue(0x40009fcc30?, 0xfffce5e21080?)
        github.com/cockroachdb/pebble/internal/cache/external/com_github_cockroachdb_pebble/internal/cache/entry.go:141 +0x68
github.com/cockroachdb/pebble/internal/cache.(*shard).metaDel(0x4002a00c20, 0xfffce5e21320)
        github.com/cockroachdb/pebble/internal/cache/external/com_github_cockroachdb_pebble/internal/cache/clockpro.go:367 +0x34
github.com/cockroachdb/pebble/internal/cache.(*shard).metaEvict(0x4002a00b50?, 0xfffce5e21320)
        github.com/cockroachdb/pebble/internal/cache/external/com_github_cockroachdb_pebble/internal/cache/clockpro.go:470 +0x98
github.com/cockroachdb/pebble/internal/cache.(*shard).EvictFile(0x4002a00c20, 0x4000d8a420?, 0x41d0ba0?)
        github.com/cockroachdb/pebble/internal/cache/external/com_github_cockroachdb_pebble/internal/cache/clockpro.go:260 +0xbc
github.com/cockroachdb/pebble/internal/cache.(*Cache).EvictFile(...)
        github.com/cockroachdb/pebble/internal/cache/external/com_github_cockroachdb_pebble/internal/cache/clockpro.go:764
github.com/cockroachdb/pebble.(*tableCacheShard).evict(0x4000d8a420, 0x163, 0x400071e188, 0x0)
        github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/table_cache.go:786 +0x268
github.com/cockroachdb/pebble.(*tableCacheContainer).evict(0x41d0ba0?, 0x4008b10798?)
        github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/table_cache.go:152 +0x40
github.com/cockroachdb/pebble.(*DB).doDeleteObsoleteFiles(0x40021a3200, 0x2)
        github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/compaction.go:3096 +0x7bc
github.com/cockroachdb/pebble.(*DB).deleteObsoleteFiles(0x40021a3200, 0x40003cc5b0?, 0x7?)
        github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/compaction.go:2995 +0x38
github.com/cockroachdb/pebble.(*DB).compact1(0x40021a3200, 0x4000527600, 0x0)
        github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/compaction.go:2220 +0x540
github.com/cockroachdb/pebble.(*DB).compact.func1({0x5a784e0, 0x4000943f80})
        github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/compaction.go:2151 +0xd8
runtime/pprof.Do({0x5a78470?, 0x400011a000?}, {{0x40000b2440?, 0xf6f6f7f6f5f6f5f4?, 0xf9f9f8f8f8f7f7f8?}}, 0x400136af78)
        GOROOT/src/runtime/pprof/runtime.go:40 +0x80
github.com/cockroachdb/pebble.(*DB).compact(0xffffffffffffffff?, 0xffffffffffffffff?, 0xffffffffffffffff?)
        github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/compaction.go:2148 +0x7c
created by github.com/cockroachdb/pebble.(*DB).maybeScheduleCompactionPicker
        github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/compaction.go:1869 +0x490
lbguilherme commented 1 year ago

Should I open a new issue?

sumeerbhola commented 1 year ago

Was it the same panic when it was crash looping?

lbguilherme commented 1 year ago

Yes. It seems that it runs for some time, then crashes and restarts, then runs for some time again and repeats. From the logs it seems to be caused by a "stats compaction job" that is scheduled to run.

I have tried with cockroachdb/cockroach:arm64-v22.2.3 and with an unofficial arm64 build from the previous release christoofar/cockroachdb-arm64:21.2.3 on k8s. 22.2 seems to restart almost immediatelly as soon as it starts, in loop. 21.2 seems to take a while before crashing and looping. Perhaps something about the frequency of this job changed?

I tried erasing the storage disk and starting fresh, but that didn't solve the problem. Very odd.

This is a Kubernetes cluster running on Oracle Cloud, with their Ampere CPU.

This is the panic log from 21.2.3:

I230125 19:09:46.798241 413 jobs/job_scheduler.go:179 ⋮ [n1] 761  Starting job for schedule 834265962162159617 (‹"sql-stats-compaction"›); scheduled to run at 2023-01-25 19:00:00 +0000 UTC; next run scheduled for 2023-01-25 20:00:00 +0000 UTC
I230125 19:09:52.955709 10999 jobs/registry.go:1170 ⋮ [n1] 762  AUTO CREATE STATS job 834284398007025665: stepping through state running with error: <nil>
I230125 19:09:54.888868 411 jobs/adopt.go:106 ⋮ [-] 763  claimed 1 jobs
I230125 19:09:54.899317 11036 jobs/adopt.go:247 ⋮ [-] 764  job 834284377865814017: resuming execution
I230125 19:09:54.901200 11039 jobs/registry.go:1170 ⋮ [n1] 765  AUTO SQL STATS COMPACTION job 834284377865814017: stepping through state running with error: <nil>
I230125 19:09:54.904969 11039 sql/compact_sql_stats.go:42 ⋮ [n1,job=834284377865814017] 766  starting sql stats compaction job
E230125 19:09:54.932765 11003 1@util/log/logcrash/crash_reporting.go:140 ⋮ [n1] 767  a panic has occurred!
E230125 19:09:54.932765 11003 1@util/log/logcrash/crash_reporting.go:140 ⋮ [n1] 767 +panic: ‹pebble: inconsistent reference count: -1›
E230125 19:09:54.932765 11003 1@util/log/logcrash/crash_reporting.go:140 ⋮ [n1] 767 +(1) attached stack trace
E230125 19:09:54.932765 11003 1@util/log/logcrash/crash_reporting.go:140 ⋮ [n1] 767 +  -- stack trace:
E230125 19:09:54.932765 11003 1@util/log/logcrash/crash_reporting.go:140 ⋮ [n1] 767 +  | runtime.gopanic
E230125 19:09:54.932765 11003 1@util/log/logcrash/crash_reporting.go:140 ⋮ [n1] 767 +  |    /usr/local/go/src/runtime/panic.go:1038
E230125 19:09:54.932765 11003 1@util/log/logcrash/crash_reporting.go:140 ⋮ [n1] 767 +  | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).Send.func1
E230125 19:09:54.932765 11003 1@util/log/logcrash/crash_reporting.go:140 ⋮ [n1] 767 +  |    /go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/store_send.go:95
E230125 19:09:54.932765 11003 1@util/log/logcrash/crash_reporting.go:140 ⋮ [n1] 767 +  | runtime.gopanic
E230125 19:09:54.932765 11003 1@util/log/logcrash/crash_reporting.go:140 ⋮ [n1] 767 +  |    /usr/local/go/src/runtime/panic.go:1052
E230125 19:09:54.932765 11003 1@util/log/logcrash/crash_reporting.go:140 ⋮ [n1] 767 +  | github.com/cockroachdb/pebble/internal/cache.(*refcnt).release
E230125 19:09:54.932765 11003 1@util/log/logcrash/crash_reporting.go:140 ⋮ [n1] 767 +  |    /go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/internal/cache/refcnt_normal.go:38
E230125 19:09:54.932765 11003 1@util/log/logcrash/crash_reporting.go:140 ⋮ [n1] 767 +  | github.com/cockroachdb/pebble/internal/cache.(*Value).release
E230125 19:09:54.932765 11003 1@util/log/logcrash/crash_reporting.go:140 ⋮ [n1] 767 +  |    /go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/internal/cache/value.go:43
E230125 19:09:54.932765 11003 1@util/log/logcrash/crash_reporting.go:140 ⋮ [n1] 767 +  | github.com/cockroachdb/pebble/internal/cache.(*entry).setValue
E230125 19:09:54.932765 11003 1@util/log/logcrash/crash_reporting.go:140 ⋮ [n1] 767 +  |    /go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/internal/cache/entry.go:141
E230125 19:09:54.932765 11003 1@util/log/logcrash/crash_reporting.go:140 ⋮ [n1] 767 +  | github.com/cockroachdb/pebble/internal/cache.(*shard).runHandCold
E230125 19:09:54.932765 11003 1@util/log/logcrash/crash_reporting.go:140 ⋮ [n1] 767 +  |    /go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/internal/cache/clockpro.go:415
E230125 19:09:54.932765 11003 1@util/log/logcrash/crash_reporting.go:140 ⋮ [n1] 767 +  | github.com/cockroachdb/pebble/internal/cache.(*shard).evict
E230125 19:09:54.932765 11003 1@util/log/logcrash/crash_reporting.go:140 ⋮ [n1] 767 +  |    /go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/internal/cache/clockpro.go:402
E230125 19:09:54.932765 11003 1@util/log/logcrash/crash_reporting.go:140 ⋮ [n1] 767 +  | github.com/cockroachdb/pebble/internal/cache.(*shard).metaAdd
E230125 19:09:54.932765 11003 1@util/log/logcrash/crash_reporting.go:140 ⋮ [n1] 767 +  |    /go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/internal/cache/clockpro.go:280
E230125 19:09:54.932765 11003 1@util/log/logcrash/crash_reporting.go:140 ⋮ [n1] 767 +  | github.com/cockroachdb/pebble/internal/cache.(*shard).Set
E230125 19:09:54.932765 11003 1@util/log/logcrash/crash_reporting.go:140 ⋮ [n1] 767 +  |    /go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/internal/cache/clockpro.go:141
E230125 19:09:54.932765 11003 1@util/log/logcrash/crash_reporting.go:140 ⋮ [n1] 767 +  | github.com/cockroachdb/pebble/internal/cache.(*Cache).Set
E230125 19:09:54.932765 11003 1@util/log/logcrash/crash_reporting.go:140 ⋮ [n1] 767 +  |    /go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/internal/cache/clockpro.go:658
E230125 19:09:54.932765 11003 1@util/log/logcrash/crash_reporting.go:140 ⋮ [n1] 767 +  | github.com/cockroachdb/pebble/sstable.(*Reader).readBlock
E230125 19:09:54.932765 11003 1@util/log/logcrash/crash_reporting.go:140 ⋮ [n1] 767 +  |    /go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/sstable/reader.go:1965
E230125 19:09:54.932765 11003 1@util/log/logcrash/crash_reporting.go:140 ⋮ [n1] 767 +  | github.com/cockroachdb/pebble/sstable.(*singleLevelIterator).loadBlock
E230125 19:09:54.932765 11003 1@util/log/logcrash/crash_reporting.go:140 ⋮ [n1] 767 +  |    /go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/sstable/reader.go:292
E230125 19:09:54.932765 11003 1@util/log/logcrash/crash_reporting.go:140 ⋮ [n1] 767 +  | github.com/cockroachdb/pebble/sstable.(*singleLevelIterator).seekGEHelper
E230125 19:09:54.932765 11003 1@util/log/logcrash/crash_reporting.go:140 ⋮ [n1] 767 +  |    /go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/sstable/reader.go:466
E230125 19:09:54.932765 11003 1@util/log/logcrash/crash_reporting.go:140 ⋮ [n1] 767 +  | github.com/cockroachdb/pebble/sstable.(*singleLevelIterator).SeekGE
E230125 19:09:54.932765 11003 1@util/log/logcrash/crash_reporting.go:140 ⋮ [n1] 767 +  |    /go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/sstable/reader.go:404
E230125 19:09:54.932765 11003 1@util/log/logcrash/crash_reporting.go:140 ⋮ [n1] 767 +  | github.com/cockroachdb/pebble.(*levelIter).SeekGE
E230125 19:09:54.932765 11003 1@util/log/logcrash/crash_reporting.go:140 ⋮ [n1] 767 +  |    /go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/level_iter.go:392
E230125 19:09:54.932765 11003 1@util/log/logcrash/crash_reporting.go:140 ⋮ [n1] 767 +  | github.com/cockroachdb/pebble.(*mergingIter).seekGE
E230125 19:09:54.932765 11003 1@util/log/logcrash/crash_reporting.go:140 ⋮ [n1] 767 +  |    /go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/merging_iter.go:832
E230125 19:09:54.932765 11003 1@util/log/logcrash/crash_reporting.go:140 ⋮ [n1] 767 +  | github.com/cockroachdb/pebble.(*mergingIter).SeekGE
E230125 19:09:54.932765 11003 1@util/log/logcrash/crash_reporting.go:140 ⋮ [n1] 767 +  |    /go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/merging_iter.go:887
E230125 19:09:54.932765 11003 1@util/log/logcrash/crash_reporting.go:140 ⋮ [n1] 767 +  | github.com/cockroachdb/pebble.(*Iterator).SeekGEWithLimit
E230125 19:09:54.932765 11003 1@util/log/logcrash/crash_reporting.go:140 ⋮ [n1] 767 +  |    /go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/iterator.go:644
E230125 19:09:54.932765 11003 1@util/log/logcrash/crash_reporting.go:140 ⋮ [n1] 767 +  | github.com/cockroachdb/pebble.(*Iterator).SeekGE
E230125 19:09:54.932765 11003 1@util/log/logcrash/crash_reporting.go:140 ⋮ [n1] 767 +  |    /go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/iterator.go:598
E230125 19:09:54.932765 11003 1@util/log/logcrash/crash_reporting.go:140 ⋮ [n1] 767 +  | github.com/cockroachdb/cockroach/pkg/storage.(*pebbleIterator).SeekGE
E230125 19:09:54.932765 11003 1@util/log/logcrash/crash_reporting.go:140 ⋮ [n1] 767 +  |    /go/src/github.com/cockroachdb/cockroach/pkg/storage/pebble_iterator.go:265
E230125 19:09:54.932765 11003 1@util/log/logcrash/crash_reporting.go:140 ⋮ [n1] 767 +  | github.com/cockroachdb/cockroach/pkg/storage.(*intentInterleavingIter).SeekGE
E230125 19:09:54.932765 11003 1@util/log/logcrash/crash_reporting.go:140 ⋮ [n1] 767 +  |    /go/src/github.com/cockroachdb/cockroach/pkg/storage/intent_interleaving_iter.go:305
E230125 19:09:54.932765 11003 1@util/log/logcrash/crash_reporting.go:140 ⋮ [n1] 767 +  | github.com/cockroachdb/cockroach/pkg/storage.(*pebbleMVCCScanner).iterSeek
E230125 19:09:54.932765 11003 1@util/log/logcrash/crash_reporting.go:140 ⋮ [n1] 767 +  |    /go/src/github.com/cockroachdb/cockroach/pkg/storage/pebble_mvcc_scanner.go:817
E230125 19:09:54.932765 11003 1@util/log/logcrash/crash_reporting.go:140 ⋮ [n1] 767 +  | github.com/cockroachdb/cockroach/pkg/storage.(*pebbleMVCCScanner).nextKey
E230125 19:09:54.932765 11003 1@util/log/logcrash/crash_reporting.go:140 ⋮ [n1] 767 +  |    /go/src/github.com/cockroachdb/cockroach/pkg/storage/pebble_mvcc_scanner.go:596
E230125 19:09:54.932765 11003 1@util/log/logcrash/crash_reporting.go:140 ⋮ [n1] 767 +  | github.com/cockroachdb/cockroach/pkg/storage.(*pebbleMVCCScanner).advanceKey
E230125 19:09:54.932765 11003 1@util/log/logcrash/crash_reporting.go:140 ⋮ [n1] 767 +  |    /go/src/github.com/cockroachdb/cockroach/pkg/storage/pebble_mvcc_scanner.go:662
E230125 19:09:54.932765 11003 1@util/log/logcrash/crash_reporting.go:140 ⋮ [n1] 767 +  | github.com/cockroachdb/cockroach/pkg/storage.(*pebbleMVCCScanner).addAndAdvance
E230125 19:09:54.932765 11003 1@util/log/logcrash/crash_reporting.go:140 ⋮ [n1] 767 +  |    /go/src/github.com/cockroachdb/cockroach/pkg/storage/pebble_mvcc_scanner.go:717
E230125 19:09:54.932765 11003 1@util/log/logcrash/crash_reporting.go:140 ⋮ [n1] 767 +  | github.com/cockroachdb/cockroach/pkg/storage.(*pebbleMVCCScanner).getAndAdvance
E230125 19:09:54.932765 11003 1@util/log/logcrash/crash_reporting.go:140 ⋮ [n1] 767 +  |    /go/src/github.com/cockroachdb/cockroach/pkg/storage/pebble_mvcc_scanner.go:345
E230125 19:09:54.932765 11003 1@util/log/logcrash/crash_reporting.go:140 ⋮ [n1] 767 +  | github.com/cockroachdb/cockroach/pkg/storage.(*pebbleMVCCScanner).scan
E230125 19:09:54.932765 11003 1@util/log/logcrash/crash_reporting.go:140 ⋮ [n1] 767 +  |    /go/src/github.com/cockroachdb/cockroach/pkg/storage/pebble_mvcc_scanner.go:231
E230125 19:09:54.932765 11003 1@util/log/logcrash/crash_reporting.go:140 ⋮ [n1] 767 +  | github.com/cockroachdb/cockroach/pkg/storage.mvccScanToBytes
E230125 19:09:54.932765 11003 1@util/log/logcrash/crash_reporting.go:140 ⋮ [n1] 767 +  |    /go/src/github.com/cockroachdb/cockroach/pkg/storage/mvcc.go:2449
E230125 19:09:54.932765 11003 1@util/log/logcrash/crash_reporting.go:140 ⋮ [n1] 767 +  | github.com/cockroachdb/cockroach/pkg/storage.MVCCScanToBytes
E230125 19:09:54.932765 11003 1@util/log/logcrash/crash_reporting.go:140 ⋮ [n1] 767 +  |    /go/src/github.com/cockroachdb/cockroach/pkg/storage/mvcc.go:2666
E230125 19:09:54.932765 11003 1@util/log/logcrash/crash_reporting.go:140 ⋮ [n1] 767 +  | github.com/cockroachdb/cockroach/pkg/kv/kvserver/batcheval.Scan
E230125 19:09:54.932765 11003 1@util/log/logcrash/crash_reporting.go:140 ⋮ [n1] 767 +  |    /go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/batcheval/cmd_scan.go:56
E230125 19:09:54.932765 11003 1@util/log/logcrash/crash_reporting.go:140 ⋮ [n1] 767 +  | github.com/cockroachdb/cockroach/pkg/kv/kvserver.evaluateCommand
E230125 19:09:54.932765 11003 1@util/log/logcrash/crash_reporting.go:140 ⋮ [n1] 767 +  |    /go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_evaluate.go:507
E230125 19:09:54.932765 11003 1@util/log/logcrash/crash_reporting.go:140 ⋮ [n1] 767 +Wraps: (2) panic: ‹pebble: inconsistent reference count: -1›
E230125 19:09:54.932765 11003 1@util/log/logcrash/crash_reporting.go:140 ⋮ [n1] 767 +Error types: (1) *withstack.withStack (2) *errutil.leafError
lbguilherme commented 1 year ago

I ran another experiment to extract logs. This is cockroachdb/cockroach:arm64-v22.2.3 running on a Oracle Cloud Kubernetes cluster with a Ampere machine (ARM64):

I created the statefulset on an empty disk and didn't connect to it in any way. After about 1 hour it crashed 6 times. All crashes shows the same panic message.

Here is the container output:

I230126 15:39:40.661517 1 1@cli/log_flags.go:200 ⋮ [n?] 1  using explicit logging configuration:
I230126 15:39:40.661517 1 1@cli/log_flags.go:200 ⋮ [n?] 1 +‹sinks: {stderr: {filter: INFO}}›
W230126 15:39:40.661779 1 1@cli/start.go:1210 ⋮ [n?] 2  ALL SECURITY CONTROLS HAVE BEEN DISABLED!
W230126 15:39:40.661779 1 1@cli/start.go:1210 ⋮ [n?] 2 +
W230126 15:39:40.661779 1 1@cli/start.go:1210 ⋮ [n?] 2 +This mode is intended for non-production testing only.
W230126 15:39:40.661779 1 1@cli/start.go:1210 ⋮ [n?] 2 +
W230126 15:39:40.661779 1 1@cli/start.go:1210 ⋮ [n?] 2 +In this mode:
W230126 15:39:40.661779 1 1@cli/start.go:1210 ⋮ [n?] 2 +- Your cluster is open to any client that can access ‹any of your IP addresses›.
W230126 15:39:40.661779 1 1@cli/start.go:1210 ⋮ [n?] 2 +- Intruders with access to your machine or network can observe client-server traffic.
W230126 15:39:40.661779 1 1@cli/start.go:1210 ⋮ [n?] 2 +- Intruders can log in without password and read or write any data in the cluster.
W230126 15:39:40.661779 1 1@cli/start.go:1210 ⋮ [n?] 2 +- Intruders can consume all your server's resources and cause unavailability.
I230126 15:39:40.661831 1 1@cli/start.go:1220 ⋮ [n?] 3  To start a secure server without mandating TLS for clients,
I230126 15:39:40.661831 1 1@cli/start.go:1220 ⋮ [n?] 3 +consider --accept-sql-without-tls instead. For other options, see:
I230126 15:39:40.661831 1 1@cli/start.go:1220 ⋮ [n?] 3 +
I230126 15:39:40.661831 1 1@cli/start.go:1220 ⋮ [n?] 3 +- https://go.crdb.dev/issue-v/53404/v22.2
I230126 15:39:40.661831 1 1@cli/start.go:1220 ⋮ [n?] 3 +- https://www.cockroachlabs.com/docs/v22.2/secure-a-cluster.html
I230126 15:39:40.662186 1 1@cli/start.go:1249 ⋮ [n?] 4  ‹CockroachDB CCL v22.2.3 (aarch64-unknown-linux-gnu, built 2023/01/23 19:38:24, go1.19.1)›
I230126 15:39:40.663732 1 server/config.go:538 ⋮ [n?] 5  system total memory: 2.0 GiB
I230126 15:39:40.663775 1 server/config.go:540 ⋮ [n?] 6  server configuration:
I230126 15:39:40.663775 1 server/config.go:540 ⋮ [n?] 6 +‹max offset             500000000›
I230126 15:39:40.663775 1 server/config.go:540 ⋮ [n?] 6 +‹cache size             512 MiB›
I230126 15:39:40.663775 1 server/config.go:540 ⋮ [n?] 6 +‹SQL memory pool size   512 MiB›
I230126 15:39:40.663775 1 server/config.go:540 ⋮ [n?] 6 +‹scan interval          10m0s›
I230126 15:39:40.663775 1 server/config.go:540 ⋮ [n?] 6 +‹scan min idle time     10ms›
I230126 15:39:40.663775 1 server/config.go:540 ⋮ [n?] 6 +‹scan max idle time     1s›
I230126 15:39:40.663775 1 server/config.go:540 ⋮ [n?] 6 +‹event log enabled      true›
I230126 15:39:40.663775 1 server/config.go:540 ⋮ [n?] 6 +‹span configs enabled   true›
I230126 15:39:40.663926 1 1@cli/start.go:1114 ⋮ [n?] 7  using local environment variables:
I230126 15:39:40.663926 1 1@cli/start.go:1114 ⋮ [n?] 7 +COCKROACH_CHANNEL=official-docker
I230126 15:39:40.663926 1 1@cli/start.go:1114 ⋮ [n?] 7 +GOMAXPROCS=2
I230126 15:39:40.663926 1 1@cli/start.go:1114 ⋮ [n?] 7 +TERM=‹xterm›
I230126 15:39:40.663998 1 1@cli/start.go:1121 ⋮ [n?] 8  process identity: ‹uid 0 euid 0 gid 0 egid 0›
W230126 15:39:40.664184 1 1@cli/start.go:1270 ⋮ [n?] 9  could not initialize GEOS - spatial functions may not be available: geos: error during GEOS init: geos: cannot load GEOS from dir ‹"/usr/local/lib/cockroach"›: ‹geos error: /usr/local/lib/cockroach/libgeos.so: ELF load command alignment not page-aligned›
I230126 15:39:40.664255 1 1@cli/start.go:574 ⋮ [n?] 10  starting cockroach node
I230126 15:39:40.665490 44 3@pebble/event.go:657 ⋮ [n?,pebble,s?] 11  [JOB 1] flushing: sstable created 000046
I230126 15:39:40.671491 44 3@pebble/event.go:649 ⋮ [n?,pebble,s?] 12  [JOB 1] MANIFEST created 000048
I230126 15:39:40.672664 44 3@pebble/event.go:677 ⋮ [n?,pebble,s?] 13  [JOB 1] WAL created 000047
I230126 15:39:40.677098 44 server/config.go:734 ⋮ [n?] 14  1 storage engine initialized
I230126 15:39:40.677136 44 server/config.go:737 ⋮ [n?] 16  Pebble cache size: 512 MiB
I230126 15:39:40.677153 44 server/config.go:737 ⋮ [n?] 17  store 0: max size 0 B, max open file limit 1043576
I230126 15:39:40.677160 44 server/config.go:737 ⋮ [n?] 18  store 0: {Encrypted:false ReadOnly:false FileStoreProperties:{path=‹/cockroach/cockroach-data›, fs=tmpfs, blkdev=‹tmpfs›, mnt=‹/sys/firmware› opts=‹ro,context="system_u:object_r:container_file_t:s0:c712,c824",relatime›}}
I230126 15:39:40.675924 70 3@pebble/event.go:681 ⋮ [n?,pebble,s?] 15  [JOB 1] WAL deleted 000042
I230126 15:39:40.677311 71 3@pebble/event.go:669 ⋮ [n?,pebble,s?] 19  [JOB 2] all initial table stats loaded
I230126 15:39:40.677482 70 3@pebble/event.go:661 ⋮ [n?,pebble,s?] 20  [JOB 1] sstable deleted 000040
I230126 15:39:40.677532 70 3@pebble/event.go:661 ⋮ [n?,pebble,s?] 21  [JOB 1] sstable deleted 000041
I230126 15:39:40.677570 70 3@pebble/event.go:653 ⋮ [n?,pebble,s?] 22  [JOB 1] MANIFEST deleted 000038
I230126 15:39:40.677627 72 3@pebble/event.go:625 ⋮ [n?,pebble,s?] 23  [JOB 3] compacting(default) L0 [000046] (1011 B) + L6 [000045] (1.0 M)
I230126 15:39:40.677938 72 3@pebble/event.go:657 ⋮ [n?,pebble,s?] 24  [JOB 3] compacting: sstable created 000050
I230126 15:39:40.678373 44 server/tracedumper/tracedumper.go:120 ⋮ [n?] 25  writing job trace dumps to ‹/cockroach/cockroach-data/logs/inflight_trace_dump›
I230126 15:39:40.685159 44 3@pebble/event.go:649 ⋮ [n?,pebble,s?] 26  [JOB 1] MANIFEST created 000001
I230126 15:39:40.688350 44 3@pebble/event.go:677 ⋮ [n?,pebble,s?] 27  [JOB 1] WAL created 000002
I230126 15:39:40.715902 86 3@pebble/event.go:669 ⋮ [n?,pebble,stemp] 28  [JOB 2] all initial table stats loaded
I230126 15:39:40.721850 72 3@pebble/event.go:629 ⋮ [n?,pebble,s?] 29  [JOB 3] compacted(default) L0 [000046] (1011 B) + L6 [000045] (1.0 M) -> L6 [000050] (1.0 M), in 0.0s (0.0s total), output rate 24 M/s

And here is the contents of cockroach-stderr.cockroachdb-0.root.2023-01-26T15_39_40Z.000001.log:

I230126 15:39:40.661097 1 util/log/file_sync_buffer.go:238 ⋮ [config]   file created at: 2023/01/26 15:39:40
I230126 15:39:40.661114 1 util/log/file_sync_buffer.go:238 ⋮ [config]   running on machine: ‹cockroachdb-0›
I230126 15:39:40.661121 1 util/log/file_sync_buffer.go:238 ⋮ [config]   binary: CockroachDB CCL v22.2.3 (aarch64-unknown-linux-gnu, built 2023/01/23 19:38:24, go1.19.1)
I230126 15:39:40.661127 1 util/log/file_sync_buffer.go:238 ⋮ [config]   arguments: [‹/cockroach/cockroach› ‹start-single-node› ‹--log=sinks: {stderr: {filter: INFO}}› ‹--insecure› ‹--advertise-host› ‹cockroachdb-0.cockroachdb-internal.default.svc.cluster.local› ‹--http-addr› ‹0.0.0.0› ‹--cache› ‹512MiB› ‹--max-sql-memory› ‹512MiB›]
I230126 15:39:40.661186 1 util/log/file_sync_buffer.go:238 ⋮ [config]   log format (utf8=✓): crdb-v2
I230126 15:39:40.661190 1 util/log/file_sync_buffer.go:238 ⋮ [config]   line format: [IWEF]yymmdd hh:mm:ss.uuuuuu goid [chan@]file:line redactionmark \[tags\] [counter] msg
I230126 15:39:40.660833 1 util/log/flags.go:211  [-] 1  stderr capture started
panic: pebble: inconsistent reference count: -1

goroutine 72 [running]:
github.com/cockroachdb/pebble/internal/cache.(*refcnt).release(0x0?)
        github.com/cockroachdb/pebble/internal/cache/external/com_github_cockroachdb_pebble/internal/cache/refcnt_normal.go:39 +0xc0
github.com/cockroachdb/pebble/internal/cache.(*Value).release(0xfffcee079100)
        github.com/cockroachdb/pebble/internal/cache/external/com_github_cockroachdb_pebble/internal/cache/value.go:43 +0x2c
github.com/cockroachdb/pebble/internal/cache.(*entry).setValue(0x4002044060?, 0xfffcee5f8300?)
        github.com/cockroachdb/pebble/internal/cache/external/com_github_cockroachdb_pebble/internal/cache/entry.go:141 +0x68
github.com/cockroachdb/pebble/internal/cache.(*shard).metaDel(0x40021d9e70, 0xfffcee5f85a0)
        github.com/cockroachdb/pebble/internal/cache/external/com_github_cockroachdb_pebble/internal/cache/clockpro.go:367 +0x34
github.com/cockroachdb/pebble/internal/cache.(*shard).metaEvict(0x40021d9da0?, 0xfffcee5f85a0)
        github.com/cockroachdb/pebble/internal/cache/external/com_github_cockroachdb_pebble/internal/cache/clockpro.go:470 +0x98
github.com/cockroachdb/pebble/internal/cache.(*shard).EvictFile(0x40021d9e70, 0x400205a0b0?, 0x41d0ba0?)
        github.com/cockroachdb/pebble/internal/cache/external/com_github_cockroachdb_pebble/internal/cache/clockpro.go:260 +0xbc
github.com/cockroachdb/pebble/internal/cache.(*Cache).EvictFile(...)
        github.com/cockroachdb/pebble/internal/cache/external/com_github_cockroachdb_pebble/internal/cache/clockpro.go:764
github.com/cockroachdb/pebble.(*tableCacheShard).evict(0x400205a0b0, 0x2d, 0x40008cc108, 0x0)
        github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/table_cache.go:786 +0x268
github.com/cockroachdb/pebble.(*tableCacheContainer).evict(0x41d0ba0?, 0x4006d59098?)
        github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/table_cache.go:152 +0x40
github.com/cockroachdb/pebble.(*DB).doDeleteObsoleteFiles(0x400185ad80, 0x3)
        github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/compaction.go:3096 +0x7bc
github.com/cockroachdb/pebble.(*DB).deleteObsoleteFiles(0x400185ad80, 0x4000b34140?, 0x7?)
        github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/compaction.go:2995 +0x38
github.com/cockroachdb/pebble.(*DB).compact1(0x400185ad80, 0x40004dc000, 0x0)
        github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/compaction.go:2220 +0x540
github.com/cockroachdb/pebble.(*DB).compact.func1({0x5a784e0, 0x40008fc9f0})
        github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/compaction.go:2151 +0xd8
runtime/pprof.Do({0x5a78470?, 0x400007a068?}, {{0x4000228e20?, 0xbeb7e328f06fc75f?, 0xfe3c3d8b7066e0df?}}, 0x400155ff78)
        GOROOT/src/runtime/pprof/runtime.go:40 +0x80
github.com/cockroachdb/pebble.(*DB).compact(0xe4602e7cc636f13b?, 0x76ce7b6718776d3?, 0xc005b416a3926f1b?)
        github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/compaction.go:2148 +0x7c
created by github.com/cockroachdb/pebble.(*DB).maybeScheduleCompactionPicker
        github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/compaction.go:1869 +0x490
sumeerbhola commented 1 year ago

I spent a while reading through the code again. These occurrences (at least the ones on v22.2.3) are different from the one in the original issue -- the new ones are occurring when a file has become obsolete and we are removing all the block cache entries for it. Though the root cause could be the same. There is ref counting happening at multiple levels and we have fairly liberal invariant checking (I did not try to look through all the invariant checks to see if we could add more).

Even if some of the higher-level ref counts go wrong, or we try to obsolete a file multiple times, or we somehow make the shard.files map be stale (has some entries that are not in the shard.blocks map), we eventually have to fiddle with the Value.refcnt via Value.Release. There are only two consequential callers of this (the third one is an error path where the Value has been allocated but not inserted into the cache yet):

sumeerbhola commented 1 year ago

Could not find any problematic caller of Handle.Release. The one case which required a more careful look was https://github.com/cockroachdb/pebble/blob/crl-release-22.2/sstable/reader.go#L2907 because we are overwriting the value of b later in the function https://github.com/cockroachdb/pebble/blob/crl-release-22.2/sstable/reader.go#L2932, so we will call b.Release twice. But since Release has a value receiver and we change the value of b, it will get called with the different BlockHandles.

nicktrav commented 1 year ago

I suspect a path forward here would be to come up with a minimal reproducer on our end that we can dissect. We don't currently test against Oracle Linux nor is Arm tested in Pebble (we do in Cockroach), so we have a blind spot here.

Adding an ARM variant to our Pebble tests would be nice. However, given we use GitHub Actions, I think progress here may be limited for now, as they do not currently support ARM (see this discussion).

@lbguilherme - I see you mentioned K8s. Have you observed this failure mode outside of a K8s setup? Would you mind attempting to reproduce this without the K8s bells and whistles, so we can rule that out (or narrow it down to K8s)? The same start-single-node ... command you are using is good.

And if you could comment on the filesystem setup, that would be useful information to have too. i.e. "physical" device you're using (cloud hosted persistent disks, local SSD, etc.), filesystem you're using (we support ext4 and xfs), any additional information you have for how you're configuring your storage, etc.

lbguilherme commented 1 year ago

I gave a try without K8S: A virtual machine on Oracle Cloud running Oracle Linux on Ampere CPU. Using cockroach-v22.2.3.linux-arm64 downloaded from the official releases page. I got the same crash after about 1 hour. Same story: I just ran the single-node command and didn't connect to it, starting from an empty storage. The stack trace is the same as before. The file system is the boot volume, using XFS and running from a network block storage, not a local disk (the standard configuration on Oracle Cloud). This was a clean VM created just for this test.

nicktrav commented 1 year ago

I gave a try without K8S: A virtual machine on Oracle Cloud running Oracle Linux on Ampere CPU. Using cockroach-v22.2.3.linux-arm64 downloaded from the official releases page.

The file system is the boot volume, using XFS and running from a network block storage, not a local disk (the standard configuration on Oracle Cloud). This was a clean VM created just for this test.

Thanks for confirming. This helps narrow it down.

nicktrav commented 1 year ago

I have been able to reproduce this on an ARM VM running Oracle Linux 8 with an attached XFS volume. This was on Oracle Cloud. I see a panic identical to the one reported. It took around 30 mins. I also had a KV95 workload running against it.

I'm going to try and narrow this down a little further. Ideally I'd be able to reproduce this issue locally.

nicktrav commented 1 year ago

I've tried various combinations of processor (aarch64 vs amd64), operating system (Oracle Linux 7, 8 and 9, and Ubuntu Focal) and filesystem (ext4 vs xfs).

The combinations I can get to reliably fail are when using an ARM processor and Oracle Linux 7.x and 8.x (I tested a few of the available options - Oracle-Linux-7.9-aarch64-2022.12.15-0, Oracle-Linux-8.6-aarch64-2022.12.15-0 and Oracle-Linux-8.6-aarch64-2022.12.06-0). The filesystem doesn't seem to make a difference - it fails when I use the XFS boot volume, or an attached volume formatted as XFS or EXT4.

This probably isn't the answer you're looking for, but would it be possible for you to try out a different operating system? Either Oracle Linux 9, or perhaps Ubuntu? That said, I took a quick look at Oracle's K8s offering, and it look like the only OS available for the nodes is Oracle Linux 7 or 8.

I'm going to continue poking at this, mainly via the Pebble test suite on the operating systems where this is crashing.

lbguilherme commented 1 year ago

This probably isn't the answer you're looking for, but would it be possible for you to try out a different operating system? Either Oracle Linux 9, or perhaps Ubuntu? That said, I took a quick look at Oracle's K8s offering, and it look like the only OS available for the nodes is Oracle Linux 7 or 8.

Yeah, we will figure out something in the mean time and host the database outside our cluster.

As a side note, there is also another issue that seems specific to Oracle Linux. Maybe it is related, maybe not:

could not initialize GEOS - spatial functions may not be available: geos: error during GEOS init: geos: cannot load GEOS from dir ‹"/usr/local/lib/cockroach"›: ‹geos error: /usr/local/lib/cockroach/libgeos.so: ELF load command alignment not page-aligned›

Oracle Linux use a page size of 64 kB by default, instead of the usual 4 kB.

nicktrav commented 1 year ago

That seems separate to this issue. I'd ask that you create a new issue for that one, with reproduction steps, etc.

nicktrav commented 1 year ago

It's not clear what the path forward here is other than using a distro other than Oracle Linux 7/8. Given this works on a newer version of Oracle Linux, and other Linux distros (we test on Ubuntu), I'm not sure there's much more we can realistically do here.

I'm going to close this for now. If you're still hitting the issue on Oracle Linux 9, please file a new issue.

For the GEOS issue, if that's still occurring, as already mentioned, please open a new issue for that specifically.

jtolio commented 1 year ago

Is it possible to re-open this? I'm reliably and consistently getting this on cockroach 22.2.5 linux-arm64 (I'm on an Asahi Linux kernel (6.1.0-asahi) on a Mac M1 mini, so perhaps I'm not a high value target).

panic: pebble: inconsistent reference count: -1

goroutine 36552 [running]:
github.com/cockroachdb/pebble/internal/cache.(*refcnt).release(0x0?)
        github.com/cockroachdb/pebble/internal/cache/external/com_github_cockroachdb_pebble/internal/cache/refcnt_normal.go:39 +0xc0
github.com/cockroachdb/pebble/internal/cache.(*Value).release(0xffff6e971000)
        github.com/cockroachdb/pebble/internal/cache/external/com_github_cockroachdb_pebble/internal/cache/value.go:43 +0x2c
github.com/cockroachdb/pebble/internal/cache.(*entry).setValue(0x4009835980?, 0xffff6ecadcc0?)
        github.com/cockroachdb/pebble/internal/cache/external/com_github_cockroachdb_pebble/internal/cache/entry.go:141 +0x68
github.com/cockroachdb/pebble/internal/cache.(*shard).metaDel(0x4000e69d50, 0xffff6ecae1a0)
        github.com/cockroachdb/pebble/internal/cache/external/com_github_cockroachdb_pebble/internal/cache/clockpro.go:367 +0x34
github.com/cockroachdb/pebble/internal/cache.(*shard).metaEvict(0x4000e69c80?, 0xffff6ecae1a0)
        github.com/cockroachdb/pebble/internal/cache/external/com_github_cockroachdb_pebble/internal/cache/clockpro.go:470 +0x98
github.com/cockroachdb/pebble/internal/cache.(*shard).EvictFile(0x4000e69d50, 0x40025ff8c0?, 0x41de300?)
        github.com/cockroachdb/pebble/internal/cache/external/com_github_cockroachdb_pebble/internal/cache/clockpro.go:260 +0xbc
github.com/cockroachdb/pebble/internal/cache.(*Cache).EvictFile(...)
        github.com/cockroachdb/pebble/internal/cache/external/com_github_cockroachdb_pebble/internal/cache/clockpro.go:764
github.com/cockroachdb/pebble.(*tableCacheShard).evict(0x40025ff8c0, 0xc, 0x400016af88, 0x0)
        github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/table_cache.go:786 +0x268
github.com/cockroachdb/pebble.(*tableCacheContainer).evict(0x41de300?, 0x4014aa48e8?)
        github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/table_cache.go:152 +0x40
github.com/cockroachdb/pebble.(*DB).doDeleteObsoleteFiles(0x4001b4ed80, 0xf)
        github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/compaction.go:3096 +0x7bc
github.com/cockroachdb/pebble.(*DB).deleteObsoleteFiles(0x4001b4ed80, 0x4009251470?, 0x7?)
        github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/compaction.go:2995 +0x38
github.com/cockroachdb/pebble.(*DB).compact1(0x4001b4ed80, 0x400a2d2400, 0x0)
        github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/compaction.go:2220 +0x540
github.com/cockroachdb/pebble.(*DB).compact.func1({0x5a8d1a0, 0x40116cf440})
        github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/compaction.go:2151 +0xd8
runtime/pprof.Do({0x5a8d130?, 0x4000074130?}, {{0x4000158bc0?, 0x4003a84e00?, 0xf73e00?}}, 0x4002f5ff78)
        GOROOT/src/runtime/pprof/runtime.go:40 +0x80
github.com/cockroachdb/pebble.(*DB).compact(0x0?, 0x4002f5ff88?, 0x4002f5ff58?)
        github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/compaction.go:2148 +0x7c
created by github.com/cockroachdb/pebble.(*DB).maybeScheduleCompactionPicker
        github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/compaction.go:1869 +0x490
nicktrav commented 1 year ago

@jtolio - thanks for the heads up. We can take another look. This seems like it's an ARM thing, which we do support.

Mind providing the set of commands you're running to hit this?

jbowens commented 1 year ago

I created a t2a-standard-4 instance on GCP, cross-compiled for arm64 and ran:

./pebble-arm64 bench ycsb data --workload=read=100 --concurrency=1 --values=64 --initial-keys 10000000 --cache 4294967296 --num-ops=1
./pebble-arm64 bench ycsb data --workload=A --concurrency=256 --values=64 --keys=zipf --prepopulated-keys 10000000 --cache 4294967296 --duration 1h
dir data

Will report back if this surfaces anything.

itsbilal commented 1 year ago

I tried the same as what @jbowens did, except on AWS and with a m7g.2xlarge running Ubuntu. Couldn't hit this after a whole day of running ycsb. This is looking very Linux distro-specific rather than Kernel or arch-related? Also some of the Sentry failures seem to be from amd64, such as this one from #104236:

https://cockroach-labs.sentry.io/issues/4221901270/?project=164528&referrer=webhooks_plugin

Possibly also explainable with non-ECC RAM.

jtolio commented 1 year ago

Oops, sorry for disappearing. So, I'm on Asahi Linux kernel (on a Mac M1), which has a 16K page size if that matters.

./cockroach start-single-node --insecure --listen-addr=localhost --store=type=mem,size=1GiB --max-sql-memory=1GiB

I work for Storj, and running Storj unit tests (and/or just waiting for a bit) reliably kills it.

git clone git@github.com:storj/storj.git
cd storj
go test ./... --cockroach-test-db=cockroach://root@localhost:26527/defaultdb?sslmode=disable
RaduBerinde commented 1 year ago

Here is one hypothesis. In the code below, we initialize the value using a simple assignment, instead of an atomic Store. We initialize it to 1 in all code paths. It might be possible that on M1 this assignment won't always be "seen" by the atomic operations. In general it's a bad idea to mix atomic and non-atomic operations (one of multiple reasons we should be using wrappers like atomic.Int32).

func (v *refcnt) init(val int32) {
    *v = refcnt(val)
}

func (v *refcnt) refs() int32 {
    return atomic.LoadInt32((*int32)(v))
}

func (v *refcnt) acquire() {
    switch v := atomic.AddInt32((*int32)(v), 1); {
    case v <= 1:
        panic(fmt.Sprintf("pebble: inconsistent reference count: %d", v))
    }
}

@jtolio if I prepare a custom branch of Cockroach, will you be able to build it and test it?

RaduBerinde commented 1 year ago

@jtolio I merged a potential fix to cockroach in #105086. Would you be able to try to repro against the current master?

RaduBerinde commented 1 year ago

You can also download this nightly build: https://storage.googleapis.com/cockroach-builds-artifacts-prod/cockroach-v23.1.0-alpha.7-3218-g795dd78a4c0.linux-arm64.tgz

jtolio commented 1 year ago

Ooops, sorry, been away from the computer. Once I return to my workstation this next week I will totally build and test. I'll try to build without the patch, confirm it's broken, build with just the patch cherry-picked, see what happens, etc., to make sure there isn't some confounding factor.

RaduBerinde commented 1 year ago

@jtolio friendly ping - it would be very useful to see if/which change fixes this (so we know if we should backport). We have not been able to reproduce the panic from our end.

jtolio commented 1 year ago

I got a little stymied between bazel and make when I tried last and then I got distracted (for weeks). It appears that https://cockroachlabs.atlassian.net/wiki/spaces/CRDB/pages/2221703221/Developing+with+Bazel is now public (it wasn't when I tried), so I'll give that a go again.

RaduBerinde commented 1 year ago

@jtolio You can just download this nightly build: https://storage.googleapis.com/cockroach-builds-artifacts-prod/cockroach-v23.1.0-alpha.7-3218-g795dd78a4c0.linux-arm64.tgz

jtolio commented 1 year ago

That fails with:

<jemalloc>: Unsupported system page size
<jemalloc>: Unsupported system page size
<jemalloc>: Unsupported system page size
runtime/cgo: malloc failed: Cannot allocate memory
RaduBerinde commented 1 year ago

Huh.. That might be due to a jemalloc upgrade in one of the recent versions.

So you probably can't run anything from master.. I wonder what versions you can actually run (22.2.x, 23.1.x).

jtolio commented 1 year ago

Sure, my issue has been with a copy of v22.2.5, so I checked out that tag and that's what I'm trying to build with your change (and the other change mentioned in this issue) applied.

jtolio commented 1 year ago

lol, I'm so sorry I haven't been more helpful on this issue yet.

I got tag v22.2.5 to compile locally with bazel (required some undocumented bazelisk setup, libresolv-wrapper, --force_build_cdeps, and other build system hazing). However, I can't get my build of v22.2.5 to fail. My downloaded version of precompiled v22.2.5 fails regularly and repeatedly, but building the tag (./dev build short; bazel build //pkg/cmd/cockroach-short:cockroach-short --force_build_cdeps) results in a binary that reliably works for me lol.

So I don't know what to do about that. Could it be that it's a short build? Could it be that I didn't use the cross compilation infrastructure? I don't know. I didn't even get to cherry picking anything.

RaduBerinde commented 1 year ago

Very strange.. I would try with a full binary.

@rickystewart Should building locally with bazel guarantee that the resulting binary is the same as what our build system would produce? Or does it still depend on some stuff in the local environment?

jtolio commented 1 year ago

Full binary also works fine, though perhaps my problem is being caused by --force_build_cdeps, which as I understand this build system, builds the C dependencies without PIC. I guess I'll try to get without force_build_cdeps (I have this issue also: https://github.com/cockroachdb/cockroach/issues/80389)

jtolio commented 1 year ago

Ha okay, I don't know. https://github.com/cockroachdb/cockroach/pull/80594 (which evidently fixes #80389) is already in v22.2.5. Happy to pair with someone if this is felt to be important to debug.

RaduBerinde commented 1 year ago

I will backport my potential fix to 23.1.x and 22.2.x and we can wait for the next release and you can try that binary.

RaduBerinde commented 1 year ago

The jemalloc issue on 23.x is now tracked in #106745.

jbowens commented 1 year ago

Recent failures on refcnt.release have all been amd64, not arm: https://cockroach-labs.sentry.io/issues/4365765342/?query=is%3Aunresolved+%28%2Arefcnt%29.release&referrer=issue-stream&statsPeriod=14d&stream_index=0

We've continued to see some refcnt.acquire reports on arm64: https://cockroach-labs.sentry.io/issues/4109913192/?query=is%3Aunresolved+%28%2AValue%29.acquire&referrer=issue-stream&statsPeriod=14d&stream_index=0

Relatedly, there are reports of nil pointers in unlink, link:

The vast majority (>99%) of these reports are from within <1m of node start, supporting the conclusion that there is an environmental issue resulting in fast failure.

There is still no 22.2 patch release that contains either of these related fixes:

And no 23.1 patch release that contains https://github.com/cockroachdb/cockroach/commit/0e22c6103ae4d6792acfb53cab5d432a99882d76 (but af0c53ad0f7dc1aec5945f13697ffc18b076b30c is available in the most recent patch).

Given the distribution of uptime across events, I expect a user with an incompatible environment to observe it promptly. We have never observed this issue across all our nightly cloud testing. I believe there's little generalized risk to arm deployments that do not immediately encounter this issue.