Closed daeMOn63 closed 2 years ago
I can replicate. Looks like it's coming from the ORM, something to do with an iterator. Will investigate further 🕵️♂️
Looks pretty concerning. This needs to be fixed for 0.46
Findings so far:
TL;DR: I don't know what's going on yet, but the call ends up in cachekv's `dirtyItems`, which calls `clearUnsortedCacheSubset`. That one calls `err := store.sortedCache.Set(item.Key, item.Value)` and hangs while trying to acquire a mutex lock with `db.mtx.Lock()`:
https://github.com/tendermint/tm-db/blob/b2c83b9876d6fd85d26d0d1e5b8a2e17e0d00f82/memdb.go#L93-L101
In more detail: `it.LoadNext` doesn't return.
https://github.com/cosmos/cosmos-sdk/blob/7feae9c4dd0fecceab348ace054561dd0638cf6b/x/group/internal/orm/iterator.go#L178-L181
That calls `NewTypeSafeRowGetter`, which itself calls `pStore.Iterator`, and this one doesn't return (`pStore` is a `store/prefix.Store`; its parent looks like a gas KVStore).
In `prefix.Store`, `iter := s.parent.Iterator(newstart, newend)` calls gaskv's `Iterator()`.
gaskv's `Iterator()` calls `iterator()`, which in turn calls `parent = gs.parent.Iterator(start, end)` (`gs.parent` is the cachekv store).
(I skipped a bunch of places.)
It ends up calling cachekv's `dirtyItems`, which calls `clearUnsortedCacheSubset`. This one calls `err := store.sortedCache.Set(item.Key, item.Value)` and hangs while trying to acquire a mutex lock with `db.mtx.Lock()`:
https://github.com/tendermint/tm-db/blob/b2c83b9876d6fd85d26d0d1e5b8a2e17e0d00f82/memdb.go#L93-L101
EDIT 1: ~Up to 66 members, i.rowGetter gets called many times (probably once per member), 67 members and rowGetter gets called only once right before freezing.~
Not sure if it's relevant, but if you set this key (which is the one for member number 67), then it fails even if you set the member count to something lower:
explodekey, _ := hex.DecodeString("1100000000000000020000000000000002a58856f0fd53bf058b4909a21aec019107ba6166")
db.mtx.Lock()
db.set(explodekey, []byte{})
db.mtx.Unlock()
panic: test timed out after 10s
goroutine 12 [semacquire]:
/home/user/go/pkg/mod/github.com/tendermint/tm-db@v0.6.6/memdb.go:100 +0x85
// Set implements DB.
func (db *MemDB) Set(key []byte, value []byte) error {
if len(key) == 0 {
return errKeyEmpty
}
if value == nil {
return errValueNil
}
db.mtx.Lock() // <---------
defer db.mtx.Unlock()
db.set(key, value)
return nil
}
goroutine 3302 [select]:
/home/user/go/pkg/mod/github.com/tendermint/tm-db@v0.6.6/memdb_iterator.go:67 +0x18e
func newMemDBIteratorMtxChoice(db *MemDB, start []byte, end []byte, reverse bool, useMtx bool) *memDBIterator {
ctx, cancel := context.WithCancel(context.Background())
ch := make(chan *item, chBufferSize)
iter := &memDBIterator{
ch: ch,
cancel: cancel,
start: start,
end: end,
useMtx: useMtx,
}
if useMtx {
db.mtx.RLock()
}
go func() {
if useMtx {
defer db.mtx.RUnlock()
}
// Because we use [start, end) for reverse ranges, while btree uses (start, end], we need
// the following variables to handle some reverse iteration conditions ourselves.
var (
skipEqual []byte
abortLessThan []byte
)
visitor := func(i btree.Item) bool {
item := i.(item)
if skipEqual != nil && bytes.Equal(item.key, skipEqual) {
skipEqual = nil
return true
}
if abortLessThan != nil && bytes.Compare(item.key, abortLessThan) == -1 {
return false
}
select { // <--------------
case <-ctx.Done():
return false
case ch <- &item:
return true
}
}
The immediate symptom is that `ch` becomes full, making the `select` statement in `newMemDBIteratorMtxChoice` block while holding a read lock on `db.mtx`, which prevents `Set` from acquiring a write lock. But the actual problems are in `newMemDBIteratorMtxChoice` itself. First and foremost, it's nonsensical for a method to optionally take a [read] lock on a mutex -- if synchronization is required, it's always required. Additionally, the `select` statement occurs under the DB's mutex lock, and will block indefinitely if callers are slow or broken -- something that can of course always occur, and which the method must defend against.
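For illustration only, here's a generic sketch (not tm-db code, and not a proposed patch) of one way a method like this could defend itself: copy the matching items while holding the read lock, release the lock, and only then feed the channel, so a slow or abandoned consumer can no longer pin the lock:

```go
package main

import (
	"fmt"
	"sync"
)

type entry struct{ key, value string }

// toyStore is a stand-in for MemDB, just enough to show the pattern.
type toyStore struct {
	mtx  sync.RWMutex
	data []entry
}

func (s *toyStore) Set(key, value string) {
	s.mtx.Lock()
	defer s.mtx.Unlock()
	s.data = append(s.data, entry{key, value})
}

// Iterate snapshots the data under the read lock, releases the lock, and only
// then streams the snapshot to the caller. A slow or abandoned consumer can
// still block the channel sends, but it can no longer block writers.
func (s *toyStore) Iterate() <-chan entry {
	s.mtx.RLock()
	snapshot := make([]entry, len(s.data))
	copy(snapshot, s.data)
	s.mtx.RUnlock() // no lock is held past this point

	ch := make(chan entry) // unbuffered on purpose: every send blocks until read
	go func() {
		defer close(ch)
		for _, e := range snapshot {
			ch <- e
		}
	}()
	return ch
}

func main() {
	s := &toyStore{}
	for i := 0; i < 100; i++ {
		s.Set(fmt.Sprintf("key-%03d", i), "value")
	}

	_ = s.Iterate()                // abandoned consumer, never drained
	s.Set("one-more-key", "value") // still succeeds: no read lock is held
	fmt.Println("no deadlock, entries:", len(s.data))
}
```

The tradeoff is the memory cost of the snapshot and the loss of "live" iteration semantics; an abandoned iterator still leaks its goroutine, but it no longer blocks writers.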
> First and foremost, it's nonsensical for a method to optionally take a [read] lock on a mutex -- if synchronization is required, it's always required.

Yes, I totally agree here. Very poor API design in tm-db.
It seems we're blocked here @facundomedica on redesigning memdb or at least addressing this concern. One thing I'm curious about is if we can prevent writes while we iterate?
> One thing I'm curious about is if we can prevent writes while we iterate?
That's exactly what the RLock does. The problem is that whoever is iterating doesn't finish, meaning the RLock is never RUnlocked, meaning writes are prevented forever.
What I meant was to explicitly avoid iterating while we write in the ORM layer, rather than depending on the locking mechanism in the memdb implementation.
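Concretely, something along these lines (just a sketch against the generic `KVStore` interface, not the actual ORM code, and the helper names are made up): drain and close the iterator first, then do the writes, so the two never overlap:

```go
package orm // illustrative only

import (
	storetypes "github.com/cosmos/cosmos-sdk/store/types"
)

// collectKeys drains the iterator into memory and closes it before returning,
// so no iterator is open anymore by the time the caller starts writing.
func collectKeys(store storetypes.KVStore, start, end []byte) [][]byte {
	it := store.Iterator(start, end)
	defer it.Close()

	var keys [][]byte
	for ; it.Valid(); it.Next() {
		// copy the key: iterators may reuse their buffers
		keys = append(keys, append([]byte(nil), it.Key()...))
	}
	return keys
}

// touchAll shows the ordering: all reads finish (and the iterator is closed)
// before the first Set happens.
func touchAll(store storetypes.KVStore, start, end, value []byte) {
	for _, key := range collectKeys(store, start, end) {
		store.Set(key, value)
	}
}
```

Whether that's feasible for the group ORM's pagination path is a separate question; this only illustrates the ordering I have in mind.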
Does group have an open iterator while writing? That does violate the iterator contract. But from looking at the code I don't see that happening.

> Does group have an open iterator while writing? That does violate the iterator contract. But from looking at the code I don't see that happening.
You have the stack trace here:
goroutine 12 [semacquire]:
sync.runtime_SemacquireMutex(0x203000?, 0x0?, 0xc000f157c8?)
/usr/lib/go/src/runtime/sema.go:71 +0x25
sync.(*RWMutex).Lock(0x0?)
/usr/lib/go/src/sync/rwmutex.go:144 +0x71
github.com/tendermint/tm-db.(*MemDB).Set(0xc000b4fbe0, {0xc000ac9bc0, 0x1e, 0x20}, {0xc00084d3b0, 0x44, 0x44})
/home/user/go/pkg/mod/github.com/tendermint/tm-db@v0.6.6/memdb.go:100 +0x85
github.com/cosmos/cosmos-sdk/store/cachekv.(*Store).clearUnsortedCacheSubset(0xc0009f2700, {0xc000cabbb0?, 0x2, 0x2}, 0x1)
/home/user/workspace/cosmos/cosmos-sdk/store/cachekv/store.go:359 +0x165
github.com/cosmos/cosmos-sdk/store/cachekv.(*Store).dirtyItems(0xc0009f2700, {0xc000ac9b40, 0x1e, 0x1e}, {0xc000ac9ba0, 0x1e, 0x1e})
/home/user/workspace/cosmos/cosmos-sdk/store/cachekv/store.go:331 +0x52b
github.com/cosmos/cosmos-sdk/store/cachekv.(*Store).iterator(0xc0009f2700, {0xc000ac9b40, 0x1e, 0x1e}, {0xc000ac9ba0, 0x1e, 0x1e}, 0x1)
/home/user/workspace/cosmos/cosmos-sdk/store/cachekv/store.go:190 +0x17e
github.com/cosmos/cosmos-sdk/store/cachekv.(*Store).Iterator(0xc000e0ac40?, {0xc000ac9b40?, 0xc000f15b28?, 0x0?}, {0xc000ac9ba0?, 0x1?, 0x40f645?})
/home/user/workspace/cosmos/cosmos-sdk/store/cachekv/store.go:170 +0x29
github.com/cosmos/cosmos-sdk/store/gaskv.(*Store).iterator(0xc0004143c0, {0xc000ac9b40?, 0x0?, 0xc000f15b00?}, {0xc000ac9ba0?, 0x2?, 0x1?}, 0xa0?)
/home/user/workspace/cosmos/cosmos-sdk/store/gaskv/store.go:103 +0x56
github.com/cosmos/cosmos-sdk/store/gaskv.(*Store).Iterator(0xc000420800?, {0xc000ac9b40?, 0x25?, 0xc0006172cc?}, {0xc000ac9ba0?, 0x40f645?, 0xc0003d09f0?})
/home/user/workspace/cosmos/cosmos-sdk/store/gaskv/store.go:74 +0x29
github.com/cosmos/cosmos-sdk/store/prefix.Store.Iterator({{0x2966738, 0xc0004143c0}, {0xc0006172cc, 0x2, 0x2}}, {0xc0003d09c9, 0x1c, 0x27}, {0xc000ac99e0, 0x1c, ...})
/home/user/workspace/cosmos/cosmos-sdk/store/prefix/store.go:101 +0x239
github.com/cosmos/cosmos-sdk/x/group/internal/orm.NewTypeSafeRowGetter.func1({0x2966738, 0xc0004143c0}, {0xc0003d09c9, 0x1c, 0x27}, {0x29643f8, 0xc000cabb90})
/home/user/workspace/cosmos/cosmos-sdk/x/group/internal/orm/types.go:116 +0x2f4
github.com/cosmos/cosmos-sdk/x/group/internal/orm.indexIterator.LoadNext({{0x2966738, 0xc0004143c0}, 0xc0009fb1a0, {0x2960fc0, 0xc0004145a0}, {0x1c0a0c0, 0x388eb20}}, {0x29643f8, 0xc000cabb90})
/home/user/workspace/cosmos/cosmos-sdk/x/group/internal/orm/index.go:253 +0x106
github.com/cosmos/cosmos-sdk/x/group/internal/orm.Paginate({0x2947948?, 0xc000e0ac40}, 0x295a2b0?, {0x1bb2580?, 0xc00080b8f0?})
/home/user/workspace/cosmos/cosmos-sdk/x/group/internal/orm/iterator.go:180 +0x42a
github.com/cosmos/cosmos-sdk/x/group/keeper.Keeper.GroupMembers({{0x2945a08, 0xc0009ee170}, {0x295a2b0, 0xc000025900}, {0xc000239200, {0x1}}, {0x295df20, 0xc0009f31c0}, {0xc000239260}, {0x295df20, ...}, ...}, ...)
/home/user/workspace/cosmos/cosmos-sdk/x/group/keeper/grpc_query.go:63 +0x31f
github.com/cosmos/cosmos-sdk/x/group/keeper_test.(*TestSuite).createGroupAndGetMembers(0xc000580c00, 0x43)
/home/user/workspace/cosmos/cosmos-sdk/x/group/keeper/keeper_test.go:117 +0x5a5
github.com/cosmos/cosmos-sdk/x/group/keeper_test.(*TestSuite).TestCreateGroupWithLotsOfMembers(0xc000580c00)
/home/user/workspace/cosmos/cosmos-sdk/x/group/keeper/keeper_test.go:95 +0x97
Basically, `orm/iterator` eventually calls `store/cachekv.Store.dirtyItems`, which calls `clearUnsortedCacheSubset`, which invokes a `Set` on the `sortedCache` field, which is a `*dbm.MemDB`.
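To make the failure mode concrete, here is a minimal standalone sketch of the same hang at the tm-db level (assuming tm-db v0.6.6, where the iterator's producer goroutine buffers items in a channel -- 64 of them, if I read `chBufferSize` right -- and holds `db.mtx.RLock()` until it has sent them all); the key count and names are arbitrary:

```go
package main

import (
	"fmt"

	dbm "github.com/tendermint/tm-db"
)

func main() {
	db := dbm.NewMemDB()

	// Write more entries than the iterator's buffered channel can hold
	// (chBufferSize in memdb_iterator.go).
	for i := 0; i < 70; i++ {
		if err := db.Set([]byte(fmt.Sprintf("key-%03d", i)), []byte("value")); err != nil {
			panic(err)
		}
	}

	// Open an iterator but never drain or close it. Its producer goroutine
	// fills the channel, then blocks on the send while still holding
	// db.mtx.RLock (see newMemDBIteratorMtxChoice above).
	it, err := db.Iterator(nil, nil)
	if err != nil {
		panic(err)
	}
	_ = it // calling it.Close() here would release the read lock and avoid the hang

	// This Set never returns: it waits for the write lock that the
	// abandoned iterator goroutine still prevents.
	fmt.Println("calling Set...")
	_ = db.Set([]byte("one-more-key"), []byte("value"))
	fmt.Println("never reached")
}
```

Running this prints "calling Set..." and then hangs, which would also explain the 66-vs-67-member threshold observed above (assuming the number of keys handled by the iterator crosses the channel buffer size around there).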
Summary of Bug
When testing creating groups and querying for their members, a weird deadlock appears as soon as the group has 67 members or more. I've validated that it's fine with any number of members up to 66, and anything above this ends with a timeout.

Version
master @ ece3d0e9fe1da69966e32391931904607fca127f

Steps to Reproduce
The following test case illustrates and reproduces the issue (placed in `x/group/keeper/keeper_test.go`).

This produces the following output, where we can see that the group with 67 members gets created successfully, but querying its members then fails. Note that the loop used in the above test does not matter; a single call to `s.createGroupAndGetMembers(67)` fails in the exact same way.

For Admin Use