warner commented 3 years ago

The phase3 testnet halted this morning. In my monitoring node, the last thing I observe is a consensus failure:

7:00AM ERR CONSENSUS FAILURE!!! err="Error: MDB_MAP_FULL: Environment mapsize limit reached" module=consensus stack="goroutine 1332621 [running]:
runtime/debug.Stack(0xc003273948, 0x7fbd83111540, 0xc00b434450)
  /usr/local/go/src/runtime/debug/stack.go:24 +0x9f
github.com/tendermint/tendermint/consensus.(*State).receiveRoutine.func2(0xc000bd8000, 0x7fbd8336d5b0)
  /home/warner/go/pkg/mod/github.com/tendermint/tendermint@v0.34.10/consensus/state.go:726 +0x5e
panic(0x7fbd83111540, 0xc00b434450)
  /usr/local/go/src/runtime/panic.go:969 +0x1b9
github.com/Agoric/agoric-sdk/golang/cosmos/x/swingset.EndBlock(0x7fbd833e1e20, 0xc0000460a8, 0x7fbd833f76c0, 0xc00700b780, 0xb, 0x0, 0xc006a12910, 0xd, 0xd529, 0x8bde991, ...)
  /home/warner/stuff/agoric/phase3/agoric-sdk/golang/cosmos/x/swingset/abci.go:79 +0x5a5
github.com/Agoric/agoric-sdk/golang/cosmos/x/swingset.AppModule.EndBlock(0x0, 0x0, 0x7fbd833c1fa0, 0xc000ff9d60, 0x7fbd833fd0e0, 0xc000ff8810, 0x7fbd833c1fa0, 0xc000ff9ca0, 0x7fbd833f7cc0, 0xc000ff8810, ...)
  /home/warner/stuff/agoric/phase3/agoric-sdk/golang/cosmos/x/swingset/module.go:130 +0x98
github.com/cosmos/cosmos-sdk/types/module.(*Manager).EndBlock(0xc000d70930, 0x7fbd833e1e20, 0xc0000460a8, 0x7fbd833f76c0, 0xc00700b780, 0xb, 0x0, 0xc006a12910, 0xd, 0xd529, ...)
  /home/warner/go/pkg/mod/github.com/agoric-labs/cosmos-sdk@v0.34.4-0.20210504041254-7cead73a2b33/types/module/module.go:354 +0x285
github.com/Agoric/agoric-sdk/golang/cosmos/app.(*GaiaApp).EndBlocker(...)
  /home/warner/stuff/agoric/phase3/agoric-sdk/golang/cosmos/app/app.go:572
github.com/cosmos/cosmos-sdk/baseapp.(*BaseApp).EndBlock(0xc000d79380, 0xd529, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
  /home/warner/go/pkg/mod/github.com/agoric-labs/cosmos-sdk@v0.34.4-0.20210504041254-7cead73a2b33/baseapp/abci.go:198 +0x408
github.com/tendermint/tendermint/abci/client.(*localClient).EndBlockSync(0xc000ca6060, 0xd529, 0x0, 0x0, 0x0)
  /home/warner/go/pkg/mod/github.com/tendermint/tendermint@v0.34.10/abci/client/local_client.go:282 +0xdd
github.com/tendermint/tendermint/proxy.(*appConnConsensus).EndBlockSync(0xc00017c850, 0xd529, 0x0, 0x279, 0xc012c28cf0)
  /home/warner/go/pkg/mod/github.com/tendermint/tendermint@v0.34.10/proxy/app_conn.go:89 +0x42
github.com/tendermint/tendermint/state.execBlockOnProxyApp(0x7fbd833e2ba0, 0xc00298f3e0, 0x7fbd833ee600, 0xc00017c850, 0xc0042d2000, 0x7fbd833f78c0, 0xc00047df40, 0x1, 0xc0075245e0, 0x20, ...)
  /home/warner/go/pkg/mod/github.com/tendermint/tendermint@v0.34.10/state/execution.go:327 +0x654
github.com/tendermint/tendermint/state.(*BlockExecutor).ApplyBlock(0xc000cae0e0, 0xb, 0x0, 0x0, 0x0, 0xc002fad010, 0xd, 0x1, 0xd528, 0xc0075245e0, ...)
  /home/warner/go/pkg/mod/github.com/tendermint/tendermint@v0.34.10/state/execution.go:140 +0x165
github.com/tendermint/tendermint/consensus.(*State).finalizeCommit(0xc000bd8000, 0xd529)
  /home/warner/go/pkg/mod/github.com/tendermint/tendermint@v0.34.10/consensus/state.go:1635 +0xb48
github.com/tendermint/tendermint/consensus.(*State).tryFinalizeCommit(0xc000bd8000, 0xd529)
  /home/warner/go/pkg/mod/github.com/tendermint/tendermint@v0.34.10/consensus/state.go:1546 +0x428
github.com/tendermint/tendermint/consensus.(*State).enterCommit.func1(0xc000bd8000, 0xc000000000, 0xd529)
  /home/warner/go/pkg/mod/github.com/tendermint/tendermint@v0.34.10/consensus/state.go:1481 +0x90
github.com/tendermint/tendermint/consensus.(*State).enterCommit(0xc000bd8000, 0xd529, 0x0)
  /home/warner/go/pkg/mod/github.com/tendermint/tendermint@v0.34.10/consensus/state.go:1519 +0x6ca
github.com/tendermint/tendermint/consensus.(*State).addVote(0xc000bd8000, 0xc017f59680, 0xc013a2a2a0, 0x28, 0xc006da1200, 0x107, 0x107)
  /home/warner/go/pkg/mod/github.com/tendermint/tendermint@v0.34.10/consensus/state.go:2132 +0xdd7
github.com/tendermint/tendermint/consensus.(*State).tryAddVote(0xc000bd8000, 0xc017f59680, 0xc013a2a2a0, 0x28, 0x29e6ab262f3e, 0x7fbd83fc6160, 0x2e0f2ea1)
  /home/warner/go/pkg/mod/github.com/tendermint/tendermint@v0.34.10/consensus/state.go:1930 +0x55
github.com/tendermint/tendermint/consensus.(*State).handleMsg(0xc000bd8000, 0x7fbd833a3b20, 0xc017050610, 0xc013a2a2a0, 0x28)
  /home/warner/go/pkg/mod/github.com/tendermint/tendermint@v0.34.10/consensus/state.go:838 +0x8b7
github.com/tendermint/tendermint/consensus.(*State).receiveRoutine(0xc000bd8000, 0x0)
  /home/warner/go/pkg/mod/github.com/tendermint/tendermint@v0.34.10/consensus/state.go:762 +0x7d6
created by github.com/tendermint/tendermint/consensus.(*State).OnStart
  /home/warner/go/pkg/mod/github.com/tendermint/tendermint@v0.34.10/consensus/state.go:378 +0x896
"

This was preceeded by the Zoe vat being terminated because of an allocation meter overrun:

2021-05-10T14:00:02.753Z block-manager: block 54568 begin
2021-05-10T14:00:02.847Z block-manager: block 54568 commit
2021-05-10T14:00:09.002Z block-manager: block 54569 begin
too much computation
2021-05-10T14:00:34.210Z SwingSet: kernel: problem terminating vat v11 (ExitCode#1)
ExitCode#1: v11:zoe exited: too much computation

  at new ErrorCode (packages/xsnap/api.js:49:5)
  at ChildProcess.<anonymous> (packages/xsnap/src/xsnap.js:124:22)
  at ChildProcess.emit (events.js:310:20)
  at ChildProcess.EventEmitter.emit (domain.js:482:12)

Logging sent error stack (RemoteError#13)
RemoteError#13: vat terminated
Error: vat terminated
 at construct (0:0)
 at Error (0:0)
 at makeError (0:0)
 at fullRevive (0:0)
 at unserialize (0:0)
 at notifyOnePromise (0:0)
 at notify (0:0)
 at dispatchToUserspace (0:0)
 at (0:0)

RemoteError#13 ERROR_NOTE: Rejection from: (Error#14) : 3501 . 0
RemoteError#13 ERROR_NOTE: Rejection from: (Error#15) : 3497 . 0
RemoteError#13 ERROR_NOTE: Sent as error:liveSlots:v17#70007
Error#14: Event: 3499.1
Error: Event: 3499.1
 at construct (0:0)
 at Error (0:0)
 at trackTurns (0:0)
 at handle (0:0)
 at (0:0)
 at (0:-611476048)

Error#15: Event: 3495.2
Error: Event: 3495.2
 at construct (0:0)
 at Error (0:0)
 at trackTurns (0:0)
 at handle (0:0)
 at (0:0)
 at openLoan (0:0)
 at openLoan (0:0)
 at makeLoanKit (0:0)
 at (0:0)
 at (0:-611476048)

Logging sent error stack (RemoteError#17)
RemoteError#17: vat terminated
Error: vat terminated
 at construct (0:0)
 at Error (0:0)
 at makeError (0:0)
 at fullRevive (0:0)
 at unserialize (0:0)
 at notifyOnePromise (0:0)
 at notify (0:0)
 at dispatchToUserspace (0:0)
 at (0:0)

RemoteError#17 ERROR_NOTE: Rejection from: (Error#18) : 3533 . 0
RemoteError#17 ERROR_NOTE: Rejection from: (Error#19) : 3529 . 0
RemoteError#17 ERROR_NOTE: Sent as error:liveSlots:v17#70008
Error#18: Event: 3531.1
Error: Event: 3531.1
 at construct (0:0)
 at Error (0:0)
 at trackTurns (0:0)
 at handle (0:0)
 at (0:0)
 at (0:-611476048)

Error#19: Event: 3527.2
Error: Event: 3527.2
 at construct (0:818140016)
 at Error (0:0)
 at trackTurns (0:0)
 at handle (0:0)
 at (0:0)
 at openLoan (0:0)
 at openLoan (0:0)
 at makeLoanKit (0:0)
 at (0:0)
 at (0:-611476048)

Logging sent error stack (RemoteError#21)
RemoteError#21: vat terminated
Error: vat terminated
 at construct (0:0)
 at Error (0:0)
 at makeError (0:0)
 at fullRevive (0:0)
 at unserialize (0:0)
 at notifyOnePromise (0:0)
 at notify (0:0)
 at dispatchToUserspace (0:0)
 at (0:0)

RemoteError#21 ERROR_NOTE: Rejection from: (Error#22) : 3560 . 0
RemoteError#21 ERROR_NOTE: Rejection from: (Error#23) : 3556 . 0
RemoteError#21 ERROR_NOTE: Sent as error:liveSlots:v17#70009
Error#22: Event: 3558.1
Error: Event: 3558.1
 at construct (0:0)
 at Error (0:0)
 at trackTurns (0:0)
 at handle (0:0)
 at (0:0)
 at (0:-611476048)

Error#23: Event: 3554.2
Error: Event: 3554.2
 at construct (0:0)
 at Error (0:0)
 at trackTurns (0:0)
 at handle (0:0)
 at (0:0)
 at openLoan (0:0)
 at openLoan (0:0)
 at makeLoanKit (0:0)
 at (0:0)
 at (0:-611476048)

Logging sent error stack (RemoteError#25)
RemoteError#25: vat terminated
Error: vat terminated
 at construct (0:0)
 at Error (0:0)
 at makeError (0:0)
 at fullRevive (0:0)
 at unserialize (0:0)
 at notifyOnePromise (0:0)
 at notify (0:0)
 at dispatchToUserspace (0:0)
 at (0:0)

RemoteError#25 ERROR_NOTE: Rejection from: (Error#26) : 3647 . 0
RemoteError#25 ERROR_NOTE: Rejection from: (Error#27) : 3643 . 0
RemoteError#25 ERROR_NOTE: Sent as error:liveSlots:v17#70010
Error#26: Event: 3645.1
Error: Event: 3645.1
 at construct (0:0)
 at Error (0:0)
 at trackTurns (0:0)
 at handle (0:0)
 at (0:0)
 at (0:-611476048)

Error#27: Event: 3641.2
Error: Event: 3641.2
 at construct (0:0)
 at Error (0:0)
 at trackTurns (0:0)
 at handle (0:0)
 at (0:0)
 at openLoan (0:0)
 at openLoan (0:0)
 at makeLoanKit (0:0)
 at (0:0)
 at (0:-611476048)

Logging sent error stack (RemoteError#29)
RemoteError#29: vat terminated
Error: vat terminated
 at construct (0:0)
 at Error (0:0)
 at makeError (0:0)
 at fullRevive (0:0)
 at unserialize (0:0)
 at notifyOnePromise (0:0)
 at notify (0:0)
 at dispatchToUserspace (0:0)
 at (0:0)

RemoteError#29 ERROR_NOTE: Rejection from: (Error#30) : 3835 . 0
RemoteError#29 ERROR_NOTE: Rejection from: (Error#31) : 3831 . 0
RemoteError#29 ERROR_NOTE: Sent as error:liveSlots:v17#70011
Error#30: Event: 3833.1
Error: Event: 3833.1
 at construct (0:0)
 at Error (0:0)
 at trackTurns (0:0)
 at handle (0:0)
 at (0:0)
 at (0:-611476048)

Error#31: Event: 3829.2
Error: Event: 3829.2
 at construct (0:818110512)
 at Error (0:0)
 at trackTurns (0:0)
 at handle (0:0)
 at (0:0)
 at openLoan (0:0)
 at openLoan (0:0)
 at makeLoanKit (0:0)
 at (0:0)
 at (0:-611476048)

etc..

warner commented 3 years ago

I'm still investigating what happened leading up to this point.

@JimLarson points out this might just be an LMDB overrun: http://www.lmdb.tech/doc/group__mdb.html#gaa2506ec8dab3d969b0e609cd82e619e5

warner commented 3 years ago

The LMDB database size (basically du -s data.mdb, which isn't fooled by the sparseness of the database) grew to 2.15GB by the time it died, which might be the size we pre-configured the DB to be.

agoric_dbsize-day-1

warner commented 3 years ago

@FUDCo does that ring a bell? I remember we talked about how the LMDB size was pre-selected, but we could probably grow it later.

FUDCo commented 3 years ago

We set the allocation to 2GB as a compromise for development because on Windows (or, rather, WSL) it actually allocates a file of the full size and so eats disk space. However, we can set the size arbitrarily large and we can increase the size between cranks.

katelynsills commented 3 years ago

I think this is below the layers of abstraction that I deal with, so I'm going to remove myself, but please let me know if I can help at all.

dckc commented 3 years ago

Is there more work to do on this? Or would you like to close it, @FUDCo ?

I promoted the diagnosis to the title.

One of my issue maintenance habits is:

same symptoms, same diagnosis: same ticket; re-open if necessary
same symptoms, different diagnosis: keep it separate

Sometimes you just see the same symptoms but you don't have a diagnosis yet. Keep it separate until you have a diagnosis. If the diagnosis is the same, close as dup. (well, github doesn't insitutionalize "resolution: fixed" vs "resolution: duplicate" like trac does... but it does have labels)

dckc commented 3 years ago

3086 also has a diagnosis of LMDB out of space; is it a duplicate of this? Or was I too quick to assign a diagnosis here?

rowgraus commented 3 years ago

Should be dealt with via moving transcripts out

Agoric / agoric-sdk

zoe terminated; consensus failure due to LMDB space exhaustion in phase3 testnet #3063

3086 also has a diagnosis of LMDB out of space; is it a duplicate of this? Or was I too quick to assign a diagnosis here?