Open chillyvee opened 5 months ago
The updates to the iavl
package introduce an optimistic
mode for both exporting and importing tree data, aimed at enhancing performance by allowing raw key-value pairs to be handled without verification. This mode is supported by new methods in the Exporter
and Importer
structs, along with corresponding changes in the tree structures (ImmutableTree
and MutableTree
) to support the export and import of nodes in this optimistic manner. The changes include the ability to swap key-value pairs in a tree structure and to manage node data more efficiently at the byte level.
Files | Change Summaries |
---|---|
export.go , export_test.go |
Introduced an optimistic flag and methods for creating exporters with options, including an optimisticExport method. Added tests for tree exports and imports for consistency. |
immutable_tree.go , mutable_tree.go |
Added OptimisticExport() and OptimisticImport() methods to support exporting and importing tree nodes in an optimistic manner, facilitating the recreation of identical trees. |
import.go , import_test.go |
Added a new optimistic flag and functions for optimistic imports, including a specific OptimisticAdd function to add key-value pairs without verification. Modified the Commit function for handling optimistic imports and added tests covering different node configurations. |
nodedb.go |
Introduced methods for managing node key and value bytes (GetNodeKeyValueBytes and SaveNodeKeyValueBytes ), supporting both legacy and new node formats. Includes functionality for fetching and saving nodes to disk with raw byte arrays. |
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?
Do you use memiavl in evmos, that should only take dozens of minutes to restore.
Do you use memiavl in evmos, that should only take dozens of minutes to restore.
Will try it. Thank you.
What do you think of the concept in this PR? Is it worth reducing CPU usage and inserting keys from a trusted node (for example if a validator runs 2 nodes and wants to copy keys between goleveldb databases)
Optimistic statesync enables faster database import at the restoring node by directly inserting the final key/value pairs into the database
what do you mean by "final key/value pairs", do you mean only handle leaf key values skipping branch nodes?
what do you mean by "final key/value pairs", do you mean only handle leaf key values skipping branch nodes?
Usually for the restoring node to write that to disk, the hash of all children must be computed before parents can be written to disk.
However it is unnecessary to rehash all data and rebuild that IAVL tree. The receiver can write down the result of all the work.
Instead of
db.Set(recursive hashAllChildenAndBuildIAVLTree, value)
Assuming the source is trusted, It is faster to send the resulting key and value
Sender:
resultingKey = hash("k")
resultingValue = buffer.Write(....)
send(resultingKey, resultingValue)
Receiver:
db.Set(resultingKey, resultingValue) <- no tree build or hash operations
Then the node will start as if all key/values were in the database for the current version/blockchain height.
The goal is to make a restore takes no longer than a CSV file copy over ftp/http and insert all key/values into leveldb.
what do you mean by "final key/value pairs", do you mean only handle leaf key values skipping branch nodes?
Usually for the restoring node to write that to disk, the hash of all children must be computed before parents can be written to disk.
However it is unnecessary to rehash all data and rebuild that IAVL tree. The receiver can write down the result of all the work.
Instead of
db.Set(recursive hashAllChildenAndBuildIAVLTree, value)
Assuming the source is trusted, It is faster to send the resulting key and value
Sender:
resultingKey = hash("k") resultingValue = buffer.Write(....) send(resultingKey, resultingValue)
Receiver:
db.Set(resultingKey, resultingValue) <- no tree build or hash operations
Then the node will start as if all key/values were in the database for the current version/blockchain height.
I see, so the data written into application.db
is the same right? but I guess the db write is the slowest part, for example memiavl is much faster despite it also rebuild the iavl tree from current state sync snapshot format.
I see, so the data written into
application.db
is the same right?
Yes same result to goleveldb (or other disk based iavl)
but I guess the db write is the slowest part, for example memiavl is much faster despite it also rebuild the iavl tree from current state sync snapshot format.
Yes, memiavl as a solution when RAM is sufficient.
And this solution when disk write speeds are acceptable (with no other change in expected RAM) and utilize more disk based strategy.
I see, so the data written into
application.db
is the same right?Yes same result to goleveldb (or other disk based iavl)
but I guess the db write is the slowest part, for example memiavl is much faster despite it also rebuild the iavl tree from current state sync snapshot format.
Yes, memiavl as a solution when RAM is sufficient.
memiavl doesn't need much memory actually in our practice.
memiavl doesn't need much memory actually in our practice.
Will get back to you after some testing. Having a hard time getting the right rpc/peers for evmos statesync.
Tried to use memiavl to restore evmos. Each time, the data directory is completely deleted and init is run with the chain-id evmos_9001-2.
After getting all statesync parts, the node crashes and upon start, this message appears
panic: invalid chain-id on BeginBlock; expected: , got: evmos_9001-2
The faster iavl build time is confirmed (3+ hours. Same even with local snapshot files, no network dependency + restore on a fast machine with high ram + raid0 nvme), but the node itself doesn't seem to get the chain-id, so full operation could not be confirmed on evmos.
Noticed that there is a fair amount of code changes and dependency on the cronos chain repository (github.com/crypto-org-chain/cronos/memiavl) necessary to get enable memiavl for a new chain. That might be more than a chain developer is able to do and maintain. Another thought is to bring memiavl mainline into cosmos/iavl or cosmos/memiavl. yihuang, you are most familiar so you can probably give a good comment on options or best way to implement.
This optimisitic export and import allows minimal code changes and faster restores.
The faster iavl build time is confirmed (3+ hours. Same even with local snapshot files, no network dependency + restore on a fast machine with high ram + raid0 nvme)
what is your testing procedure, how large is the snapshot file, it's dozens of minutes for us, do you use the local state sync commands (./simappd snapshots --help
)?
Did not realize the chain-id was pulled from client.toml. Fixed that and statesync with memiavl works. Will let evmos run for a few more hours to generate snapshots for local restore testing.
Noticed a failed statesync snapshot. Suspect that memiavl pruned the oldest height before the statesync snapshot could finish. Future statesync snapshots were not generated. Increased memiavl snapshots held to allow more time for statesync export.
Do you think it would help to document the relation between these app.toml settings so that a configuration is practical?
pruning-keep-recent = "500000"
[memiavl]
snapshot-interval = 1000
snapshot-keep-recent = 1
[state-sync]
snapshot-interval = 100000
snapshot-keep-recent = 5
Statesync for evmos took about 1h40m to download and apply 2924 chunks over p2p. Snapshot was old and took 4 hours to BlockSync to the latest block.
Locally disk reads are fast. Reading all 2924 chunks takes only 1m30s.
Will retry again when a local snapshot is successfully created for a more recent height.
Noticed a failed statesync snapshot. Suspect that memiavl pruned the oldest height before the statesync snapshot could finish. Future statesync snapshots were not generated. Increased memiavl snapshots held to allow more time for statesync export.
do you have the error message? Pruning is just deleting the files from filesystem, according to my understanding, the already opened file handlers should still work as normal, including the mmap regions created on that, should still works until you close them.
Do you think it would help to document the relation between these app.toml settings so that a configuration is practical?
There could be different combinations, but for most of the use cases, I think state-sync.snapshot-interval
can be simply set to the same as memiavl.snapshot-interval
when enabled, so a state sync snapshot is created after each memiavl snapshot is created.
We set memiavl.snapshot-interval
to 10000
, it can even be larger if you can accept a little bit longer worst-case start-up time, 10000
translates to a few seconds in our case.
I'd also recommend setting the memiavl.snapshot-interval
much larger during catching-up, but that needs adjusting manually later on.
The old pruning-*
options have no effect on memiavl.
Noticed that there is a fair amount of code changes
I think most of the code changes are for the versiondb integration? memiavl integration itself should be fairly simple.
and dependency on the cronos chain repository (github.com/crypto-org-chain/cronos/memiavl) necessary to get enable memiavl for a new chain. That might be more than a chain developer is able to do and maintain. Another thought is to bring memiavl mainline into cosmos/iavl or cosmos/memiavl. yihuang, you are most familiar so you can probably give a good comment on options or best way to implement.
it's a good question on long-term maintenance, we had a few discussions on storage workgroup calls before, but the decision is to keep it as a third-party thing. personally I'm totally fine and supportive if we maintain it under cosmos org.
There could be different combinations, but for most of the use cases, I think
state-sync.snapshot-interval
can be simply set to the same asmemiavl.snapshot-interval
when enabled, so a state sync snapshot is created after each memiavl snapshot is created. We setmemiavl.snapshot-interval
to10000
, it can even be larger if you can accept a little bit longer worst-case start-up time,10000
translates to a few seconds in our case. I'd also recommend setting thememiavl.snapshot-interval
much larger during catching-up, but that needs adjusting manually later on. The oldpruning-*
options have no effect on memiavl.
Sharing a problem encountered for snapshots using memiavl + evmosd on mainnet
ERR failed to create state snapshot err="failed to generate snapshot chunk 0: snapshot is not created yet: height: 19266752" height=192667
Looking at snapshots directory, initial restore height of 19072000 and the next snapshot is available. The rest appear to fail
28G 19072000
28G 19080000
1.5K 19150000
1.5K 19160000
1.5K 19170000
1.5K 19180000
1.5K 19190000
1.5K 19217745
1.5K 19224746
1.5K 19231747
1.5K 19238748
1.5K 19245749
1.5K 19252750
1.5K 19259751
1.5K 19266752
data/memiavl.db has data as expected
memiavl.db
512 current
512 LOCK
171G snapshot-00000000000019246541
171G snapshot-00000000000019251544
171G snapshot-00000000000019256547
171G snapshot-00000000000019261550
171G snapshot-00000000000019266553
171G snapshot-00000000000019271556
evmosd snapshots list only shows the 2 successful data/snapshot heights
% evmosd snapshots list
height: 19080000 format: 3 chunks: 2923
height: 19072000 format: 3 chunks: 2924
This is the current app.toml settings
[memiavl]
async-commit-buffer = 0
cache-size = 1000
enable = true
snapshot-interval = 5003
snapshot-keep-recent = 5
zero-copy = false
[state-sync]
snapshot-interval = 7001
snapshot-keep-recent = 2
Are you able to see the same issue on your side? Any alternate settings that you recommend checking?
Sharing a problem encountered for snapshots using memiavl + evmosd on mainnet
ERR failed to create state snapshot err="failed to generate snapshot chunk 0: snapshot is not created yet: height: 19266752" height=192667
Looking at snapshots directory, initial restore height of 19072000 and the next snapshot is available. The rest appear to fail
The fifth parameter in memiavlstore.SetupMemIAVL is supportExportNonSnapshotVersion bool
, where false
means don't support taking a snapshot on non-snapshot versions.
The reason is there's a specialized algorithm when taking state-sync snapshot from a memiavl snapshot directly, which is more efficient than the generic algorithm that traverses the tree (although in practice, the speed difference seems marginal), so you have two options here:
supportExportNonSnapshotVersion
to true
, and it'll support exporting snapshots at any version.state-sync.snapshot-interval
to be multiples of memiavl.snapshot-interval
, so when you take a state-sync snapshot, there's always a memiavl snapshot for it.I think the second option should be good enough for the majority of use cases, that is why we set the supportExportNonSnapshotVersion
to false
by default.
Adjusted snapshot settings and retrying
The fifth parameter in memiavlstore.SetupMemIAVL
Cosmos code has many unnamed variables. This can make adding a configuration variable difficult (find if a parameter exists, determine if parameter can be passed, add parameter in calling, edit function call in called library)
Is there any "guidance" against code structure such as:
baseAppOptions = memiavlstore.SetupMemIAVL(common, options)
baseAppOptions.setSupportExportNonSnapshotVersion(true)
baseAppOptions.FinishSetup()
personally I'm totally fine and supportive if we maintain it under cosmos org.
Would you prefer to start that yourself? Or could we prepare a repository and move that to cosmos org once it makes sense?
Do you think it is generally "safe" to add memiavl to any chain of any sdk version with this pattern of code?
baseAppOptions = memiavlstore.SetupMemIAVL(logger, homePath, appOpts, false, false, baseAppOptions)
// Setup Mempool and Proposal Handlers
baseAppOptions = append(baseAppOptions, func(app *baseapp.BaseApp) {
Cosmos code has many unnamed variables. This can make adding a configuration variable difficult (find if a parameter exists, determine if parameter can be passed, add parameter in calling, edit function call in called library)
yeah, agree on this specific option, but it's usually static, there are several other options that are configurable in app.toml
, we can maybe add this one to config file as well.
Do you think it is generally "safe" to add memiavl to any chain of any sdk version with this pattern of code?
I think so, we've been using on production for a while, the multistore wrapper might need some adjustments when updating to different sdk versions, we have compatible versions for sdk 0.46 and 0.47, 0.50 is on draft PR.
we have compatible versions for sdk 0.46 and 0.47, 0.50 is on draft PR.
Are the compatible version branches on https://github.com/crypto-org-chain/cronos/tree/main/memiavl ?
we have compatible versions for sdk 0.46 and 0.47, 0.50 is on draft PR.
Are the compatible version branches on https://github.com/crypto-org-chain/cronos/tree/main/memiavl ?
Yeah, currently, you need to follow cronos release branch, v1.0.x
for SDK 0.46, main
for SDK 0.47.
I'm preparing a standalone repo at https://github.com/crypto-org-chain/memiavl, will make public soon.
0.50 is on draft PR.
Please share when available. Would like to run on a v0.50 testnet
0.50 is on draft PR.
Please share when available. Would like to run on a v0.50 testnet
it's available here, https://github.com/crypto-org-chain/cronos/pull/1331, it's able to build and run basic devnet, but still many tests to fix. the changes to memiavl and store wrappers are straightforward though.
main
for SDK 0.47.
Implemented memiavl for gaiad in PR https://github.com/cosmos/gaia/pull/2973
memiavl works well for gaiad in theta testnet
statesync node over p2p then put into validator mode. snapshots creation also works. No issues so far.
If you have any comments/suggestions regarding method of integration for chains, your feedback is always welcome.
@yihuang Do you know if the statesync snapshot/restore works properly for memiavl + cosmwasm module chains?
follow cronos release branch, v1.0.x for SDK 0.46,
Was not able to locate memiavl in: https://github.com/crypto-org-chain/cronos/tree/release/v1.0.x
Found this with a cosmos-sdk v0.46 reference. Is this the right branch for v0.46? https://github.com/crypto-org-chain/cronos/blob/memiavl/v0.0.5/memiavl/go.mod
@yihuang Do you know if the statesync snapshot/restore works properly for memiavl + cosmwasm module chains?
I think so, it only replaces root ms, the other part is still managed by sdk, but we didn't tested though.
Was not able to locate memiavl in: https://github.com/crypto-org-chain/cronos/tree/release/v1.0.x
Found this with a cosmos-sdk v0.46 reference. Is this the right branch for v0.46? https://github.com/crypto-org-chain/cronos/blob/memiavl/v0.0.5/memiavl/go.mod
Also is it difficult to modify for cosmos-sdk v0.45?
Was not able to locate memiavl in: https://github.com/crypto-org-chain/cronos/tree/release/v1.0.x
Found this with a cosmos-sdk v0.46 reference. Is this the right branch for v0.46? https://github.com/crypto-org-chain/cronos/blob/memiavl/v0.0.5/memiavl/go.mod
It's a bit hard to manage the release, it'll be easier when using a standalone repo, I suggest following the Cronos's go.mod
itself, for example, right now it uses v0.0.0-20240130060137-c12016e4052e
:
https://github.com/crypto-org-chain/cronos/blob/release/v1.0.x/go.mod#L222
Also is it difficult to modify for cosmos-sdk v0.45?
I think not difficult, since the store interfaces don't change much from 0.45 to 0.46.
I think not difficult, since the store interfaces don't change much from 0.45 to 0.46.
Attempting to create a v0.45 branch for memiavl based on Jackal Chain
Here is the current progress code with current errors. Rootmulti is not connecting properly yet.
https://github.com/chillyvee/cronos/tree/cv045wip
https://github.com/chillyvee/canine-chain/tree/cv3.1.3memiavl
data/memiavl.db directory appears
With app.toml [state-sync] configured the error is:
failed to load latest version: state sync snapshots require a rootmulti store
With app.toml [state-sync] removed the error is:
panic: uncached KVStore is not supposed to be accessed directly
node canined starts normally after app.toml [memiavl] is removed
could it be a problem with cosmwasm trying to pin codes early? (InitializePinnedCodes)
goroutine 1 [running]:
github.com/crypto-org-chain/cronos/store/rootmulti.(*Store).GetKVStore(0x0?, {0x0?, 0x0?})
github.com/crypto-org-chain/cronos/store@v0.0.0-00010101000000-000000000000/rootmulti/store.go:231 +0x25
github.com/cosmos/cosmos-sdk/types.Context.KVStore({{0x2aa5c60, 0x3c11940}, {0x7fd020267df8, 0xc0015a4a90}, {{0x0, 0x0}, {0x0, 0x0}, 0x0, {0x0, ...}, ...}, ...}, ...)
github.com/cosmos/cosmos-sdk@v0.45.17/types/context.go:253 +0x92
github.com/CosmWasm/wasmd/x/wasm/keeper.Keeper.InitializePinnedCodes({{0x2a8c2a0, 0xc000ea0910}, {0x2abc920, 0xc0002f6470}, {0x2a907d0, 0xc0000be750}, {0x2a86ca0, 0xc000eb02e0}, {0x2a863e0, 0xc0001e30f8}, ...}, ...)
github.com/CosmWasm/wasmd@v0.32.0/x/wasm/keeper/keeper.go:890 +0x79
github.com/jackalLabs/canine-chain/v3/app.NewJackalApp({0x2aa5c98, 0xc000f8c900}, {0x2ab78b8, 0xc000186128}, {0x0, 0x0}, 0x1, 0xc001168e40, {0xc000056468, 0x14}, ...)
github.com/jackalLabs/canine-chain/v3/app/app.go:1035 +0x9fd8
main.appCreator.newApp({{{0x2aabdb8, 0xc0001204e0}, {0x2abc920, 0xc0002f6470}, {0x2ab4b60, 0xc00047e4c0}, 0xc000672468}}, {0x2aa5c98, 0xc000f8c900}, {0x2ab78b8, ...}, ...)
github.com/jackalLabs/canine-chain/v3/cmd/canined/root.go:221 +0xa2c
github.com/cosmos/cosmos-sdk/server.startInProcess(_, {{0x0, 0x0, 0x0}, {0x2ac6d90, 0xc000fa2d80}, {0xc00013c878, 0x8}, {0x2aabd70, 0xc0002f6470}, ...}, ...)
github.com/cosmos/cosmos-sdk@v0.45.17/server/start.go:260 +0x532
github.com/cosmos/cosmos-sdk/server.StartCmd.func2(0xc000347900?, {0x3c11940?, 0x0?, 0x0?})
github.com/cosmos/cosmos-sdk@v0.45.17/server/start.go:128 +0x1e8
github.com/spf13/cobra.(*Command).execute(0xc000276c00, {0x3c11940, 0x0, 0x0})
github.com/spf13/cobra@v1.7.0/command.go:940 +0x87c
github.com/spf13/cobra.(*Command).ExecuteC(0xc000313800)
github.com/spf13/cobra@v1.7.0/command.go:1068 +0x3a5
github.com/spf13/cobra.(*Command).Execute(...)
github.com/spf13/cobra@v1.7.0/command.go:992
github.com/spf13/cobra.(*Command).ExecuteContext(...)
github.com/spf13/cobra@v1.7.0/command.go:985
github.com/cosmos/cosmos-sdk/server/cmd.Execute(0x3b1b298?, {0xc000056468, 0x14})
github.com/cosmos/cosmos-sdk@v0.45.17/server/cmd/execute.go:36 +0x167
main.main()
github.com/jackalLabs/canine-chain/v3/cmd/canined/main.go:15 +0x26
Removing wasm InitializePinnedCodes and disabling state-sync enables app start. Suspect this is okay since memiavl directly triggers state-sync snapshot Create() after memiavl snapshot is finished.
[memiavl]
async-commit-buffer = 0
cache-size = 1000
enable = true
snapshot-interval = 61
snapshot-keep-recent = 3
zero-copy = false
#[state-sync]
# snapshot-interval = 61
# snapshot-keep-recent = 3
and editing canined/app/app.go
diff --git a/app/app.go b/app/app.go
index d712b96c..8513e492 100644
--- a/app/app.go
+++ b/app/app.go
@@ -119,7 +119,7 @@ import (
tmjson "github.com/tendermint/tendermint/libs/json"
"github.com/tendermint/tendermint/libs/log"
tmos "github.com/tendermint/tendermint/libs/os"
- tmproto "github.com/tendermint/tendermint/proto/tendermint/types"
+ //tmproto "github.com/tendermint/tendermint/proto/tendermint/types"
dbm "github.com/tendermint/tm-db"
"github.com/CosmWasm/wasmd/x/wasm"
@@ -1029,12 +1029,13 @@ func NewJackalApp(
if err := app.LoadLatestVersion(); err != nil {
tmos.Exit(fmt.Sprintf("failed to load latest version: %s", err))
}
- ctx := app.BaseApp.NewUncachedContext(true, tmproto.Header{})
-
+ //ctx := app.BaseApp.NewUncachedContext(true, tmproto.Header{})
// Initialize pinned codes in wasmvm as they are not persisted there
+ /*
if err := app.wasmKeeper.InitializePinnedCodes(ctx); err != nil {
tmos.Exit(fmt.Sprintf("failed initialize pinned codes %s", err))
}
+ */
}
data/memiavl.db/* fills up with data quickly after start
But fails near end of statesync restore where cosmwasm attempts to restore wasm files
INF Fetching snapshot chunk chunk=37 format=1 height=6802000 module=statesync total=38
INF Applied snapshot chunk to ABCI app chunk=33 format=1 height=6802000 module=statesync total=38
INF Applied snapshot chunk to ABCI app chunk=34 format=1 height=6802000 module=statesync total=38
INF Applied snapshot chunk to ABCI app chunk=35 format=1 height=6802000 module=statesync total=38
INF Applied snapshot chunk to ABCI app chunk=36 format=1 height=6802000 module=statesync total=38
panic: uncached KVStore is not supposed to be accessed directly
goroutine 224 [running]:
github.com/crypto-org-chain/cronos/store/rootmulti.(*Store).GetKVStore(0x0?, {0x0?, 0x0?})
github.com/crypto-org-chain/cronos/store@v0.0.0-00010101000000-000000000000/rootmulti/store.go:231 +0x25
github.com/cosmos/cosmos-sdk/types.Context.KVStore({{0x2aa3b20, 0x3c0f940}, {0x7f419807b578, 0xc000dfcd00}, {{0x0, 0x0}, {0x0, 0x0}, 0x67ca50, {0x0, ...}, ...}, ...}, ...)
github.com/cosmos/cosmos-sdk@v0.45.17/types/context.go:253 +0x92
github.com/CosmWasm/wasmd/x/wasm/keeper.Keeper.InitializePinnedCodes({{0x2a8a160, 0xc000ef85d0}, {0x2aba7e0, 0xc0002d4340}, {0x2a8e690, 0xc0005ff320}, {0x2a84b60, 0xc00037c020}, {0x2a842a0, 0xc00022d378}, ...}, ...)
github.com/CosmWasm/wasmd@v0.32.0/x/wasm/keeper/keeper.go:890 +0x79
github.com/CosmWasm/wasmd/x/wasm/keeper.finalizeV1({{0x2aa3b20, 0x3c0f940}, {0x7f419807b578, 0xc000dfcd00}, {{0x0, 0x0}, {0x0, 0x0}, 0x67ca50, {0x0, ...}, ...}, ...}, ...)
github.com/CosmWasm/wasmd@v0.32.0/x/wasm/keeper/snapshotter.go:120 +0x98
github.com/CosmWasm/wasmd/x/wasm/keeper.(*WasmSnapshotter).processAllItems(0xc000121620, 0x1?, {0x2a84880, 0xc0444568a0}, 0x2832b38, 0x2832b28)
github.com/CosmWasm/wasmd@v0.32.0/x/wasm/keeper/snapshotter.go:155 +0x2b5
github.com/CosmWasm/wasmd/x/wasm/keeper.(*WasmSnapshotter).Restore(0x1cd5340?, 0xc0000caff0?, 0xb664a70?, {0x2a84880?, 0xc0444568a0?})
github.com/CosmWasm/wasmd@v0.32.0/x/wasm/keeper/snapshotter.go:96 +0x36
github.com/cosmos/cosmos-sdk/snapshots.(*Manager).restoreSnapshot(0xc000510850, {0x67ca50, 0x1, 0x26, {0xc012f9b2a0, 0x20, 0x20}, {{0xc04107c600, 0x26, 0x40}}}, ...)
github.com/cosmos/cosmos-sdk@v0.45.17/snapshots/manager.go:314 +0x1f2
github.com/cosmos/cosmos-sdk/snapshots.(*Manager).Restore.func1()
github.com/cosmos/cosmos-sdk@v0.45.17/snapshots/manager.go:271 +0x47
created by github.com/cosmos/cosmos-sdk/snapshots.(*Manager).Restore in goroutine 148
github.com/cosmos/cosmos-sdk@v0.45.17/snapshots/manager.go:270 +0x205
Looks more like a cosmwasm + memiavl problem and not directly a cosmos-sdk v0.45 + memiavl problem.
Disabling pinned codes after wasm restore finalize x/wasm/keeper/snapshotter.go
func finalizeV1(ctx sdk.Context, k *Keeper) error {
return nil
// FIXME: ensure all codes have been uploaded?
//return k.InitializePinnedCodes(ctx)
}
But restore fails with appHash verification problem
ERR appHash verification failed actual="��\x1f�~t�=+\x06�@\x14\x15�}0-TmznǙ̶x\b\f<��" expected="F�\x1d+��:��\x1d�^��n)�\x14p\x7f\x1e\x17\x02f\x195�r�\b�a" module=statesync
ERR State sync failed err="snapshot restoration failed: verification failed" module=statesync
Have you tried memiavl with other cosmwasm enabled chains?
If you have a moment, can you suggest how to handle cosmwasm wasm file restores? If you don't see reason, please feedback so we can reach out to confio.
wasm InitializedPinnedCodes calls cosmos-sdk which calls rootmulti GetKVStore
github.com/crypto-org-chain/cronos/store/rootmulti.(*Store).GetKVStore(0x0?, {0x0?, 0x0?})
github.com/crypto-org-chain/cronos/store@v0.0.0-00010101000000-000000000000/rootmulti/store.go:231 +0x25
github.com/cosmos/cosmos-sdk/types.Context.KVStore({{0x2aa3b20, 0x3c0f940}, {0x7f419807b578, 0xc000dfcd00}, {{0x0, 0x0}, {0x0, 0x0}, 0x67ca50, {0x0, ...}, ...}, ...}, ...)
github.com/cosmos/cosmos-sdk@v0.45.17/types/context.go:253 +0x92
github.com/CosmWasm/wasmd/x/wasm/keeper.Keeper.InitializePinnedCodes({{0x2a8a160, 0xc000ef85d0}, {0x2aba7e0, 0xc0002d4340}, {0x2a8e690, 0xc0005ff320}, {0x2a84b60, 0xc00037c020}, {0x2a842a0, 0xc00022d378}, ...}, ...)
github.com/CosmWasm/wasmd@v0.32.0/x/wasm/keeper/keeper.go:890 +0x79
In this case cronos/store/rootmulti.(*Store).GetKVStore(0x0?, {0x0?, 0x0?}) immediately panics
Adding cosmwasm seems to require a way to directly get a named store for wasmd keeper to read and write to.
Should cronos/store/rootmulti.GetKVStore() proxy a call to CacheMultiStore to avoid panic: uncached KVStore is not supposed to be accessed directly
?
Attempted to give direct access to stores in this commit: https://github.com/chillyvee/cronos/commit/4debbd6457d4442528c63c434f3510587cdfb19e
This avoids wasm crashing on restore/startup
wasm files appear
But maybe the function call is wrong for GetKVStore so the apphash doesn't match after wasm restore.
Statesync restore without memiavl (enabled=false) succeeds:
INF Verified ABCI app appHash="��o&\">����O$_\r1:.�j�XZ\x02\bˀ6���J�" height=6810000 module=statesync
INF Snapshot restored format=1 hash="��|IB�c\f��P��TDM�)�M\x02)Q�/��\x18\x15��U" height=6810000 module=statesync
uncached KVStore is not supposed to be accessed directly
You can try remove this check, it should be fine especially for read-only cases
uncached KVStore is not supposed to be accessed directly
You can try remove this check, it should be fine especially for read-only cases
For statesync cosmwasm restore, write access is required. Can you recommend a different object to return?
It's a bit hard to manage the release, it'll be easier when using a standalone repo, I suggest following the Cronos's
go.mod
itself, for example, right now it usesv0.0.0-20240130060137-c12016e4052e
:
Attempted to use the recommended branch/commit for v0.46 target using gitopia chain without cosmwasm
Without memiavl the chain statesync restore is successful
Followed the replace to utilize memiavl within the same branch https://github.com/crypto-org-chain/cronos/blob/release/memiavl/v0.0.x/store/go.mod#L135
github.com/crypto-org-chain/cronos/memiavl => ../memiavl
This branch downgrades the commit above to match cosmos-sdk v0.46.13 https://github.com/chillyvee/cronos/tree/gitopia/cv0.46.13store
This branch of gitopia uses the branch above https://github.com/chillyvee/gitopia/commit/3a4f8fe37b5fbfdca59ebc69f19e4d17e97775e1
On statesync restore, there is a apphash mismatch error
It's a bit hard to manage the release, it'll be easier when using a standalone repo, I suggest following the Cronos's
go.mod
itself, for example, right now it usesv0.0.0-20240130060137-c12016e4052e
:Attempted to use the recommended branch/commit for v0.46 target using gitopia chain without cosmwasm
Without memiavl the chain statesync restore is successful
Followed the replace to utilize memiavl within the same branch https://github.com/crypto-org-chain/cronos/blob/release/memiavl/v0.0.x/store/go.mod#L135
github.com/crypto-org-chain/cronos/memiavl => ../memiavl
This branch downgrades the commit above to match cosmos-sdk v0.46.13 https://github.com/chillyvee/cronos/tree/gitopia/cv0.46.13store
This branch of gitopia uses the branch above chillyvee/gitopia@3a4f8fe
On statesync restore, there is a apphash mismatch error
with sdk 0.46/0.45, the option memiavlSdk46Compact
should be set to true
, can you give that a try?
there's a breaking change between sdk 0.47 and older versions on root hash calculation, that's this option is all about.
uncached KVStore is not supposed to be accessed directly
You can try remove this check, it should be fine especially for read-only cases
For statesync cosmwasm restore, write access is required. Can you recommend a different object to return?
Just tagged a new version, please update to memiavl v0.0.6
, store v0.0.6
, (versiondb v0.0.6
if used).
The above check is removed, there's no fundamental issue with accessing uncached stores, just be aware that you won't read uncommitted changes.
with sdk 0.46/0.45, the option
memiavlSdk46Compact
should be set totrue
, can you give that a try?
Success on statesync for jackal, memiavlSdk46Compact = true Success on statesync for gitopia, memiavlSdk46Compact = true
Just tagged a new version, please update to memiavl v0.0.6, store v0.0.6, (versiondb v0.0.6 if used).
Had some trouble during go mod tidy due to some cosmos-sdk dependencies. Will feedback later after resolved.
Tagging with matching cosmos-sdk versions can be helpful for using the right version. memiavl/cosmos-sdk0.46.13
for example. You can use the idea after memiavl is split into a dedicated repo.
with sdk 0.46/0.45, the option
memiavlSdk46Compact
should be set totrue
, can you give that a try?Success on statesync for jackal, memiavlSdk46Compact = true Success on statesync for gitopia, memiavlSdk46Compact = true
great!
Just tagged a new version, please update to memiavl v0.0.6, store v0.0.6, (versiondb v0.0.6 if used).
Had some trouble during go mod tidy due to some cosmos-sdk dependencies. Will feedback later after resolved.
Tagging with matching cosmos-sdk versions can be helpful for using the right version.
memiavl/cosmos-sdk0.46.13
for example. You can use the idea after memiavl is split into a dedicated repo.
you'll need a little bit change in the versiondb setup code, please check the patch here: https://github.com/crypto-org-chain/cronos/pull/1339
you'll need a little bit change in the versiondb
Thank you. Will try again when there is time and share the result
Optimistic Export and Import aims to reduce p2p statesync recovery times.
Validators attempting to recover nodes for cosmos-sdk chains with significant state size may be slashed for downtime prior to recovery.
Although all data is verified during tree build, rebuilding the entire IAVL tree using minimal data incurs high CPU overhead.
Evmos for example takes over 16 hours to rebuild IAVL on a "fast" CPU/DISK machine and can appear to "never finish". The node must then fast-sync 16 hours of missed blocks which can take an additional 4-5 hours. Slashing for downtime is within 30 hours meaning a validator must succeed on the first try or be slashed.
Since statesync appears to "hang", validators download a 244GB file from Polkachu to restore data directories. This is possible because Polkachu is a trusted party. Validators do not appear to validate the correctness of the IAVL tree so this method is less trusted but necessary. The problem is when Polkachu only makes 1 snapshot every 24 hours. A validator may need to fast-sync 23 hours of lost block, wait for the next Polkachu snapshot, or try another random less known snapshot.
Optimistic statesync enables faster database import at the restoring node by directly inserting the final key/value pairs into the database. As long as validators statesync from a trusted node, this can be a good alternative to downloading data directory zip files.
Cosmos-sdk should broadcast a different snapshot format so that nodes can select optimistic statesync node sources.
There are some code locations below where there may be more efficient methods to obtain the raw key/value storage than recalculation.
Comments welcome.