cosmos / cosmos-sdk

:chains: A Framework for Building High Value Public Blockchains :sparkles:
https://cosmos.network/
Apache License 2.0

Add hooks to allow app modules to add things to state-sync #7340

Closed michaelfig closed 2 years ago

michaelfig commented 4 years ago

Summary

Please provide a way for app modules (like Agoric's x/swingset) to add additional state to the snapshots generated by the state-sync mechanism for the new clients to download.

@erikgrinaker said on Discord:

Currently the Cosmos SDK automatically dumps IAVL data from the rootmulti store, there aren't any hooks to extend this - but feel free to open a feature request for it, although it'll likely need some design work.

Problem Definition

For the https://agoric.com chain, we maintain some crucial (and large) state outside of the IAVL tree. We'd like our x/swingset module to be able to save external state to snapshots so that when the snapshot is downloaded by clients doing state-sync, they will be able to use our external state to catch up to the external state of the chain as well.

We use the Cosmos SDK as I/O from our deterministic JavaScript world. So, if a new validator is going to skip past (many) blocks while using state-sync, that validator also needs to download enough information to allow it to skip past evaluating all the blocks to update the JS heap.

The feature will accommodate chains that have state outside of the IAVL tree.

Including this feature adds complexity, but it will make state-sync more general.

Proposal

Somehow provide a callback so that app modules can choose to publish additional state to the snapshots. Also, allow the module to decide how to interpret that downloaded state during state-sync. I don't have suggestions for how to accomplish this and am leaning on the Cosmos SDK team to decide the best course of action.

On Discord, @zmanian said:

I suspect the Agoric heap state will be large enough that it should be sent as multiple chunks

and @erikgrinaker replied:

That shouldn't really be a problem, we currently just chunk a byte stream of serialized data items. It shouldn't be too hard to extend this with additional item types, but needs some sort of API to register schemas and (de)serializers and such. I suppose in the most naïve case the API could just export a binary reader/writer, although that might prevent us from implementing any sort of incremental chunk verification scheme later.


michaelfig commented 4 years ago

Can somebody please triage this request? I'd like to discuss what we can do to implement it.

zmanian commented 4 years ago

What's the status of heap snapshots for Agoric?

Because that would be a dependency.

michaelfig commented 4 years ago

We don't need heap snapshots to take advantage of state sync. Even with heap snapshots, we still need to supply a transcript that is the delta from the snapshot to the state as of the synced block. Without heap snapshots, we can supply transcripts that go back to genesis, which is not as optimised, but still correct.

But without this per-module state sync hook, we can't even send transcripts, which means we can't do state sync at all.

zmanian commented 4 years ago

My impression is that a transcript is effectively the same as replaying every block?

michaelfig commented 4 years ago

My impression is that a transcript is effectively the same as replaying every block?

Basically, but at the smart contract level, not the Cosmos SDK level. Heap snapshots are essentially optimised transcripts.

My point is that we can test and debug the Cosmos module state sync hook even before we have full application heap snapshot support (which is in the pipeline).

zmanian commented 4 years ago

can you have a begin/end blocker right now that captures the transcript as a data structure for a certain height?

michaelfig commented 4 years ago

can you have a begin/end blocker right now that captures the transcript as a data structure for a certain height?

We've talked about doing this as an experiment. Last time we tried this we had performance problems, but that was a year ago, and we've learned more since then.

We'll definitely try it again to see if it works to put everything in the IAVL tree.

zmanian commented 4 years ago

The state sync system doesn't require putting everything in the IAVL tree. I think that would be a performance disaster.

It just requires storing and retrieving things in a chunkable, verifiable form at the snapshot height.

There needs to be some work on the SnapshotManager to support this, but this is the path I'd recommend.

michaelfig commented 4 years ago

The state sync system doesn't require putting everything in the IAVL tree. I think that would be a performance disaster.

Yeah, that was our first experience. Not only the transcripts, but the entire kernel database needs to be saved for a given snapshot, which is a lot of stuff.

There needs to be some work on the SnapshotManager to support this, but this is the path I'd recommend.

Sounds good. Any way we can get this on people's radar for the folks who are actually going to implement it? I don't know anything about SnapshotManager yet, and our use case would be satisfied by something fairly generic, so I hope I'm not the one to do the work. :wink:

michaelfig commented 4 years ago

can you have a begin/end blocker right now that captures the transcript as a data structure for a certain height?

Oh, were you meaning a data structure to send to the SnapshotManager? Yes, we can do that.

okwme commented 4 years ago

Hi all, the ICF and core SDK contributors are still working through how to prioritize community requests for features within the available developer resources dedicated to the Cosmos SDK.

For this one, it would be great to understand whether there are other SDK users who store chain state outside of IAVL and also want this feature in order to utilize state sync.

I'll come back to this thread with updates on that process as it should be eventually reflected in the CONTRIBUTING.md section of this repo.

In the meantime, I'd encourage you to continue fleshing out the needs from a user perspective, as well as seeking out other users who would benefit from this feature, to factor into the conversation about prioritization.

alexanderbez commented 4 years ago

To be quite honest, I'm not totally understanding the problem and context here. Is this a pretty niche use-case? I'm having a hard time wrapping my head around how something like this would even work.

robert-zaremba commented 4 years ago

@alexanderbez - the use-case is to enable syncing a non-SDK DB within the snapshot-sync mechanism.

Apparently, Agoric is using a separate database for one of their modules, and they would like to benefit from the snapshot-sync mechanism.

robert-zaremba commented 4 years ago

The state sync systems doesn't require putting everything in the IAVL tree.

Correct. But currently it only supports the Multistore.

@michaelfig What DB do you use?

This is how I see it:

  1. We will need to define an interface for an external DB Reader and Writer (a rough sketch follows below).
  2. Reader and Writer will need to be implemented on the module side.
  3. I would strongly insist that Reader and Writer use a checkpointed state of the DB. Otherwise the whole process would hinge on the module replaying its messages from genesis, and things can go nasty if some of those messages interact with other modules and change the multistore state.
  4. Once the snapshot is completed, the node will use the standard mechanism to replay transactions from the last block in the snapshot until now.
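A minimal sketch of what such a module-side interface could look like, assuming hypothetical names (nothing like this exists in the SDK yet; the checkpointing semantics of point 3 are left to the module):

```go
package snapshots

import "io"

// ExtensionSnapshotter is a hypothetical interface that a module (e.g.
// x/swingset or x/wasm) keeping state outside the IAVL multistore would
// register with the snapshot manager. All names are illustrative.
type ExtensionSnapshotter interface {
	// SnapshotName identifies this module's payload within a snapshot.
	SnapshotName() string

	// Snapshot streams a checkpointed copy of the module's external
	// state at the given height into the snapshot being produced.
	Snapshot(height uint64, w io.Writer) error

	// Restore consumes the module's payload from a downloaded snapshot
	// and rebuilds the external state before block replay begins.
	Restore(height uint64, r io.Reader) error
}
```
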
robert-zaremba commented 4 years ago

@michaelfig to help prioritize this, could you assess your level of support: how much do you need this, and by when?

alexanderbez commented 4 years ago

Seems like a very niche use-case. I would like to see an ADR or RFC describing the exact approach taken here. I'm wary of bringing additional complexity into the SDK's store logic, but if this can be done cleanly it sounds reasonable to me.

michaelfig commented 4 years ago

The state sync system doesn't require putting everything in the IAVL tree.

That's a relief.

Correct. But currently it only supports the Multistore.

@michaelfig What DB do you use?

We use a set of databases optimised for our particular access patterns, and we'd like the flexibility to be able to change their implementation. The representation is not exposed to Cosmos SDK, and is not necessarily portable to other environments.

  1. We will need to define an interface for an external DB Reader and Writer.
  2. Reader and Writer will need to be implemented on the module side.
  3. I would strongly insist that Reader and Writer use a checkpointed state of the DB.

It may be as simple as adding a hook for the state-sync system to tell our module to checkpoint, then our module can put the portable checkpoint into the multistore.

  4. Once the snapshot is completed, the node will use the standard mechanism to replay transactions from the last block in the snapshot until now.

Also, once the snapshot is restored, we would need another hook to tell our module to reconstruct the optimised databases from the current multistore before the transaction replay begins.

Does that sound more palatable?

robert-zaremba commented 4 years ago

I wouldn't go through multistore. As @erikgrinaker mentioned, a much better design is to register additional readers / writers to the snapshot manager. This would function similarly to storing things into IAVL.

michaelfig commented 4 years ago

@michaelfig to help prioritize this, could you assess your level of support: how much do you need this, and by when?

As of next week, our testnet will be on the 0.40 branch, but we won't be able to implement state-sync until these hooks exist. So, new validators will be unhappy if they have to sync from scratch, but it won't be any worse than our current testnet. I will need to disable state-sync while that's the case.

The need for state-sync will increase as time goes on if we have any long-lived epochs.

ethanfrey commented 3 years ago

This affects wasmd and all chains that will import it, so I guess it's less and less niche.

We just ran into this issue with a full node that was state-synced not being able to handle any queries or txs to wasm modules. We store the full wasm blobs (~200-300 KB each) on disk outside of the IAVL tree for performance reasons. This has worked fine for everything until now, but not state sync.

It seems this is frozen. What is needed to support this? Is it a hook in the SDK? In Tendermint? Who is in charge of prioritizing these cross-repo issues?

tac0turtle commented 3 years ago

This needs to be done in the SDK. It mainly involves extending https://github.com/cosmos/cosmos-sdk/blob/e2f510afcc3478edab4a7d8d786dccbec333dddd/snapshots/store.go

zmanian commented 3 years ago

Brain dump of what needs to be done

  1. You need to break the current invariant that all state needed to state sync is committed to in AppHash in the Tendermint header for every block. Thus only some snapshot configurations will be valid and others will panic the node, and this should probably be checked on startup.

  2. There needs to be a system for providing additional objects and proofs to AppHash on blocks where a snapshot is expected to be valid (a sketch follows below). This has to be done in consensus, i.e. every full node must compute the app hash for all state-syncable state.

  3. The SnapshotManager needs to be able to learn that certain paths are resolved by getting data from an alternative to the IAVL-based MultiStore, and it needs to be able to resolve these paths.
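To make point 2 concrete, here is a hedged sketch of one way a module could fold its external state into AppHash at snapshot heights: hash a deterministic checkpoint of the external database and write that hash into the module's IAVL store in the end-blocker. The Keeper, ExternalDB, and Checkpoint names are hypothetical.

```go
package mymod

import (
	"crypto/sha256"

	sdk "github.com/cosmos/cosmos-sdk/types"
)

// snapshotInterval must match the node's snapshot configuration.
const snapshotInterval = 1000

// ExternalDB is a hypothetical handle to the module's own database.
type ExternalDB interface {
	// Checkpoint returns a deterministic serialization of the external
	// state as of the given height.
	Checkpoint(height int64) []byte
}

type Keeper struct {
	storeKey   sdk.StoreKey
	externalDB ExternalDB
}

// EndBlock writes a hash of the external state into the IAVL store at
// snapshot heights, so that AppHash commits to the external data too.
func (k Keeper) EndBlock(ctx sdk.Context) {
	if ctx.BlockHeight()%snapshotInterval != 0 {
		return
	}
	sum := sha256.Sum256(k.externalDB.Checkpoint(ctx.BlockHeight()))
	ctx.KVStore(k.storeKey).Set([]byte("external-state-hash"), sum[:])
}
```
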

robert-zaremba commented 3 years ago

Architecture Call Notes

We discussed this issue during our Architecture Call. Notes.

robert-zaremba commented 3 years ago

It looks like this feature won't go into v0.43, because we want to release that ASAP and we don't have any finalized design/API for this.

If there is strong support, we can redo the v0.44 planning, break it in two, and release this a bit earlier.

ethanfrey commented 3 years ago

Sorry to hear this, but I don't want to block everyone else's release over this. I would hope that this remains a high priority for design in the post-0.43 world.

tac0turtle commented 3 years ago

This is technically a non-breaking feature. Could we do it in 0.43.1?

robert-zaremba commented 3 years ago

This is technically a non-breaking feature. Could we do it in 0.43.1?

This is only a patch release, normally without features. I guess the best thing would be to do a smaller 0.44 release than what was originally planned. cc: @clevinson

ethanfrey commented 3 years ago

This is only a patch release, normally without features.

Since the SDK is pre-1.0:

0.43.0 -> 0.43.1 is generally treated like a minor release, meaning non-breaking, not "bug fix only". Once you claim 1.0, you get the extra point to distinguish between patch and minor releases.

In general, people would expect a 0.43 -> 0.44 release to be state-machine breaking and likely (majorly?) API breaking.

At least this is what observers have seen and expected from Cosmos SDK numbering.

(It can be either 0.43.1 or 0.44.0, but I don't see the need to bump to 0.44 if it is not state-machine breaking.)

clevinson commented 3 years ago

@iramiller from Provenance shared today in the Cosmos SDK Community Dev call that this issue was also a priority for their team to be addressed.

Agreed that we should be looking at addressing this soon after v0.43. I think it's a strong candidate, and one of the issues in our backlog with the most requests from chain developers (Agoric, CosmWasm, and Provenance have all voiced a need for this).

webmaster128 commented 3 years ago

  • You need to break the current invariant that all state needed to state sync is committed to in AppHash in the Tendermint header for every block. Thus only some snapshot configurations will be valid and others will panic the node, and this should probably be checked on startup.

  • There needs to be a system for providing additional objects and proofs to AppHash on blocks where a snapshot is expected to be valid. This has to be done in consensus, i.e. every full node must compute the app hash for all state-syncable state.

FWIW, we address those kinds of problems the following way in CosmWasm: wasm blobs are stored in state next to the IAVL tree, as files on disk. However, those blobs are identified by their hash, and this hash is stored in the IAVL tree. On every access of a blob, the file's integrity is checked by hashing it while/after reading. So indirectly, all file content is part of the app hash as well. Files can be missing or corrupted, but any interaction with a corrupted file will fail. This way you can have data availability issues, but no consensus issues.
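A condensed sketch of that pattern (illustrative, not wasmd's actual code):

```go
package wasm

import (
	"bytes"
	"crypto/sha256"
	"errors"
	"fmt"
	"os"
	"path/filepath"
)

// loadVerifiedBlob reads a wasm blob from disk and checks it against the
// hash recorded in the IAVL tree. A missing or corrupted file fails the
// read (a data availability issue) without breaking consensus.
func loadVerifiedBlob(dir string, expectedHash []byte) ([]byte, error) {
	path := filepath.Join(dir, fmt.Sprintf("%x.wasm", expectedHash))
	blob, err := os.ReadFile(path)
	if err != nil {
		return nil, err
	}
	sum := sha256.Sum256(blob)
	if !bytes.Equal(sum[:], expectedHash) {
		return nil, errors.New("wasm blob corrupted: hash mismatch")
	}
	return blob, nil
}
```
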

zmanian commented 3 years ago

How does the node sync the WASM blobs?

webmaster128 commented 3 years ago

They are uploaded as part of a transaction. We compress them with gzip and store them in a bytes field, which is then 100-300 KB large. That works reasonably well, except when trying to sign with Ledger.
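For illustration, the compression step is plain gzip over the raw blob; a sketch (how the compressed bytes end up in the message is wasmd-specific and not shown here):

```go
package wasmutil

import (
	"bytes"
	"compress/gzip"
)

// gzipWasm compresses a wasm blob for embedding in the bytes field of a
// store-code transaction, as described above.
func gzipWasm(blob []byte) ([]byte, error) {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	if _, err := zw.Write(blob); err != nil {
		return nil, err
	}
	if err := zw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}
```
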

zmanian commented 3 years ago

But if a node state syncs to the current state, it needs some way to sync the wasm blobs that it needs. How does it get them?

webmaster128 commented 3 years ago

How does it get them?

Getting them by replaying the transaction history is the only way currently. Thus this ticket, I guess.

ethanfrey commented 3 years ago

But if a node state syncs to the current state, it needs some way to sync the wasm blobs that it needs. How does it get them?

This is exactly the issue Agoric and CosmWasm have. We ask the SDK team to provide us hooks to let us sync data that is not in the IAVL tree.

I don't know how state sync is implemented, but if some extension point is exposed with docs, I will make our end work with it.

robert-zaremba commented 3 years ago

NOTE: We are also creating a hook system based on events: https://github.com/cosmos/cosmos-sdk/discussions/9656

ethanfrey commented 3 years ago

This is for hooks between modules (like staking and slashing), right?

We don't need that here - this is about allowing the app to customise how state sync works. I think this is just a terminology confusion, with no overlap in design/functionality.

aaronc commented 3 years ago

Yes, these hooks are two completely different things.

For state sync, we can store the hash of something in the state tree but we still need a way to deliver it over state sync. We're happy to include this, someone just needs to work on a design. So far we have not had the bandwidth but it is still in our backlog.

michaelfig commented 3 years ago

For state sync, we can store the hash of something in the state tree but we still need a way to deliver it over state sync. We're happy to include this, someone just needs to work on a design. So far we have not had the bandwidth but it is still in our backlog.

After running Agoric's latest incentivised testnet phase without this feature, we had a lot of pain. Our chain is expected to have a steady state with more work scheduled than fits in the current block, so we have high CPU utilisation, which makes it hard to catch up if you didn't have your node running from genesis. Even as we continue to optimise our VM and take advantage of multiprocessing, that only widens the bottleneck so more requests are processed faster, which will just lead to even higher utilisation as the market reacts to the availability of compute power.

I kludged around it by rsyncing .appd/data directories between nodes and starting from there. That still took a half-hour or so for an RPC node to catch up because the state was stale before I was done transferring it. Agoric has recently implemented heap snapshots for our VM, so we have all the data needed for state sync, just no way to express it yet.

It is becoming increasingly urgent for us to have this feature before we can launch our mainnet. I realise that in this project (like many), pressure rarely helps things along, but putting in the work helps things snowball.

Here's a HackMD doc in which we can hash out a strawman.

Tagging: @ethanfrey, @iramiller, @JimLarson (currently on holiday). I'd like to solicit your contributions especially to help ensure we're all in alignment. I hope we'll have something to put on the next architecture meeting's agenda.

ethanfrey commented 3 years ago

Thank you for moving this forward. I just did a quick read of your doc, and here is my first feedback:

Create an sdk.Context corresponding to that chosen multistore state, block height, etc.

Call BeginReadState(ctx sdk.Context) on (example) x/mymod with that Context

Context is the wrong object here, as it maintains a mutable multistore. We need a different but similar object that returns a reference to a read-only multistore at a given height.

I would assume the module could use that to determine what data to sync (for wasmd, we store all hashes in the multistore and can look up the original data by hash).
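One possible basis for such a read-only, height-pinned view is the versioned cache that historical queries already use; a rough sketch (assuming the commit multistore supports CacheMultiStoreWithVersion, with the surrounding hook names hypothetical):

```go
package snapshothooks

import (
	"github.com/cosmos/cosmos-sdk/baseapp"
	storetypes "github.com/cosmos/cosmos-sdk/store/types"
)

// readOnlyStateAt returns a KVStore view pinned to the given height.
// Writes would land in a throwaway cache, so committed state is safe.
func readOnlyStateAt(app *baseapp.BaseApp, height int64, key storetypes.StoreKey) (storetypes.KVStore, error) {
	view, err := app.CommitMultiStore().CacheMultiStoreWithVersion(height)
	if err != nil {
		return nil, err // e.g. the height has been pruned
	}
	return view.GetKVStore(key), nil
}
```
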

ethanfrey commented 3 years ago

I have, however, not looked at the "normal" state sync implementation.

I also have no idea what happens if the IAVL prunes the height we are syncing while a sync is in progress. That should be defined, or some sort of lock added to avoid it.

michaelfig commented 3 years ago

Context is the wrong object here, as it maintains a mutable multistore. We need a different but similar object that returns a reference to a read-only multistore at a given height.

"Different but similar" is a kind of warning bell for me. I was modelling this off of the fact that RPC queries are passed an sdk.Context, even though they cannot write to the multistore.

I would assume the module could use that to determine what data to sync (for wasmd, we store all hashes in the multistore and can look up the original data by hash).

Yes, this is absolutely the intent.

I have, however, not looked at the "normal" state sync implementation.

Me neither. :blush: I should probably do that soon.

I also have no idea what happens if the IAVL prunes the height we are syncing while a sync is in progress. That should be defined, or some sort of lock added to avoid it.

Since state sync only runs if there is no existing history, I'd be really surprised if pruning started happening before state sync is over and we enter the normal cycle.

ethanfrey commented 3 years ago

Since state sync only runs if there is no existing history, I'd be really surprised if pruning started happening before state sync is over and we enter the normal cycle.

I meant if pruning started on the node serving the data. This would apply to normal state sync too.

We clearly should look at the design of that more before extending it. I am back from vacation and too busy this week.

michaelfig commented 3 years ago

I meant if pruning started on the node serving the data. This would apply to normal state sync too.

Oh. State sync avoids this by asserting settings from app.toml, so that snapshots are only taken for blocks that are never pruned.

We clearly should look at the design of that more before extending it.

I've spent some time looking through this more carefully. Will update the doc with what I learn.

the-frey commented 3 years ago

Hi gang, late to the party here. We've been talking about this today in the Juno camp as obviously this is pretty important to the long-term stability of CosmWasm chains.

Juno core would like to offer a handsome bounty for getting this issue closed. That includes the design involved to get it done.

Sorry if this isn't quite the right place to talk about bounties, but we'd really like to see this unstuck, and we appreciate that it's not trivial to solve (and everybody is busy!)

ethanfrey commented 3 years ago

Who worked on the original state sync design?

I have no real idea there. Waiting for that "someone" to build some hooks, so we could just return the wasm blobs when needed.

Looking at https://github.com/cosmos/cosmos-sdk/blame/e2f510afcc3478edab4a7d8d786dccbec333dddd/snapshots/store.go it seems @erikgrinaker did most of this. He no longer seems to be working on Cosmos.

Hey Erik, would you tackle this for a nice set of Juno tokens? Terra might be willing to chip in as well. (They asked me for this multiple times and know about this issue as well)

ethanfrey commented 3 years ago

Yeah, he is all over the place in this code: https://github.com/cosmos/cosmos-sdk/blame/e2f510afcc3478edab4a7d8d786dccbec333dddd/snapshots/manager.go

I guess this explains why no one works on it. The original author is gone.

erikgrinaker commented 3 years ago

Hey Erik, would you tackle this for a nice set of Juno tokens? Terra might be willing to chip in as well. (They asked me for this multiple times and know about this issue as well)

Afraid I don't have any spare cycles for this, my current gig keeps me pretty busy.

It shouldn't be a huge lift though: a couple of hooks to fetch/apply a stream of binary state data, and a couple of snapshots.SnapshotItem Protobuf message types to embed them in the state sync snapshot Protobuf stream along with some metadata. Designing the API/protocol would be the bulk of the work.
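In app wiring, that could end up looking something like the snippet below; RegisterExtensions and the two constructor names are purely illustrative (building on the interface sketched earlier in this thread), not existing API:

```go
// Hypothetical registration of module snapshotters in app.go.
app.SnapshotManager().RegisterExtensions(
	wasmkeeper.NewWasmSnapshotter(app.WasmKeeper),       // streams wasm blobs
	swingset.NewSwingSetSnapshotter(app.SwingSetKeeper), // streams heap snapshots + transcripts
)
```
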

aaronc commented 3 years ago

Thanks @the-frey 🙏. That's great to hear!

zmanian commented 2 years ago

IMO this is a blocker for CosmWasm on the Hub, so it needs to be a near-term priority.