What to do about IAVL? - Githubissues

aaronc commented 3 years ago

This is linked to meta-issue #7096.

Summary

This issue discusses potential approaches to the ongoing maintenance or replacement of IAVL.

Problem Description

It our understanding that IAVL has effectively become an orphaned project within the Cosmos ecosystem. I don't have a good understanding of all of the issues related to IAVL, but welcome those with more context to share. /cc @ethanfrey @erikgrinaker @alexanderbez

Proposals

Overhaul IAVL

A complete overhaul has been proposed as one solution. I don't have the context fully for why this is needed, but would appreciate those who do to share those details here.

Replace IAVL

Replacing IAVL with a more mature alternative has been proposed as well. I'll just list out the alternatives that have been mentioned here:

https://github.com/oasisprotocol/oasis-core/tree/master/go/storage/mkvs
- @zmanian said: "It appears to support the same core features including state sync. I suspect it's of higher engineering quality and has a maintainer."
https://github.com/lazyledger/smt
- https://twitter.com/musalbas/status/1294624434226167810?s=21
- has no GC/pruning
- /cc @hxrts @marbar3778 @liamsi
https://github.com/yihuang/libra-jellyfish-merkle
- /cc @hxrts
- https://github.com/crypto-com/chain/blob/6d22c38c62778a60a77324e86d12a000171fb3b0/architecture-docs/adr-003.md

robert-zaremba commented 3 years ago

@musalbas , We would need to do simulation. I think it's hard to estimate the SMT part.

AVL & Merk tree expected lookup time is log(n), worst case is 2log(n) (n=number of keys)
AVL & Merk lookup time = 1 is only for exactly one key.
SMT expected lookup time (assuming we have correctly randomized keys), is also log(n).
SMT worst case lookup time is 256=log(N) (N=domain size) -- but that should happen only to few keys (if any).

Re: storing nodes by key: this will depend on the key convention, and an average key size.

For Merk is avg_key_size*n
For SMT it's n/c * log(n) for some constant c>1 So the question is: avg_key_size <? log(n)/c

robert-zaremba commented 3 years ago

For c=2.5 and n=2^25 we have avg_key_size <? 10
For c=2 and n=2^30 we have avg_key_size <? 15

I have no idea what the c could be except that it decreases when n increases.

robert-zaremba commented 3 years ago

An interesting paper has been recently published for storage commitments, which could be an alternative for SMT: Using Homomorphic hashes in coded blockchains

tarcieri commented 3 years ago

I see this issue's description already links to libra-jellyfish-merkle, but see also Compact Sparse Merkle Trees, which are a very similar idea:

https://osf.io/8mcnh/

hxrts commented 3 years ago

I guess github references from discussions → issues don't work just yet, but there is some discussion of this wrt parallelism here: https://github.com/cosmos/cosmos-sdk/discussions/8134#discussioncomment-229639

Also here's an ethresearch thread on the compact SMT paper tony posted above https://ethresear.ch/t/compact-sparse-merkle-trees/3741

UnitylChaos commented 3 years ago

I think under the propose separation of concerns (State storage as separate from storage commitments), it would be highly beneficial to consider storage solutions other than simple key/value stores. In particular classical SQL database systems.

For my own work I've been coming to the conclusion that it would be REALLY nice to have the state stored in an SQL database, which would allow for the use of multiple indexes and join queries in transaction processing. My understanding is that the mature SQL databases are extremely well supported, highly efficient, with built in support for caching, indexes, ACID compliance, complex queries, etc.

They may not be as fast for raw lookups as a simple key/value store, but I would bet any production grade SQL db would out perform IAVL/goleveldb (or any other Merkle tree on top of a kv store) for most modules. Especially any that make use of iterators, since AFAIK each step of the iterator requires (at minimum) another lookup (log(n)) in the nodedb. Whereas getting a series of results from an SQL table is done in a single query and is almost certainly O(log(n)) if the iterated key has an index.

An SQL DB would also trivialize the whole "MultiStore" process, by just using multiple tables, rather than a single tree with key prefixes. Table names could easily be prefixed by module name to create the same sort of module based isolation within a shared store.

The problem of storage commitments/proofs could then be dealt with separately, with the committed data being something like: "tablename : serialization_of_row" (or the hash of same). The SQL DB could also be used as the backend for an SMT / Merkle based commitment scheme, though long term it would probably be most efficient to use a cryptographic accumulator which isn't directly tree based. I think this could also be made most efficient if updates to the store happen in transaction processing phases, while changes to the storage commitment are queued up and processed in a batch in the commit phase.

I am currently working on a python based proof of concept for implementing an SMT (with batch inserts) on top of sqlite. If that experiment goes well, and there is at least some support for this path, I will likely try to build such a storage service. It would probably be in Haskell, as I'm hoping to use this along with kepler.

I also think it's worth planning ahead of time for cross language usage. ie whatever storage system / storage commitment scheme the Cosmos-SDK ends up using should be accessible through a separate service (such as gRPC) so that other language SDK implementations can reuse the same service, and so that replacement services could be constructed later which satisfy the same interface.

alexanderbez commented 3 years ago

We might just be the first blockchain framework that uses RDBMS 😆 . How would merkle proofs work? Would those come from the storage commitment layer? And if so, how is that tied to the relational/logical storage layer? i.e. how do you prove what's in SQL is actually there and part of a structure?

UnitylChaos commented 3 years ago

How would merkle proofs work? Would those come from the storage commitment layer?

Yes, the idea would be for the service to keep the storage commitment layer in sync with what is stored in the database tables. I'm thinking of the storage commitment layer as an abstract cryptographic accumulator. ie something which acts like a Set, with insert/remove operations and a way to construct proofs of membership/non-membership which can be checked against either the accumulator itself, or some digest of it (root hash in the case of a merkle tree style accumulator).

And if so, how is that tied to the relational/logical storage layer? i.e. how do you prove what's in SQL is actually there and part of a structure?

To keep them in sync, at the end of every block, the storage layer would need to send a set of additions/deletions to update the accumulator with. These would be hashes of the serializations of table rows (ie what a CSV file stores), prefixed with the name of the table they were from. So for example:

ADD:
Hash("bank_accounts : *address1* , *coins*")
Hash("bank_accounts : *address2* , 0")
DELETE:
Hash("bank_accounts : *address2*, *coins*")

Would be the result of a full transfer from address2 to address1.

So proving that some value is in the accumulator would be equivalent to a proof that the specified table has a row with that value.

It's basically like having two stores, one which is fast for inserting data and lookups (sql). And another one which doesn't have to be able to return the inserted data at all, but which is fast for creating/verifying proofs that a value was inserted (accumulator).

UnitylChaos commented 3 years ago

This would also obviate the need for #7099 and #7098, and simplify #7097.

tarcieri commented 3 years ago

We might just be the first blockchain framework that uses RDBMS

@alexanderbez Chain Core was using Postgres circa 2016, FWIW. Seemed to work pretty well there, and had some nice cloud deployment options (e.g. Amazon Aurora)

hxrts commented 3 years ago

Who would like to take a first stab at an ADR? There are a lot of stakeholders, but I suggest one person own the ADR and solicit further design criteria and feedback from those who have contributed to this thread. If we are indeed going with SMT, perhaps someone from LazyLedger since they already have an initial implementation?

liamsi commented 3 years ago

Yeah, we can surely take the first stab on an ADR and collect the input and feedback from the different stakeholders. @tzdybal started looking into replacing iavl with an SMT in the SDK and is interested in taking the lead on this :-)

A shared slack or discord channel with all parties involved could also be helpful. We'd like to understand the different use-cases and priorities better. Should we go ahead and create a shared slack/discord, too?

alexanderbez commented 3 years ago

Let's start a channel in Discord.

liamsi commented 3 years ago

Great! I can't start a discord channel myself but looking forward to it 👍🏼

BTW regarding that SQL / RDBMS discussion above:

We might just be the first blockchain framework that uses RDBMS

it might not be a "classical" blockchain but google's trillian also comes with a mysql based storage backend (among others). As far as I remember it is not meant to be used in production as it is rather slow but it works fine for testing and developing (and everything is authenticated as with the other backends).

Ideally, the APIs / interfaces we settle on would even allow RDBMS backends (although I'm skeptical about whether we'd want to recommend that).

robert-zaremba commented 3 years ago

Also here's an ethresearch thread on the compact SMT paper tony posted above https://ethresear.ch/t/compact-sparse-merkle-trees/3741

@hxrts - when doing a report I went through that thread. IIRC, LL SMT is a realization of that idea.

robert-zaremba commented 3 years ago

@hxrts , @liamsi -- I spent good amount of time analyzing this issue and looking for a good solution. So let me outline the thoughts coming from researching this subject. Happy to contribute. If you want I can kick off and port the design described below into ADR.

With @aaronc we were thinking about SQL data base with regards to: https://github.com/cosmos/cosmos-sdk/issues/7099 With Vulcanize, we started a proof of concept for streaming data from Cosmos Node to Postgresql. In Regen I'm doing another MVP for building a whole relational database (also using Postgres) for higher level use.

For the usecase linked above (https://github.com/cosmos/cosmos-sdk/issues/7099) we started with an approach where each message / data will have a corresponding schema (https://github.com/cosmos/cosmos-sdk/issues/7097 and ADR-038) to properly assign a merkle proof to an off-chain data. I was thinking about an alternative design, influenced by a TurboGeth work I inspected few months ago with respect to the report I created for this task (https://github.com/cosmos/cosmos-sdk/issues/7100#issuecomment-685211612) and the general approach (https://github.com/cosmos/cosmos-sdk/issues/7100#issuecomment-676527835)

Decoupling state commitment from storage

State commitment is using it's own storage (could use the same database underneeth) from the State Machine store. In case of SMTs
State machine store is a simple KV store (could be a RDBMS -> read below)

Canonical IDs

Today, only a module knows about the ID (key) for every state object. The ADR-038 tries to solve it. I am trying to find another way to solve it by using canonical IDs - an ID constructed directly from a Message.

The idea is to use a QueryServer service name (eg "cosmos.bank.v1beta1.Msg") and combine it with a hash of a serialized object and maybe some other prefix. And we hash the whole thing. Example:

infix := "<container name>"
key := hash("cosmos.bank.v1beta1.<MsgName>.<infix>.<obj_hash>")

I'm not 100% sure if this work, but if not we fall back into the Module Schema idea (which is driven by the current way we store object with sub-stores).

If we stick with the "current" key construction, then I would strongly advocate to hash the key before storing it.

SMT for state commitments.

This part is more obvious:

SMTkey := Canonical Object ID using one of the algorithms from the previous section
SMTvalue := hash(object).

Finally we store a (SMTkey, SMTvalue) pair in a SMT (I'm skipping here details of how this operation works, eg constructing internal nodes etc...)

Data Store

Here we need few things:

mapping from hash(object) -> object
mapping SMTkey -> object (crucial to avoid traversing SMT if we just want to query objects in State Machine).
Garbage collection
Version management

Storage engine

KV store are order of magnitude faster than RDBMS. TurboGeth is using KV based indexing (so at the end they have 3 stores).

Based on all the existing blockchain implementations, I would suggest to start with porting SMT to Cosmos, update the existing KV interface, and only later consider experimenting with RDBMS. For more sophisticated use-cases, as explained above, we will need a separate RDBMS anyway. We don't want to kill node with analytics operations.

robert-zaremba commented 3 years ago

We might just be the first blockchain framework that uses RDBMS

Algorand is using SQLite as their main state storage (not sure if they are using anything else in fact). I outlined this in overview of storage backends.

UnitylChaos commented 3 years ago

Okay, I can totally buy that a KV store would be an order of magnitude faster than RDBMS for looking up an object by hash. And for modules where that is the access pattern that fully makes sense.

For example in my project I have Utxo style transactions, where a transaction specifies a set of hashes of the existing Utxos it will spend, and a set of new Utxos to be created. Storing existing Utxos by hash in a KV store would certainly be faster than using an SQL table with a hash primary key. (I would be curious to know exactly how much faster it be, say to compare in process sqlite with an in process KV store)

My desire to use something which supports full SQL queries was about modules/access patterns which require multiple indexes or joins. For example, an order book based DEX, might have a storage schema like: CREATE TABLE orders (creator address, asset AssetExpression, bid Boolean, limit Integer, creation Integer, expiry Integer) Where determining the clearing price for a given asset at a given time requires looking up all orders whose creation/expiry time contains the time, and the asset matches. So like, SELECT * FROM orders WHERE asset = *AssetExpression* AND creation<=time AND expiry>=time. This could be then joined to Utxo or Account data to check that balances are sufficient for these trades.

This can of course be replicated on a KV store by manually constructing extra indexes, and using iterators / repeated lookups to simulate joins. AFAIK this is what Cosmos-SDK is currently doing in modules like staking, and is the intention of #7098. This is fine, but it's a lot of extra engineering work to basically replicate what a RDBMS is designed to do.

P.S: Also even in the Utxo example, it would be good to have a secondary index on owner address so that an account like view could be constructed.

robert-zaremba commented 3 years ago

The most important thing is to Decoupling state commitment from storage . Once this will be done, we can have different storage backends (or even 2 at the same time) and experiment further.

robert-zaremba commented 3 years ago

There was a need to use chat / discussion for getting into more details. Let's move here if we need to discuss anything, and keep this issue only for important notes related to decisions.

So, here is the Github discussion

i-norden commented 3 years ago

For datastore I’m partial to LMDB or BadgerDB for all the reasons mentioned in the review doc. Faster, feature rich, and MVCC. Having a hard time distinguishing between those two, it looks like BadgerDB has the performance edge. It’s interesting to me that BadgerDB appears to be more popular and battle tested outside of the blockchain space but it appears no popular blockchain has decided to use it (as their default, anyways)? I’m curious if turbo-geth compared those two. Turbo-geth also uses an ETL preprocessing approach to sort keys before writing to the DB in order reduce write amplification, without this I am curious how much more performant LMDB actually is vs LevelDB.

I have sort of knee-jerk reaction to using an SQL DB as the primary storage due to performance reasons. An interface (e.g. tm-db) designed for operating on-top of kvstores will "work" for SQL databases but I don't think it will be performant enough for practical use. I can say from experience that substituting Postgres for LevelDB in go-etheruem does not work without a major access pattern overhaul, although SQLite should be significantly faster than Postgres for simple operations. I think the primary DB being embedded is also crucial, that would disqualify most SQL DBs but not SQLite. SQL seems "less risky" as a secondary database, populated by means of ADR 038 etc.

For state commitments, I agree SMT looks like the best option! One question, why wouldn't the value in the SMT be the object itself instead of the hash of the object?

Still thinking about the Canonical IDs.

tarcieri commented 3 years ago

Note that several SQL databases expose some sort of high-performance key/value API which skips the SQL query engine.

For example MySQL implements the memcached protocol, and this is also supported on Amazon RDS (although unfortunately not on GCP Cloud SQL).

Postgres has hstore.

robert-zaremba commented 3 years ago

Let's continue the discussion in https://github.com/cosmos/cosmos-sdk/discussions/8297 , I will respond there to @i-norden question.

faddat commented 3 years ago

https://github.com/handshake-org/urkel

?

cosmos / cosmos-sdk

What to do about IAVL? #7100

Summary

Problem Description

Proposals

Overhaul IAVL

Replace IAVL

Decoupling state commitment from storage

Canonical IDs

SMT for state commitments.

Data Store

Storage engine