cosmos / cosmos-sdk

:chains: A Framework for Building High Value Public Blockchains :sparkles:
https://cosmos.network/
Apache License 2.0

What to do about IAVL? #7100

Closed · aaronc closed this issue 3 years ago

aaronc commented 3 years ago

This is linked to meta-issue #7096.

Summary

This issue discusses potential approaches to the ongoing maintenance or replacement of IAVL.

Problem Description

It is our understanding that IAVL has effectively become an orphaned project within the Cosmos ecosystem. I don't have a good understanding of all of the issues related to IAVL, but I welcome those with more context to share. /cc @ethanfrey @erikgrinaker @alexanderbez

Proposals

Overhaul IAVL

A complete overhaul has been proposed as one solution. I don't fully have the context for why this is needed, but I would appreciate those who do sharing the details here.

Replace IAVL

Replacing IAVL with a more mature alternative has been proposed as well. I'll just list out the alternatives that have been mentioned here:

robert-zaremba commented 3 years ago

I think there are two separate issues:

  1. Efficient storage -- this can be done with RocksDB or something similar. Libra, for example, is using RocksDB for their storage module - in fact it's a wrapper on top of RocksDB to deal with the serialization of keys and values (we need to store multiple types of data, and key-value pairs in RocksDB are byte arrays). This wrapper enforces that all data in and out of the DB is structured according to predefined schemas (see the sketch after this list).

  2. Storage proofs / storage commitments. To prove a state transition we need to be able to prove transaction inputs and outputs. This is usually done by committing to the whole state using a Merkle root.
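To make the wrapper idea in point 1 concrete, here is a minimal Go sketch of a schema-enforcing layer over a raw byte-array store; all names here are illustrative assumptions, not Libra's actual API:

```go
package store

import "encoding/binary"

// Codec enforces a predefined schema for one table's values (assumed shape).
type Codec interface {
	Marshal(v interface{}) ([]byte, error)
	Unmarshal(data []byte, v interface{}) error
}

// RawKV is the untyped byte-array store underneath (e.g. RocksDB).
type RawKV interface {
	Get(key []byte) ([]byte, error)
	Set(key, value []byte) error
}

// Table is a typed view over one data type living in the shared KV store.
type Table struct {
	prefix byte  // distinguishes this data type from others, e.g. 0x01 = accounts
	codec  Codec // enforces this table's value schema
	db     RawKV
}

func (t Table) key(id uint64) []byte {
	k := make([]byte, 9)
	k[0] = t.prefix
	binary.BigEndian.PutUint64(k[1:], id)
	return k
}

// Put serializes v according to the table's schema before writing it.
func (t Table) Put(id uint64, v interface{}) error {
	bz, err := t.codec.Marshal(v)
	if err != nil {
		return err
	}
	return t.db.Set(t.key(id), bz)
}

// Get reads and deserializes into out, enforcing the schema on the way out too.
func (t Table) Get(id uint64, out interface{}) error {
	bz, err := t.db.Get(t.key(id))
	if err != nil {
		return err
	}
	return t.codec.Unmarshal(bz, out)
}
```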

robert-zaremba commented 3 years ago

Depending on the use-case, it would be good to have a closer look at alternatives to Merkle trees for storage commitments. I will review the following papers and write a summary next week.

  1. Pointproofs: Aggregating Proofs for Multiple Vector Commitments. It provides non-interactive aggregation of proofs across multiple commitments. This construction enables any third party to aggregate a collection of proofs with respect to different, independently computed commitments into a single proof represented by a 48-byte elliptic curve point. I believe this could be very useful for light clients and all sorts of batch processes.

  2. Merkle Mountain Range -- a Merkle tree optimized for lists of sequentially appended data and its proofs.

alexanderbez commented 3 years ago

I think we're explicitly avoiding RocksDB for various reasons. I do agree we should explore alternatives in logical storage. MMR, sparse Merkle trees, and Urkel trees all look like good contenders. One thing to keep in mind is that we need:

  1. prefix iteration
  2. range proofs
  3. inclusion/exclusion proofs
zmanian commented 3 years ago

IAVL is very problematic in its IO patterns. After Amino, it's probably the largest Cosmos SDK performance bottleneck. It also uses a lot of indirection that makes lookups very inefficient.

robert-zaremba commented 3 years ago

@alexanderbez thanks for the comment. Is there any resource summarizing your decision about RocksDB? Also, what's the use case for doing prefix iteration?

aaronc commented 3 years ago

Prefix iteration is used extensively throughout the SDK. It's how we do index scans.
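For illustration, a minimal sketch of what such an index scan looks like; the interfaces and helper below are modeled loosely on the SDK's KVStore iterator pattern, not its exact API:

```go
package keeper

// Iterator and KVStore are illustrative stand-ins for the SDK's store types.
type Iterator interface {
	Valid() bool
	Next()
	Key() []byte
	Value() []byte
	Close()
}

type KVStore interface {
	// Iterator returns entries with start <= key < end, in key order.
	Iterator(start, end []byte) Iterator
}

// prefixEnd returns the smallest key strictly greater than every key that
// begins with prefix (the prefix with its last byte incremented).
func prefixEnd(prefix []byte) []byte {
	end := append([]byte(nil), prefix...)
	for i := len(end) - 1; i >= 0; i-- {
		if end[i] != 0xff {
			end[i]++
			return end[:i+1]
		}
	}
	return nil // all bytes were 0xff: the scan is unbounded on the right
}

// scanPrefix visits every (key, value) pair stored under prefix - the
// "index scan" shape used throughout the SDK's modules.
func scanPrefix(store KVStore, prefix []byte, visit func(k, v []byte)) {
	it := store.Iterator(prefix, prefixEnd(prefix))
	defer it.Close()
	for ; it.Valid(); it.Next() {
		visit(it.Key(), it.Value())
	}
}
```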

hxrts commented 3 years ago

@marbar3778 mentioned the other day that our aim to be database-agnostic is also causing an additional burden on state tree update optimisation. We may want to consider limiting database support in conjunction with this issue to leverage database-native efficiencies.

cc @musalbas if you have thoughts on use of jellyfish here

tac0turtle commented 3 years ago

I think we're explicitly avoiding RocksDB for various reasons.

RocksDB is supported. I would not say we are avoiding it, because we have done next to zero database configuration testing. It is still unclear whether goleveldb is better for the SDK than the others (badgerdb, boltdb, rocksdb). There are a couple of users who do use RocksDB.

The Cosmos-SDK, IAVL, and Tendermint have support for 5 databases. Since the Cosmos-SDK and hopefully Tendermint are considered frameworks for building blockchains, an agnostic approach to supported databases was taken. This is nice but causes some headaches in optimizations. We can provide sane defaults for various dbs, but this leads to more testing, of which we currently have none.

I do agree with:

I think there are two separate issues:

Efficient storage should be a separate issue and have its own discussion. I opened an issue (https://github.com/cosmos/cosmos-sdk/issues/6406) to discuss supporting fewer dbs based on testing that needs to be done. This can be a start.

alexanderbez commented 3 years ago

Ohh, I never said it wasn't supported; just from my experience, there are much better alternatives (e.g. LevelDB, Badger, etc.).

One thing to keep in mind as well is that there are libraries out there that are very well supported, tested, and benchmarked. Continuing to have to maintain IAVL, especially when it's sub-optimal at best, is probably not the best strategy.

It would be great if someone would take the lead on exploring alternatives.

musalbas commented 3 years ago

Generally, it seems that the blockchain space is heading towards Sparse Merkle trees (SMT) for state trees - including Ethereum 2.0 (although it seems they're now leaning towards polynomial commitments), Libra (jellyfish), LazyLedger, and also Google's Trillian. However, all of these projects use slightly different designs for the tree, so these implementations are not necessarily interchangeable. Libra and LazyLedger use a compute-optimised SMT, replacing subtrees that contain only default values with a single node; Ethereum 2.0 plans to use a compute-optimised tree by using a more efficient hash function with a standard SMT; and Trillian uses a standard unoptimised SMT (afaik).

At LazyLedger we have settled on using the Libra SMT, and implemented it in Golang. Libra has an implementation in Rust called jellyfish, but it is slightly different as it operates on "nibbles" (where each key is split into sections of 4 bits each). In LazyLedger's SMT implementation the storage code is abstracted away behind a "MapStore" interface, so devs can define their own storage backend.

I'm not sure about the implementation complexity of IAVL trees, but I think SMTs are relatively simple and shouldn't be difficult to maintain. The core of the LazyLedger SMT implementation is 400 lines of code. We do store each node in the tree in the DB, though, so I'm not sure if that addresses the IO performance concerns.
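For reference, a sketch of what such a "MapStore" abstraction might look like in Go; the method set follows the description above, but the exact signatures are an assumption:

```go
package smt

import "fmt"

// MapStore is the minimal key-value interface the tree needs, so any
// storage backend can be plugged in underneath.
type MapStore interface {
	Get(key []byte) ([]byte, error)
	Set(key, value []byte) error
	Delete(key []byte) error
}

// SimpleMap is an in-memory backend, e.g. for tests.
type SimpleMap map[string][]byte

func (m SimpleMap) Get(key []byte) ([]byte, error) {
	if v, ok := m[string(key)]; ok {
		return v, nil
	}
	return nil, fmt.Errorf("key %x not found", key)
}

func (m SimpleMap) Set(key, value []byte) error {
	m[string(key)] = value
	return nil
}

func (m SimpleMap) Delete(key []byte) error {
	delete(m, string(key))
	return nil
}
```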

robert-zaremba commented 3 years ago

I think Ethereum 2.0's big bet is on polynomial commitments.

robert-zaremba commented 3 years ago

@musalbas - why did you settle on jellyfish?

musalbas commented 3 years ago

We're not using jellyfish, we're using our own SMT library, which adopts the same SMT design as specified in the Libra whitepaper. We chose this as it's the simplest tree design we're aware of with an O(log(k)) get/update cost. Vitalik's scheme using an optimised hash function is simpler, but there were some concerns around its security, and it still requires a hash operation every 16 levels in the tree.

alexanderbez commented 3 years ago

Does Jellyfish or LL's compute-optimized SMT allow for prefix iteration/range queries? If so, that seems like a solid bet.

musalbas commented 3 years ago

No, I don't think so, because the keys are hashed and so they appear in random places in the tree. A range query would thus simply consist of multiple Merkle proofs.

robert-zaremba commented 3 years ago

Prefix iteration can be done separately. It doesn't need to be done on the storage commitment unless you need proofs / commitments while doing the iteration. Range queries can be implemented with prefix iteration.
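A minimal sketch of that separation, with the architecture and names assumed for illustration: writes go to both an ordered, iterable storage layer and a commitment layer, so iteration is served by storage alone and proofs by the commitment tree.

```go
package state

// orderedKV is the iterable storage layer (e.g. a LevelDB-style store).
type orderedKV interface {
	Set(key, value []byte) error
	// ...plus Iterator(start, end): prefix and range scans are answered here.
}

// commitment is the proof layer (e.g. an SMT over hashed keys); it never
// needs to support iteration itself.
type commitment interface {
	Update(key, value []byte) error
	Root() []byte // the Merkle root committed in blocks
}

type state struct {
	storage orderedKV
	commit  commitment
}

// Set writes to both layers so storage and commitment stay in sync.
func (s *state) Set(key, value []byte) error {
	if err := s.storage.Set(key, value); err != nil {
		return err
	}
	return s.commit.Update(key, value)
}
```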

Where are the range proofs used? What's the use-case for them? Maybe this can be solved with aggregatable commitments (like the aforementioned pointproofs or RSA accumulators).

alexanderbez commented 3 years ago

@AdityaSripal do we have/need range proofs in the SDK? I do know we need to absolutely support range/prefix queries. However, I don't know what you mean by "A range query would thus simply consist of multiple Merkle proofs", @musalbas? Can we avoid hashing the keys before insertion?

robert-zaremba commented 3 years ago

@alexanderbez, @musalbas - breaking a range query into single queries for Merkle proofs (assuming this is needed) sounds very inefficient.

musalbas commented 3 years ago

@AdityaSripal do we have/need range proofs in the SDK? I do know we need to absolutely support range/prefix queries. However, I don't know what you mean by "A range query would thus simply consist of multiple Merkle proofs", @musalbas? Can we avoid hashing the keys before insertion?

Hashing keys is necessary for a compute-optimised SMT, because if an attacker can choose which part of the tree a new key can be added to, they can cause Merkle proofs for certain keys to be very long (by inserting a key directly next to another key). Hashing prevents this by randomising the location that keys are in the tree.
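A minimal sketch of that key-hashing step, assuming a SHA-256-based 256-bit tree path:

```go
package smt

import "crypto/sha256"

// pathFor derives the leaf's position - its path of left/right turns from
// the root - from the hash of the key, not the key itself, so an attacker
// cannot choose where in the tree an entry lands.
func pathFor(key []byte) [sha256.Size]byte {
	return sha256.Sum256(key) // bit i of the digest picks the branch at depth i
}
```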

If you want to prove membership of multiple keys in the tree, that would require an individual Merkle proof per key, regardless of whether they are within the same "range".

It would be helpful to know where range proofs are being used for the state tree, though. I understand why range proofs may be helpful for the Merkle tree of transactions, but I'm not sure why they would be needed for the state tree.

alexanderbez commented 3 years ago

Hashing keys is necessary for a compute-optimised SMT, because if an attacker can choose which part of the tree a new key can be added to, they can cause Merkle proofs for certain keys to be very long (by inserting a key directly next to another key). Hashing prevents this by randomising the location that keys are in the tree.

I see. I guess this isn't a problem with IAVL because it's self-balancing...

Since SMTs hash keys prior to insertion, like Merkle-Patricia trees do, I don't see how we'd have prefix iteration unless we can figure out how to leverage the underlying physical storage somehow :(

AdityaSripal commented 3 years ago

@AdityaSripal do we have/need range proofs in the SDK? I do know we need to absolutely support range/prefix queries.

At present, we do not support range proofs (of more than 1 key) in the SDK, since the subspace queries do not support returning proofs. https://github.com/cosmos/cosmos-sdk/blob/master/store/iavl/store.go#L271

I brought it up in an issue a while ago here: https://github.com/cosmos/cosmos-sdk/issues/5241

And I tried to do some work on it, but I couldn't get it to work, and there were other more pressing blockers for IBC proofs, so I dropped it and worked on those.

musalbas commented 3 years ago

Hashing keys is necessary for a compute-optimised SMT, because if an attacker can choose which part of the tree a new key can be added to, they can cause Merkle proofs for certain keys to be very long (by inserting a key directly next to another key). Hashing prevents this by randomising the location that keys are in the tree.

I see. I guess this isn't a problem with IAVL because it's self-balancing...

Since SMTs hash keys prior to insertion, like Merkle-Patricia trees do, I don't see how we'd have prefix iteration unless we can figure out how to leverage the underlying physical storage somehow :(

I believe if you use the optimised hash function optimisation for SMTs instead of the Libra optimisation, you can have free choice over where keys are stored in the tree, and can thus support range proofs. cc @adlerjohn

hxrts commented 3 years ago

@erikgrinaker is on vacation, but when he's back it would be good to understand more about the current status of IAVL and what's needed to ensure the correctness of that implementation.

sunnya97 commented 3 years ago

Just adding another option here: https://github.com/nomic-io/merk by @mappum

It uses RocksDB, supports checkpointing, is still an AVL tree so supports range queries, and has the pieces to support state syncing.

erikgrinaker commented 3 years ago

@erikgrinaker is on vacation, but when he's back it would be good to understand more about the current status of IAVL and what's needed to ensure the correctness of that implementation.

As far as I know, IAVL is correct - it's just slow and inefficient, and not concurrency-safe. I think this is in part because it tries to do too much itself (e.g. versioning) which should ideally be offloaded to the underlying database engine. The access patterns are also suboptimal, requiring lots of random accesses - I'm not familiar with the state of the art in Merkle trees, but there's a reason databases use e.g. B+trees instead of binary trees.

Building a performant data store is non-trivial, and not something we should spend engineering resources on if we can avoid it in my opinion. I'd be happy to help explore alternatives if necessary.

robert-zaremba commented 3 years ago

@erikgrinaker the advantage of B+trees is that, in general, you get faster lookups (they are more compact). This is especially important for storage layers that are not very fast on random access, or where you can't efficiently use hardware caches.

erikgrinaker commented 3 years ago

Exactly. I suspect an AVL tree is inherently inefficient given block devices and CPU caches, but I don't know how Merkelization affects these tradeoffs.

mappum commented 3 years ago

In Merk, the Merkle tree structure doesn't affect the lookups (every node is indexed by its key rather than its hash), so there is practically no read overhead compared to plain RocksDB.

alexanderbez commented 3 years ago

Merk looks very promising! Do you know how it compares to IAVL? I'm sure it's much better, but I just want to get a feel for raw numbers. Also, we'd probably need to port this to Go if there isn't already a version out there.

mappum commented 3 years ago

Do you know how it compares to IAVL?

15x - 40x higher write throughput last time I compared on my MacBook (the lower end with none of the tree kept in memory, the higher end with all of it kept in memory).

Also, we'd need to probably port this to Go if there isn't already a version out there.

Someone did port it to Go: https://github.com/tak1827/merk-go, although I'm not sure what the status of that is; plus, it might be easier to just make bindings for the Rust lib rather than keep the two in sync.

erikgrinaker commented 3 years ago

Yeah, I'd lean towards bindings. I'm also fantasizing about using wasm for cross-language libraries, but don't know if that's viable or performant yet.

robert-zaremba commented 3 years ago

It seems that the library is not very big. Also, I'm preparing a general overview and a spec doc to put together all the requirements and see the best solution (maybe we can do better than a Merkle tree).

robert-zaremba commented 3 years ago

I'm still missing a piece about range proofs - where and why do we need it? I see a need for bulk proofs / aggregated commitments -- for light client use cases and stateless clients.

erikgrinaker commented 3 years ago

Aren't range proofs a prerequisite for absence proofs, since these are really just range proofs over the neighbors of the absent key?

robert-zaremba commented 3 years ago

Aren't range proofs a prerequisite for absence proofs, since these are really just range proofs over the neighbors of the absent key?

It depends on the definition and implementation. For me, a range proof is more an aggregated commitment, or a proof over a range of values.

An absence proof is a proof that a specific value is not part of the state. With Merkle trees, an absence proof is implemented by providing proofs of the two neighbors.
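A sketch of that two-neighbor construction, with illustrative types:

```go
package proofs

// AbsenceProof shows key k is absent from an ordered Merkle tree by proving
// membership of two adjacent leaves kLeft and kRight with kLeft < k < kRight.
type AbsenceProof struct {
	LeftKey, RightKey       []byte   // kLeft < k < kRight in the tree's key order
	LeftBranch, RightBranch [][]byte // Merkle branches proving both leaves
}
```

Verification would additionally check that the two leaves are adjacent in the tree, so that no key can exist between them.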

mappum commented 3 years ago

I'm still missing a piece about range proofs - where and why do we need it?

It makes proofs much more efficient (in terms of proof size, and CPU/IO for the node resolving the proof) when using data structures more advanced than just simple account maps. For example, you can split a large struct into multiple keys, and still get a proof of the fields you need without needing N completely separate Merkle branches.

Although that's just one optimization, the important part is that if you've implemented the tree to support range proofs you can also likely iterate by key in the on-chain logic (very significant for on-chain orderbooks, for example).
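A sketch of the struct-splitting idea, with a key layout assumed for illustration:

```go
package store

// fieldKey places each field of a struct under its own key sharing the
// struct's prefix, so the fields sit next to each other in key order and a
// single range proof over the prefix covers any subset of them.
func fieldKey(accountID, field string) []byte {
	return []byte("acct/" + accountID + "/" + field)
}

// acct/42/balance, acct/42/nonce, acct/42/pubkey are adjacent keys, so one
// range proof over that prefix replaces N completely separate branches.
```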

musalbas commented 3 years ago

Aren't range proofs a prerequisite for absence proofs, since these are really just range proofs over the neighbors of the absent key?

Not necessarily - it depends on the tree design. In SMTs, non-membership proofs don't require range proofs, just a proof to show that the longest prefix of the key leads to a leaf with an empty value.

robert-zaremba commented 3 years ago

It makes proofs much more efficient (in terms of proof size, and CPU/IO for the node resolving the proof) when using data structures more advanced than just simple account maps. For example, you can split a large struct into multiple keys, and still get a proof of the fields you need without needing N completely separate Merkle branches.

I've got that. But I don't see the use case where this would apply. Since the accounts are randomly distributed, practically speaking no set of proofs will represent a range. Instead, as I wrote above, I see the big use case being aggregated proofs.

mappum commented 3 years ago

Not sure why there needs to be an explicit distinction. In Merk any selection of keys can be included in a query, so it can be any combination of ranges or individual keys. https://github.com/nomic-io/merk/blob/develop/docs/algorithms.md#structure

But a concrete example for ranges we are working with is an order book, where we need to access the best orders (sorted by price). We store these orders with keys being the encoding of (price, address). Getting a proof of e.g. the 100 highest bids means getting a single range which will only have around log(n) hashes, whereas an aggregated proof of 100 completely random keys would be around 100 * log(n) hashes.
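A sketch of that key encoding, with the exact layout assumed: a big-endian fixed-width price means lexicographic key order equals numeric price order, so the best orders form one contiguous key range.

```go
package orderbook

import "encoding/binary"

// orderKey encodes (price, address) so that byte-wise key comparison sorts
// orders by price; the address suffix disambiguates orders at the same price.
func orderKey(price uint64, addr []byte) []byte {
	k := make([]byte, 8+len(addr))
	binary.BigEndian.PutUint64(k[:8], price)
	copy(k[8:], addr)
	return k
}
```

Reading the 100 highest bids is then a reverse iteration from the top of this contiguous range, which a single range proof can cover.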

sunnya97 commented 3 years ago

iterate by key in the on-chain logic (very significant for on-chain orderbooks, for example).

Many of the Cosmos SDK core modules make extensive use of this feature (basically anywhere you see an iterator, such as in governance or staking).

And, here's an example of the orderbook like @mappum suggests (https://github.com/sunnya97/sdk-dex-mvp/blob/master/x/orderbook/orderwalls.go).

hxrts commented 3 years ago

@blinky3713 might want to follow along here since FOAM had some specific needs from IAVL

ValarDragon commented 3 years ago

You can generically add key iterability to the leaf data structure, independently of the tree format used for handling the relationship between leaves.

In all the use cases of key iterability I know of in the SDK, we know at app-writing time that a particular leaf wants key iterability.
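One way to read this suggestion, sketched with an assumed design: the leaf stores its in-order successor key alongside its value, forming a linked list over leaves that is independent of how the tree arranges or hashes nodes.

```go
package tree

// iterableLeaf adds key iterability at the leaf level: following NextKey
// walks the leaves in key order regardless of the tree's internal shape.
type iterableLeaf struct {
	Value   []byte
	NextKey []byte // next key in iteration order; nil at the end
}
```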

martyall commented 3 years ago

Thanks @hxrts. I don't really care what happens as long as someone maintains or ports the gRPC interface with protobuf files that I worked really hard to revive and get merged. If that goes away, I'm kind of at a loss for words as to how disappointed I will be, after (1) writing this argument up and having multiple people from tendermint/cosmos sign off on it (at least implicitly) and (2) implementing it myself in this PR after multiple rounds of feedback, considering that I almost never write Go and it was an ordeal for me.

If you are reading this and are unfamiliar with Kepler, the Haskell version of the Cosmos SDK, see here

robert-zaremba commented 3 years ago

I was checking the SDK, and I couldn't find any usage of range proofs. The order book example uses a storage iterator, not a range proof. And the storage iterator can be implemented on the storage layer, not on the state commitment layer.

We don't construct computation proofs, so the state transition is only validated through block acceptance: re-computation of all transactions in the block and a state commitment check.

Range proofs, or partial commitments / selective commitments, could be very useful for stateless clients though.

robert-zaremba commented 3 years ago

I've finalized a report reviewing potential solutions for storage and a new commitment schema (today I added conclusions). I've put it on Dropbox Paper - it's much easier to read and comment there than on a GitHub PR (Markdown in diff format is not very friendly). Later I can move it to GitHub if needed. Main goals:

THE REPORT has 3 sections: the state commitments review, the storage review, and conclusions (personal recommendation). Would love to hear your feedback.

robert-zaremba commented 3 years ago

@mappum - I went through the Merk description (I didn't go into the code). It looks interesting; however, a few things are not clear to me. I added them as TODOs in the review I linked above. It would be great to finalize it:

mappum commented 3 years ago

  • Could you assess if Merk is using less storage than SMT? For me it's the opposite: an optimized SMT stores only the required nodes and can compress keys (we don't store the same prefixes twice), whereas Merk is storing whole keys all the time.

I would assume the key overhead means Merk is using more storage, but IMO for the systems we're building, storage is low on the resource priority list. We can still achieve similar key compression when encoding proofs, so there is not a higher bandwidth cost for queries/state sync.

  • Is there any benchmark comparing Merk and SMT based commitment?

Not that I know of, what operations are you interested in comparing?

robert-zaremba commented 3 years ago

For the tree storage - it would be great to keep it small so it can easily fit in RAM.

Re benchmarks: most of the state commitment operations are related to adding and updating state elements.

musalbas commented 3 years ago

I think Merk will likely be faster (at least for read operations, not sure about updates) because it makes implementation optimisations that various SMT implementations don't. However, I think the optimisations Merk makes are behind-the-scenes implementation optimisations that could be applied to other trees like SMTs too, without changing the underlying data structure. For example, storing nodes by key rather than by hash, to reduce the number of database accesses required to get a key from up to 256 down to 1. This doesn't seem to require changing the actual data structure of the tree, just how the tree is stored locally.
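A sketch contrasting the two layouts, with illustrative interfaces and node encoding (assumptions, not any library's actual API):

```go
package merk

// kv is a minimal read interface over the backing database.
type kv interface {
	Get(key []byte) ([]byte, error)
}

type node struct {
	leftHash, rightHash []byte // children referenced by hash
	value               []byte
}

// Hash-indexed layout: reaching a leaf walks root-to-leaf, costing one DB
// access per level - up to 256 for a 256-bit SMT path.
func getHashIndexed(db kv, rootHash []byte, path []bool, decode func([]byte) node) ([]byte, error) {
	h := rootHash
	for _, right := range path {
		raw, err := db.Get(h)
		if err != nil {
			return nil, err
		}
		n := decode(raw)
		if right {
			h = n.rightHash
		} else {
			h = n.leftHash
		}
	}
	raw, err := db.Get(h) // finally fetch the leaf itself
	if err != nil {
		return nil, err
	}
	return decode(raw).value, nil
}

// Key-indexed layout (the optimisation described above): the node is stored
// under the user key itself, so a read is a single DB access.
func getKeyIndexed(db kv, key []byte) ([]byte, error) {
	return db.Get(key)
}
```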

mappum commented 3 years ago

For the tree storage - it would be great to keep it small so it can easily fit in RAM.

For the in-memory tree representation, Merk could easily achieve the same key compression as an SMT, we're not bound to representing entries the same way as we do on disk: https://github.com/nomic-io/merk/issues/36