cosmos / cosmos-sdk

:chains: A Framework for Building High Value Public Blockchains :sparkles:
https://cosmos.network/
Apache License 2.0

Analyze and improve store/v2 performance #11328

Closed roysc closed 2 years ago

roysc commented 2 years ago

Summary

Analyze the performance of the store/v2 MultiStore and smt.Store, identify the causes of inefficiencies and bottlenecks, and resolve them.

Problem Definition

According to the store benchmarks, the v2 MultiStore is performing slower than the existing IAVL-based MultiStore. Performance was one of the main concerns motivating ADR-40, so it's essential that we find a way to improve on this.

Proposal



roysc commented 2 years ago

Added references to child issues:

roysc commented 2 years ago

Analysis

Both reads and updates are overall significantly slower for the SMT-based stores vs. IAVL.

Using benchstat we can compare the overall performance under different scenarios. In all the following tests, v1 refers to the IAVL-based KV store and multistore, and v2 to the SMT-based KV store and multistore.

In this suite we test pure get and set operations in two cases: with keys that all exist in the store (present), and with keys that all do not (absent). Note that no deletes or commits are performed. getpast performs reads at an earlier version.
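The present/absent split above can be sketched as follows. This is a hedged, simplified harness: the `store` type, `genKeys` helper, and key sizes are illustrative stand-ins, not the SDK's actual benchmark code (which runs these loops under `testing.B`).

```go
package main

import (
	"fmt"
	"math/rand"
)

// store is a stand-in for the KVStore interface under test (hypothetical
// simplification; the real benchmarks exercise the SDK's store types).
type store map[string][]byte

func (s store) Get(key []byte) []byte { return s[string(key)] }
func (s store) Set(key, value []byte) { s[string(key)] = value }

// genKeys deterministically generates n random 16-byte keys from a seed,
// so "present" and "absent" key sets are reproducible and disjoint.
func genKeys(n int, seed int64) [][]byte {
	rng := rand.New(rand.NewSource(seed))
	keys := make([][]byte, n)
	for i := range keys {
		k := make([]byte, 16)
		rng.Read(k)
		keys[i] = k
	}
	return keys
}

func main() {
	s := store{}
	present := genKeys(1000, 1) // inserted before measuring
	absent := genKeys(1000, 2)  // never inserted
	for _, k := range present {
		s.Set(k, []byte("value"))
	}
	// In the real suite these loops run under testing.B and are timed;
	// here we just verify the present/absent split behaves as described.
	hits, misses := 0, 0
	for _, k := range present {
		if s.Get(k) != nil {
			hits++
		}
	}
	for _, k := range absent {
		if s.Get(k) == nil {
			misses++
		}
	}
	fmt.Println(hits, misses)
}
```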

Comparison of benchmarks run on v1 ("old time") and v2 ("new time") using benchstat

name                            old time/op  new time/op    delta
KVStore/memdb/get-present-4     1.52ms ±12%    2.75ms ± 8%    +81.36%  (p=0.000 n=18+18)
KVStore/memdb/get-absent-4      1.65ms ± 4%    4.15ms ±67%   +151.91%  (p=0.000 n=16+20)
KVStore/memdb/set-present-4     3.41ms ± 3%  174.08ms ±43%  +5001.40%  (p=0.000 n=18+20)
KVStore/memdb/set-absent-4      4.54ms ±24%  169.74ms ±14%  +3638.43%  (p=0.000 n=18+20)
MultiStore/memdb-getpast-4      14.6ms ± 6%     5.2ms ±63%    -64.44%  (p=0.000 n=17+19)
MultiStore/memdb/get-present-4  1.55ms ± 3%    2.66ms ± 6%    +71.80%  (p=0.000 n=18+18)
MultiStore/memdb/get-absent-4   1.73ms ± 4%    2.79ms ±19%    +61.17%  (p=0.000 n=18+20)
MultiStore/memdb/set-present-4  3.69ms ±12%  201.79ms ±21%  +5362.11%  (p=0.000 n=17+20)
MultiStore/memdb/set-absent-4   4.56ms ± 7%  203.86ms ±12%  +4372.38%  (p=0.000 n=18+20)

Reads

One of the motivations of ADR-40 was to improve performance by introducing a store design that requires no tree traversal on reads. The v2 store maps values directly in the backing KV database, so reads do not have to go through the SMT structure. However, reads are still slower in v2 in most cases (see below).

A factor to note: the effect of this design difference on performance becomes clear in cases where IAVL does not cache nodes, which includes the getpast case. This is the only case above in which v2 outperforms v1, because IAVL must read from the DB on each node access.

We can highlight the difference even more by reloading the store before testing - now, v2 is faster for all get operations.

KVStore/memdb/get-present-4     38.2ms ± 7%     2.7ms ±13%    -92.93%  (p=0.000 n=15+13)
KVStore/memdb/get-absent-4      37.3ms ± 7%     3.0ms ± 4%    -92.05%  (p=0.000 n=14+14)
KVStore/memdb/set-present-4     12.0ms ±45%   159.6ms ±16%  +1234.56%  (p=0.000 n=15+15)
KVStore/memdb/set-absent-4      8.39ms ±26%  165.88ms ±17%  +1878.30%  (p=0.000 n=15+15)
MultiStore/memdb-getpast-4      15.4ms ±11%     4.0ms ±48%    -73.71%  (p=0.000 n=12+15)
MultiStore/memdb/get-present-4  12.3ms ±45%     2.6ms ± 7%    -79.21%  (p=0.000 n=14+14)
MultiStore/memdb/get-absent-4   6.65ms ±23%    2.72ms ±22%    -59.10%  (p=0.000 n=15+14)
MultiStore/memdb/set-present-4  7.16ms ±24%  203.57ms ±19%  +2742.92%  (p=0.000 n=15+15)
MultiStore/memdb/set-absent-4   6.21ms ±24%  205.42ms ±14%  +3210.47%  (p=0.000 n=15+15)

Writes

The difference in performance for updates in v2 is much more dramatic, with a slowdown of more than an order of magnitude (with memdb).
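One plausible contributor to this write amplification can be sketched as follows: every SMT update must recompute and persist the hashes of all nodes on the path from the updated leaf to the root. This is a naive per-level illustration under assumed costs, not a measurement of the actual implementation (real SMTs, including celestiaorg/smt, shortcut subtrees of default values, so the effective path is closer to log2(n) than the full 256 bits).

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// pathHashCost counts the hash operations needed to update one leaf and
// rehash its path to the root: one leaf hash plus one combine per level.
// In a naive implementation each of these also implies a DB write.
func pathHashCost(depth int, leaf []byte, siblings [][]byte) ([32]byte, int) {
	hashes := 0
	cur := sha256.Sum256(leaf)
	hashes++
	for i := 0; i < depth; i++ {
		// Combine with the sibling hash at each level up to the root.
		cur = sha256.Sum256(append(cur[:], siblings[i]...))
		hashes++
	}
	return cur, hashes
}

func main() {
	siblings := make([][]byte, 256)
	for i := range siblings {
		siblings[i] = make([]byte, 32)
	}
	// Worst-case full 256-bit SMT path vs. a balanced tree of ~1M keys.
	_, smtHashes := pathHashCost(256, []byte("v"), siblings)
	_, iavlHashes := pathHashCost(20, []byte("v"), siblings)
	fmt.Println(smtHashes, iavlHashes)
}
```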

Notes

These benchmarks were run on this branch, which is rebased on the latest v2/Multistore code: https://github.com/vulcanize/cosmos-sdk/tree/roysc/store_benchmarks/store. Run using this script: https://gist.github.com/roysc/45c0ba7631b9ba15d2838ece282d0183

I can provide more data or info as needed. We can also begin discussing solutions here.

i-norden commented 2 years ago

The aforementioned cache wrap for the SMT: https://github.com/vulcanize/smt/blob/cache_listener_wrap/cache_wrapped_mapstore.go (note - branch at https://github.com/roysc/smt/tree/cache_wrap patched to work with benchmarks - Roy)

Note that we need this for ADR-052, so performance was not the main motivation for the cache wrap (but it is nice to see that it not only didn't have a negative impact, but a slightly positive one :D). The reason the impact is not more significant is that we are already caching at the DB transaction level (since, in the case of the SDK, the MapStore backing the SMT is a DB transaction object).
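The shape of such a cache wrap can be sketched as follows. All names here are illustrative, not the actual API from cache_wrapped_mapstore.go: reads check an in-memory cache before falling through to the parent, and writes are buffered until an explicit flush. When the parent MapStore is itself a DB transaction that already buffers in memory, the extra layer saves little, which matches the observation above that the gain was only slight.

```go
package main

import "fmt"

// MapStore is a minimal stand-in for the smt.MapStore interface.
type MapStore interface {
	Get(key string) ([]byte, bool)
	Set(key string, value []byte)
}

type mapStore map[string][]byte

func (m mapStore) Get(k string) ([]byte, bool) { v, ok := m[k]; return v, ok }
func (m mapStore) Set(k string, v []byte)      { m[k] = v }

// cacheWrap buffers writes in memory and serves reads cache-first.
type cacheWrap struct {
	parent MapStore
	cache  map[string][]byte
}

func newCacheWrap(parent MapStore) *cacheWrap {
	return &cacheWrap{parent: parent, cache: map[string][]byte{}}
}

func (c *cacheWrap) Get(k string) ([]byte, bool) {
	if v, ok := c.cache[k]; ok {
		return v, true
	}
	return c.parent.Get(k)
}

func (c *cacheWrap) Set(k string, v []byte) { c.cache[k] = v }

// Write flushes buffered writes to the parent store.
func (c *cacheWrap) Write() {
	for k, v := range c.cache {
		c.parent.Set(k, v)
	}
	c.cache = map[string][]byte{}
}

func main() {
	parent := mapStore{}
	cw := newCacheWrap(parent)
	cw.Set("a", []byte("1"))
	_, inParent := parent.Get("a")
	fmt.Println("before flush, in parent:", inParent)
	cw.Write()
	v, _ := parent.Get("a")
	fmt.Println("after flush:", string(v))
}
```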

It should also be noted that the SMT implementation maintains its own mapping of hash(key) => value that enables it to support key lookups without having to perform a tree traversals (the values mapstore here: https://github.com/celestiaorg/smt/blob/master/smt.go#L21). This is ofc a necessary feature for a general SMT implementation, but it is unused in the context of the SDK since the SDK maintains its own mapping of key => value at the state store level (B1 bucket). So, another optimization would be to remove this mapping maintained at the SMT level. We don't know yet how significant of an impact this might have on time/op but it would remove one DB operation for every KVStore Set/Delete and would remove the duplication of every value on disk (considering how large values can be this should be a significant disk usage optimization).