0xPolygonMiden / miden-node

Reference implementation of the node for the Polygon Miden rollup

concurrent in-memory and db structures updates #58

Closed · hackaugusto closed this 9 months ago

hackaugusto commented 1 year ago

The store maintains an in-memory version of a few data structures: the TSMT for the nullifiers, the SMT for the accounts, and a full MMR. Additionally, the store maintains a DB with long-term storage for the aforementioned structures (and anything else that doesn't need to be available in-memory).

The issue is that currently all of the structures mentioned above are written for a single writer / single version (e.g. guarded by a RwLock that prevents reads while updates are being performed). The consequence is that DB updates and in-memory accesses require synchronization, which may have a performance impact (especially for Merkle tree updates, which may require lots of hashing).
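For illustration, a minimal sketch of the single writer / single version pattern described above. The types are placeholders standing in for the actual nullifier TSMT, account SMT, and chain MMR; this is not the store's real code:

```rust
use std::sync::RwLock;

// Placeholder fields standing in for the real in-memory structures
// (nullifier TSMT, account SMT, chain MMR).
#[derive(Default)]
struct InnerState {
    nullifier_tree: Vec<u64>,
    account_tree: Vec<u64>,
    chain_mmr: Vec<u64>,
}

struct State {
    inner: RwLock<InnerState>,
}

impl State {
    // Readers (e.g. RPC queries) only need a shared lock and can run concurrently.
    fn num_accounts(&self) -> usize {
        self.inner.read().unwrap().account_tree.len()
    }

    // Applying a block holds the exclusive lock for the whole update, so all the
    // Merkle re-hashing happens while every reader is blocked.
    fn apply_block(&self, new_nullifiers: &[u64], account_updates: &[u64]) {
        let mut inner = self.inner.write().unwrap();
        inner.nullifier_tree.extend_from_slice(new_nullifiers); // many hashes in reality
        inner.account_tree.extend_from_slice(account_updates);  // many hashes in reality
        inner.chain_mmr.push(0);                                 // MMR append
    }
}

fn main() {
    let state = State { inner: RwLock::new(InnerState::default()) };
    state.apply_block(&[1, 2, 3], &[10, 20, 30]);
    println!("accounts tracked: {}", state.num_accounts());
}
```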

This issue is to discuss possible implementations to improve concurrent usage of these structures.

hackaugusto commented 1 year ago

some initial ideas:

bobbinth commented 1 year ago

To put a bit more context around this:

There are two Merkle tree structures which need to be updated with every block: the account DB and the nullifier DB (I'm ignoring MMR updates as these should be negligible). The number of updates will depend on TPS. Specifically, for every transaction we'll need to update the state of 1 account, and thus recompute the root of the account DB. Similarly, for every new nullifier we'll need to recompute the root of the nullifier DB. The table below shows the number of hashes needed to apply a block to the DB for 3 levels of TPS:

|                 | 100 TPS  | 1000 TPS | 10000 TPS |
| --------------- | -------- | -------- | --------- |
| Tx / block      | 300      | 3000     | 30000     |
| MT updates      | 600      | 6000     | 60000     |
| Hashes          | 19K      | 192K     | 1.9M      |
| Time (1 thread) | 0.06 sec | 0.6 sec  | 6 sec     |

Assumptions in the above table:

Ideally, we'd want to be able to apply block updates very quickly, even if the part on the critical path is much smaller than the entire update. I think we should target applying a block in under 50ms. This means that by the time we get to 100 TPS we need to be able to process account and nullifier DB updates in multiple threads.
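A quick back-of-the-envelope sketch of the arithmetic behind the table. The 3-second block time, the ~32 hashes per Merkle tree update, and the ~320K hashes/s single-thread rate are inferred from the figures above rather than stated explicitly:

```rust
// Reproduces the table's figures from a few inferred assumptions.
fn main() {
    const BLOCK_TIME_SECS: u64 = 3;         // assumed: 300 tx/block at 100 TPS implies 3 s blocks
    const HASHES_PER_MT_UPDATE: u64 = 32;   // assumed: roughly one hash per tree level
    const HASHES_PER_SEC_1_THREAD: f64 = 320_000.0; // assumed single-thread hash rate

    for tps in [100u64, 1_000, 10_000] {
        let tx_per_block = tps * BLOCK_TIME_SECS;
        let mt_updates = tx_per_block * 2; // one account update + one nullifier per tx
        let hashes = mt_updates * HASHES_PER_MT_UPDATE;
        let secs = hashes as f64 / HASHES_PER_SEC_1_THREAD;
        println!(
            "{tps} TPS: {tx_per_block} tx/block, {mt_updates} MT updates, {hashes} hashes, {secs:.2} s (1 thread)"
        );
    }
}
```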

hackaugusto commented 1 year ago

I was curious whether memory layout had a perceivable impact on hashing speed, so I did some simple benchmarks:

  1. using sequential memory (similar to how the miden_crypto::MerkleTree constructor works)
  2. updating a MerkleTree a) with monotonically increasing positions b) with random positions
  3. updating a SimpleSmt a) with monotonically increasing positions b) with random positions

On my machine the throughput is not affected by memory layout at all (all versions gave me around ~210K hashes/s). So it seems that using rayon to compute these values is a good solution.

Bench code:

```rust
use criterion::{criterion_group, criterion_main, BatchSize, BenchmarkId, Criterion, Throughput};
use miden_crypto::{
    hash::rpo::{Rpo256, RpoDigest},
    merkle::{MerkleTree, SimpleSmt},
    Felt, ZERO,
};

const TEST_DEPTH: usize = 16;

fn hash_bench(c: &mut Criterion) {
    for power in 10..16 {
        {
            let mut group = c.benchmark_group("Hash RPO256 2-to-1");
            let hash_count = 2u64.pow(power);
            group.throughput(Throughput::Elements(hash_count));
            group.bench_with_input(
                BenchmarkId::from_parameter(hash_count),
                &hash_count,
                |b, &hash_count| {
                    b.iter_batched(
                        || {
                            // data for the tests, for each hash a pair of elements is needed
                            let pairs_count = hash_count * 2;
                            let data: Vec<RpoDigest> = (0..pairs_count)
                                .map(|i| [ZERO, ZERO, ZERO, Felt::new(i)].into())
                                .collect();

                            // pre-allocated result vector, the allocation should not count in
                            // the benchmark
                            let result: Vec<RpoDigest> =
                                (0..hash_count).map(|_| Default::default()).collect();

                            (data, result)
                        },
                        |(pairs, mut result)| {
                            let mut pos = 0;
                            for i in 0..result.len() {
                                result[i] = Rpo256::merge(&[pairs[pos], pairs[pos + 1]]);
                                pos += 2;
                            }
                        },
                        BatchSize::SmallInput,
                    )
                },
            );
        }
        {
            let mut group = c.benchmark_group("Hash MerkleTree");
            let hash_count = 1 << power;
            group.throughput(Throughput::Elements(hash_count));
            group.bench_with_input(
                BenchmarkId::from_parameter(hash_count),
                &hash_count,
                |b, &hash_count| {
                    b.iter_batched(
                        || {
                            let size = 1 << TEST_DEPTH;
                            let leaves: Vec<_> =
                                (0..size).map(|i| [ZERO, ZERO, ZERO, Felt::new(i)]).collect();
                            MerkleTree::new(&leaves).unwrap()
                        },
                        |mut state| {
                            // each leaf update results into depth hashes
                            let updates = hash_count / (TEST_DEPTH as u64);
                            for i in 0..updates {
                                let value = [ZERO, ZERO, ZERO, Felt::new(i)].into();
                                let index = state.root()[3].inner() & 0b1111111111111111;
                                state.update_leaf(index, value).unwrap();
                            }
                        },
                        BatchSize::SmallInput,
                    )
                },
            );
        }
        {
            let mut group = c.benchmark_group("Hash SimpleSmt");
            let hash_count = 1 << power;
            group.throughput(Throughput::Elements(hash_count));
            group.bench_with_input(
                BenchmarkId::from_parameter(hash_count),
                &hash_count,
                |b, &hash_count| {
                    b.iter_batched(
                        || SimpleSmt::new(32).unwrap(),
                        |mut state| {
                            // each leaf update results into depth hashes
                            let updates = hash_count / 32;
                            for i in 0..updates {
                                let value = [ZERO, ZERO, ZERO, Felt::new(i)].into();
                                let index = state.root()[3].inner() & 0b1111111111111111;
                                state.update_leaf(index, value).unwrap();
                            }
                        },
                        BatchSize::SmallInput,
                    )
                },
            );
        }
    }
}

criterion_group!(hash_group, hash_bench);
criterion_main!(hash_group);
```
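As a rough sketch of the rayon direction (hypothetical code, not part of the node): it parallelizes the same 2-to-1 hashing workload as the benchmark above, assuming the same miden_crypto API, and ignores the level-by-level dependencies of a real Merkle tree update.

```rust
use miden_crypto::{
    hash::rpo::{Rpo256, RpoDigest},
    Felt, ZERO,
};
use rayon::prelude::*;

fn main() {
    // One level of a Merkle tree: hash adjacent pairs of nodes.
    let nodes: Vec<RpoDigest> = (0..(1u64 << 16))
        .map(|i| [ZERO, ZERO, ZERO, Felt::new(i)].into())
        .collect();

    // Each pair is independent, so the 2-to-1 hashes can run on all cores.
    let parents: Vec<RpoDigest> = nodes
        .par_chunks(2)
        .map(|pair| Rpo256::merge(&[pair[0], pair[1]]))
        .collect();

    println!("computed {} parent nodes", parents.len());
}
```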
hackaugusto commented 9 months ago

The critical section was reduced to a minimum by https://github.com/0xPolygonMiden/miden-node/pull/70 :

https://github.com/0xPolygonMiden/miden-node/blob/8e2f6e7649b72c611fa339a79a08f70bce94af41/store/src/state.rs#L311-L319

The above approach requires a copy, but that will always be the case with a modify-in-place structure.
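A minimal sketch of that pattern, with placeholder types rather than the actual store code: the expensive re-hashing happens on a copy outside the lock, and the write lock is held only long enough to install the result.

```rust
use std::sync::RwLock;

// Placeholder fields standing in for the account SMT and nullifier TSMT.
#[derive(Clone, Default)]
struct InnerState {
    account_tree: Vec<u64>,
    nullifier_tree: Vec<u64>,
}

struct State {
    inner: RwLock<InnerState>,
}

impl State {
    // Assumes a single block producer, so no other writer can sneak in
    // between the copy and the swap.
    fn apply_block(&self, account_updates: &[u64], new_nullifiers: &[u64]) {
        // 1. Copy the current state; the read guard is dropped right away.
        let mut updated = self.inner.read().unwrap().clone();

        // 2. Do all the expensive re-hashing on the copy, with no lock held.
        updated.account_tree.extend_from_slice(account_updates);
        updated.nullifier_tree.extend_from_slice(new_nullifiers);

        // 3. The critical section is now just installing the precomputed state.
        *self.inner.write().unwrap() = updated;
    }
}

fn main() {
    let state = State { inner: RwLock::new(InnerState::default()) };
    state.apply_block(&[1, 2], &[3, 4]);
    println!("accounts: {}", state.inner.read().unwrap().account_tree.len());
}
```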