dan-da commented 6 months ago

This is a first cut at making the storage layer async in neptune-core.

I am making it a draft PR because I have a few remaining todos to polish things up:

[x] add sync ~~and storage~~ [1] benches from twenty-first, modified as needed (easy)
[x] investigate removal of impl Mmr for ArchivalMmr. done, see writeup.
[ ] investigate remaining runtime error(s). (do they also occur on master?)
[x] rename crate::locks::sync to crate::locks::std. (easy)
[x] rebase and squash commits, and/or improve commit messages.

stretch goal:

[x] impl MmrAsync trait, ie approach 2 in this writeup. (done: see #121)

[1] storage benches put aside for now because they would require async support in divan which appears to be in progress, or else a re-write without divan.

Present status:

all tests and doctests pass.
fixed a runtime error: "In order to be transferred, a Block must have a non-None proof field." This seems to have been introduced elsewhere, not due to async mods.

One node (of 3) hits this runtime error:

thread 'main' panicked at /home/danda/dev/neptune/neptune-core/src/models/state/wallet/wallet_state.rs:559:9:
assertion `left == right` failed: Mutator set in wallet-handler must agree with that from applied block

there is a warning:

"2024-03-20T02:35:03.002579345Z  WARN ThreadId(03) neptune_core::mine_loop: Received block is timestamped in the future; mining on future-timestamped block."

dan-da commented 6 months ago

Regarding ArchivalMmr and the Mmr trait

Decoupling ArchivalMmr from the (non-async) Mmr trait seems complex. That trait is used in a number of places, such as:

  MutatorSetKernel<MMR: Mmr<Hash>>
  MsMembershipProof::batch_update_from_addition()
  RemovalRecord::batch_update_from_addition()
  mutator_set::insert_mock_item()
  mutator_set::remove_mock_item()

MutatorSetKernel is the most entangled. It is defined as:

    pub struct MutatorSetKernel<MMR: Mmr<Hash>> {
        pub aocl: MMR,
        pub swbf_inactive: MMR,
        pub swbf_active: ActiveWindow,
    }

It is instantiated in other locations as both ArchivalMmr (async) and MmrAccumulator (sync, from twenty_first). It has several complex methods, and is non-trivial.

Presently, I have ArchivalMmr implementing Mmr trait by spawning a new OS thread for each trait-method call which creates a new tokio runtime, which calls the ArchivalMmr async method that actually implements the functionality (and interacts with storage layer). Example:

    fn append(&mut self, new_leaf: Digest) -> MmrMembershipProof<H> {
        std::thread::scope(|s| {
            s.spawn(|| {
                let runtime = tokio::runtime::Runtime::new().unwrap();
                runtime.block_on(self.append_async(new_leaf))
            })
            .join()
            .unwrap()
        })
    }

This works, but seems quite inefficient. Unfortunately though the only way I know of to call an async method from a sync method already executing in an async runtime is to spawn another thread with a new runtime.

Worse though than inefficient, is that this actually makes the async calling fn block until the thread is finished. So for ArchivalMmr, this is no better (actually worse) than before this PR with regards to blocking on storage calls. Since ArchivalMmr is used heavily by MutatorSetKernel, this seems a big problem.

However, It may be adequate for merging this PR, and then I can optimize/improve in a follow-up.

Optimizing: Getting rid of the thread::spawn()

Brainstorming approaches.

An improvement, though not a solution, would be to create a long-lived thread for each ArchivalMmr instance with a 2-way message channel. So instead of creating a thread per method call, we would pass tasks and results back and forth. This eliminates the cost of creating an OS thread plus tokio runtime for each method call. Unfortunately however it still means that ArchivalMmr methods block on storage, and our goal is to avoid that.
An (obvious?) approach is to make a new MmrAsync trait and a new MmrAccumulatorAsync that also uses it. That should work and might be the best solution, but it forces code to become async that isn't presently, and doesn't really need to be. It just becomes forced into it because of implementing or using the Mmr trait. This also gets messy because implementors of various traits from tasm-lib are involved.

Do not use any Mmr trait and keep MmrAccumulator sync. This would require that MutatorSetKernel can somehow accomodate both sync and async. It could mean adding a MutatorSetKernelAsync that duplicates a lot of logic. Or it could mean that we define it something like:

    pub enum MmrType {
        ArchivalMmr,
        MmrAccumulator,
    }

    pub struct MutatorSetKernel {
        pub aocl: MmrType,
        pub swbf_inactive: MmrType,
        pub swbf_active: ActiveWindow,
    }

The difficulty here is that all MutatorSetKernel methods must use a ton of match statements like:

   match self.aocl {
       MmrType::ArchivalMmr(m) => m.count_leaves().await,
       MmrType::MmrAccumulator(m) => m.count_leaves(),
   }

This is unwieldy. (though probably could be made less ugly with a macro)

Both approaches 2 and 3 require MutatorSetKernel to become async, which means that anything using it must also be async. Approach 3 allows MmrAccumulator to remain sync, is the primary advantage, at cost of making MutatorSetKernel more complex.

A final approach I considered would impl ArchivalMmr with a Vec instead of a StorageVec. So it would not be database backed, but would instead load all data from DB at creation, store pending writes, and persist back on request. This should allow ArchivalMmr trait methods to remain sync. Async methods would exist for loading and persisting data.

The drawback is that all data must be loaded up-front and remain in mem. I'm unsure of how much data we are talking about, but my guess is that it may be huge eventually, and not possible to load/store all of it in RAM.

Specifically, ArchivalMmr is used for MutatorSetKernel::aocl and MutatorSetKernel::swbf_inactive fields.

All this considered, approach 2 seems to me the most straight-forward and the one I intend to pursue.

dan-da commented 6 months ago

Ok, I've made branch make_storage_and_mmr_trait_async with approach 2 (async Mmr trait) that builds cleanly, though not tests yet.

ArchivalMmr no longer needs to spawn a thread plus runtime for each trait method invocation. So that's a win.

I do have to spawn a thread in one other place though: RemovalRecordsIntegrity::rust_shadow(). This is in the impl CompiledProgram for RemovalRecordsIntegrity trait impl, and the trait is defined in tasm-lib and has default method impls, so that fn cannot be made async easily.

In order to get the program to build, I had to comment out several impls of tasm-lib traits. These seem mainly related to testing, and I think they really belong within test module anyway. I'm not certain they can be made to build again unless the tasm traits are made async as well.

my next task will be to get the tests building again.

dan-da commented 6 months ago

tests are building cleanly and passing in branch make_storage_and_mmr_trait_async.

except for some proptests I had to comment-out for now because proptest crate/macro is not async compatible. I see somebody created a proptest_async crate recently (yesterday?) but it only works with async_std, not tokio yet. The author states it would be easy to add tokio support.

The thread spawn in RemovalRecordsIntegrity::rust_shadow() remains. I don't know how often that fn will be called, or what impact it might have on performance, if any. For now I don't have a better solution, but it may be possible to refactor it out for those who understand tasm-lib well.

Anyway, I am happier with the code in make_storage_and_mmr_trait_async, so after a little more cleanup/review I will open a new PR for that branch that will obsolete this PR.

aszepieniec commented 6 months ago

This error

thread 'main' panicked at /home/danda/dev/neptune/neptune-core/src/models/state/wallet/wallet_state.rs:559:9:
assertion `left == right` failed: Mutator set in wallet-handler must agree with that from applied block

indicates that somewhere the mutator set removal records or membership proofs are being updated incorrectly. The following propositions are probably true (quantified over unknown details):

The error was introduced as a side-effect of simplifying the Block struct, specifically the Body part. Previously it held two mutator set accumulators, one before and one after applying the additions and removals given by the transaction. This is redundant, but also makes keeping track of things easier.
The error is caused by incorrectly updating removal records or membership proofs with the update contained in a block. The new block structure makes this update step rather more complex.
It is possible to reduce this error to a failing unit test, by which I mean a test function that does not require running multiple interacting nodes or even threads.

dan-da commented 6 months ago

closing in favor of #124.

Neptune-Crypto / neptune-core

Make storage async #120

Regarding ArchivalMmr and the Mmr trait

Optimizing: Getting rid of the thread::spawn()