Minimizing round trips to the daemon on wallet load

j-berman commented 5 months ago

Continuing from point 2 in this comment in the async scanner: https://github.com/UkoeHB/monero/pull/23#issuecomment-2036086371

Ideally wallets minimize round trips to the daemon when loading. This was the idea behind returning pool txs in the /getblocks.bin endpoint: https://github.com/monero-project/monero/pull/8076

The Seraphis lib scanner currently does 1) scan chain -> 2) scan pool -> 3) scan chain in a scan pass, which would require a minimum of 3 round trips to the daemon. @UkoeHB mentioned:

I intentionally added a follow-up pass because there is a race condition between chain scanning and pool scanning that can cause mined blocks to contain txs invisible to a wallet even after it claims to have scanned the pool.

Proposed solution to this problem: ensure that when the /getblocks.bin endpoint returns pool txs, those pool txs correspond to the state of the pool at the top_block_hash also returned by the endpoint (to be clear, this would be a change to the daemon RPC side). This way the scanner can be certain that once it has scanned up to chain tip, it knows the state of the pool at that time.

j-berman commented 5 months ago

(obviously the state of the pool isn't fixed to a block hash, but since txs can't be both in the pool and the chain, then this should work)

j-berman commented 5 months ago

Alternative that's probably simpler: scan pool first then chain. If any of the user's pool txs are identified in the chain, it can be inferred it's no longer in the pool. This is basically the same as wallet2 today.

UkoeHB commented 4 months ago

Alternative that's probably simpler: scan pool first then chain. If any of the user's pool txs are identified in the chain, it can be inferred it's no longer in the pool. This is basically the same as wallet2 today.

This doesn't work for long scan jobs because the pool view will be quite stale by the end.

j-berman commented 4 months ago

Seems acceptable because the pool just gets scanned on the next refresh loop

UkoeHB commented 4 months ago

I'd rather the scanner user control the features it uses, than for the scanner itself to be weakened. Can you accomplish the desired optimization with a config, or by simply ignoring the initial or follow-up scan?

j-berman commented 4 months ago

The primary function in question:

bool refresh_enote_store(const scanning::ScanMachineConfig &scan_machine_config,
    scanning::ScanContextNonLedger &nonledger_scan_context_inout,
    scanning::ScanContextLedger &ledger_scan_context_inout,
    scanning::ChunkConsumer &chunk_consumer_inout)
{
    // 1. perform a full scan
    if (!refresh_enote_store_ledger(scan_machine_config, ledger_scan_context_inout, chunk_consumer_inout))
        return false;

    // 2. perform an unconfirmed scan
    if (!refresh_enote_store_nonledger(SpEnoteOriginStatus::UNCONFIRMED,
            SpEnoteSpentStatus::SPENT_UNCONFIRMED,
            nonledger_scan_context_inout,
            chunk_consumer_inout))
        return false;

    // 3. perform a follow-up full scan
    // rationale:
    // - blocks may have been added between the initial on-chain pass and the unconfirmed pass, and those blocks may
    //   contain txs not seen by the unconfirmed pass (i.e. sneaky txs)
    // - we want scan results to be chronologically contiguous (it is better for the unconfirmed scan results to be stale
    //   than the on-chain scan results)
    if (!refresh_enote_store_ledger(scan_machine_config, ledger_scan_context_inout, chunk_consumer_inout))
        return false;

    return true;
}

How I'm thinking about options also based on some discussions we've had in the past:

Route 1 (scan chain then pool)

refresh_enote_store is rewritten to do a 2-step refresh_enote_store_ledger, then refresh_enote_store_nonledger (or include a config option to skip the step 3 follow-up chain pass)
server-side RPC /getblocks.bin grabs a lock on the chain and pool for duration of serving the response to ensure the pool state corresponds to the top block hash returned by the endpoint
the nonledger_scan_context_inout and ledger_scan_context_inout would both be children of a single parent class
in step 1 refresh_enote_store_ledger, the ledger_scan_context_inout is able to fetch pool txs and will keep those pool txs cached in the parent class
in step 2 refresh_enote_store_nonledger, those pool txs will be read from the parent class cache in nonledger_scan_context_inout.get_nonledger_chunk and consumed
- side comment: if we wanted to prevent clients from holding the entire pool in memory, we can cache pool tx hashes here (separate discussion)
the ledger scan context's begin_scanning_from_index also wipes the parent's tx pool cache

Pros

Achieves minimal trips to daemon.
Achieves a "correct" refresh_enote_store which completes a full scan of all state known to the daemon at a specific point in time.

Cons

I'm hesitant to mess with the concurrency model of the daemon's /getblocks.bin. It adds more complexity and risk to the daemon-side changes, and will add to review time.
The parent scan context class feels a bit hacky. The code clearly separates the nonledger and ledger scan contexts; it feels hacky to give nonledger work to the ledger scan context.

Route 2 (scan pool then chain)

config option to ignore the first refresh_enote_store_ledger in refresh_enote_store
the nonledger_scan_context_inout and ledger_scan_context_inout would again both be children of a single parent class.
in step 1 refresh_enote_store_nonledger, the nonledger_scan_context_inout fetches pool txs and blocks via /getblocks.bin, caching blocks in the parent class.
in step 2 refresh_enote_store_ledger, blocks are read from the parent class cache in ledger_scan_context_inout.get_onchain_chunk and consumed. If there are more blocks to scan, it'll do so.

Pros

Achieves minimal trips to daemon.
No further changes to the daemon are necessary.

Cons

It is possible for the pool state to be stale by the time all scanning is done.
- Mitigated by the fact that the client will scan the pool again on next refresh loop.
- Technicallllly also worth pointing out that with a 3-step scan pass, the pool state can still be stale.
Still hacky parent class approach (the non ledger scan context is tasked with fetching and caching onchain block data)

Bonus Route 1a or Route 2b

Seraphis lib is rewritten as follows:

Chunk chunk = scan_context.get_next_chunk()

if chunk == nonledger chunk
    handle nonledger chunk
else if chunk == ledger chunk
    handle ledger chunk (ensure contiguity to prior ledger chunk and consume)
else
    must be the terminal chunk, reached scanner end state

The scan context could either go with a similar approach to route 1 (scan chain then pool but daemon guarantees pool state at top block hash), or route 2 (scan pool then chain).
The pro to this approach is that it removes the need for the hacky parent class approach.
The con is that it leaves more "correctness" handling of the scanner to the scan context sort of. It's also marginally harder to reason through (what order are nonledger / ledger chunks going to be consumed).

Edited to add clearer terminal chunk handling.

j-berman commented 4 months ago

Route 2 is probably simplest to implement imo

UkoeHB commented 4 months ago

Ok after reviewing the seraphis_lib implementation I remember why I wrote it like this. You don't need to use refresh_enote_store. Notice that it's in the file scan_process_basic.h/.cpp. The idea is you can write any high-level function you want to orchestrate scanning using the scan-machine API. So I'd just do option 2 with a new function.

UkoeHB / monero