Validator needs to reconstruct its state from historical blockchain data #458

Open AuHau opened 1 year ago

AuHau commented 1 year ago

As part of the improvement in #457 we also need to add the ability to reconstruct the validator state (or make it persistent with some sort of database) when it boots up, as the validator also needs to validate requests that were started before it booted up. Currently it only validates new requests.

marcinczenko commented 6 days ago

Validator Historical Data

At the moment the validator starts, there may already be existing active storage requests containing slots with slot ids that should be monitored by the validator: slot ids that either belong to the group observed by the validator, or any slot id at all if the validator is to monitor the whole slot id space.

In this context we want the validator to find and monitor all those slot ids associated with storage requests that started earlier and have not yet concluded.

max duration of a storage request

Currently it does not look like the duration of a storage request is limited. In that case we need to pull the historical events from some pre-configured date (it does not make sense to pull history from ancient times when Codex did not exist).

getting historical data

To get the historical data we can use a function similar to the one already defined on OnChainMarket:

method queryPastEvents*[T: MarketplaceEvent](
  market: OnChainMarket,
  _: type T,
  blocksAgo: int): Future[seq[T]] {.async.} =

  convertEthersError:
    let contract = market.contract
    let provider = contract.provider

    let head = await provider.getBlockNumber()
    let fromBlock = BlockTag.init(head - blocksAgo.abs.u256)

    return await contract.queryFilter(T,
                                      fromBlock,
                                      BlockTag.latest)

It looks like queryPastEvents is not used anywhere in the code base.

We can add a new function taking a fromDate argument that will take care of computing the fromBlock needed for the underlying queryFilter procedure.
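A possible shape for that overload (sketch only; the fromTime parameter and the blockNumberAtTimestamp helper are illustrative names, with blockNumberAtTimestamp standing for the estimate-plus-fine-tune step described below):

method queryPastEvents*[T: MarketplaceEvent](
  market: OnChainMarket,
  _: type T,
  fromTime: UInt256): Future[seq[T]] {.async.} =

  convertEthersError:
    let contract = market.contract
    let provider = contract.provider
    # hypothetical helper resolving a point in time (epoch seconds) to a block number
    let fromBlockNumber = await provider.blockNumberAtTimestamp(fromTime)
    return await contract.queryFilter(T,
                                      BlockTag.init(fromBlockNumber),
                                      BlockTag.latest)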

To compute the block number at the fromDate we can first estimate it (taking 12-15s as the average block time, or basing the estimate on current network conditions by looking at the timestamps of the last two blocks) and then fine-tune using binary search (this should give us the minimum number of calls).
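A minimal sketch of the estimate based on the last two blocks (the proc name is made up, and it assumes nim-ethers' provider.getBlock returns an optional Block with number and timestamp fields):

import pkg/chronos
import pkg/ethers
import pkg/questionable
import pkg/stint

proc estimateAverageBlockTime(provider: Provider): Future[UInt256] {.async.} =
  ## Estimate the average block time from the timestamps of the two most
  ## recent blocks; fall back to an assumed 12s when a block cannot be fetched.
  let fallback = 12.u256
  without latest =? await provider.getBlock(BlockTag.latest):
    return fallback
  without number =? latest.number:
    return fallback
  without previous =? await provider.getBlock(BlockTag.init(number - 1.u256)):
    return fallback
  return latest.timestamp - previous.timestamp

The estimated block number for the fromDate would then be roughly head - (now - fromDate) div averageBlockTime, which serves as the starting point for the fine-tuning step.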

I did not spot any existing functionality that already implements this. If you know it already exists somewhere, let me know...

Filtering events

I would start with the following procedure (let's call it ~~findActiveSlots~~ restoreHistoricalState):

  1. Retrieve all the SlotFilled events from fromDate and store the corresponding SlotIds.
  2. Exclude those SlotIds that do not match the validation config (note to myself: do not forget to take maxSlots into account).
  3. From the remaining SlotIds, further remove those that are no longer in SlotState.Filled using validation.market.slotState(slotId).
  4. What's left are the SlotIds that are still in SlotState.Filled and thus need to be checked for proofs.
  5. Pre-populate validation.slots with the remaining slots.

~~findActiveSlots~~ restoreHistoricalState should be called ~~at the beginning of Validation.run before the while true: loop~~ in Validation.start ~~before~~ after calling await validation.subscribeSlotFilled().
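To make the steps above concrete, here is a rough sketch of what restoreHistoricalState could look like. The fromTime-based overload of queryPastEvents, the shouldValidateSlot helper, the historyStartTime config value and the exact SlotFilled event fields are assumptions for illustration, not the existing API:

# codex-internal imports (validation, market, contract requests) omitted
import std/sets
import pkg/chronos

proc restoreHistoricalState(validation: Validation) {.async.} =
  # 1. retrieve the SlotFilled events since the configured start date
  #    (hypothetical fromTime-based overload of queryPastEvents)
  let events = await validation.market.queryPastEvents(
    SlotFilled, fromTime = validation.config.historyStartTime)

  for event in events:
    # assumed event fields: the slot id is derived from request id + slot index
    let id = slotId(event.requestId, event.slotIndex)

    # 2. skip slots that do not match the validation config
    #    (group membership, maxSlots) - hypothetical helper
    if not validation.shouldValidateSlot(id):
      continue

    # 3. skip slots that are no longer in SlotState.Filled
    if (await validation.market.slotState(id)) != SlotState.Filled:
      continue

    # 4./5. what remains still needs proof checking: pre-populate the slots
    #       tracked by the validator
    validation.slots.incl(id)

Validation.start would then first await validation.subscribeSlotFilled() and only afterwards await validation.restoreHistoricalState(), so no events can be missed between the historical query and the live subscription.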

I am doing a small spike (as I like to experiment with some of the libraries in isolation, here nim-ethers), and if what I am describing is roughly right, I would aim to have it ready for review in about 2 days. The corresponding PR is coming soon.

emizzle commented 5 days ago

Nice design @marcinczenko 🙌

It looks like queryPastEvents is not used anywhere in the code base.

Apart from in the tests, the usage of this has been removed from the codebase as it was no longer needed. However, I knew we'd be using this function again so it was left in :) Usage is like so:

let reqs = await market.queryPastEvents(StorageRequested, 5)

(taking either 12-15s as the average block time - we can also base this estimate on the current network conditions by looking at the timestamps of the last two blocks)

We will be launching mainnet on our own Linea L2, I believe? The mainnet block times seem to be around 2s for now:

[chart: average block time on Linea mainnet]

I like your idea of monitoring the block times, however my opinion is that for now, we find out what the average will be for the L2 we deploy on, and use that as a "hardcoded" average time. With that time, you can find the approximate block number (head - 30 days in seconds / 2 secs per block), get the block at that block number, and inspect the block timestamp. If the timestamp is < 30 days ago, keep getting a previous block until a block timestamp of >= 30 days before now is found.
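A sketch of that walk-back under those assumptions (2s blocks, 30-day window; blockTimestamp and findStartBlock are made-up names, with blockTimestamp a thin helper around nim-ethers' provider.getBlock):

import pkg/chronos
import pkg/ethers
import pkg/questionable
import pkg/stint

const
  SecondsPerBlock = 2'u64                  # assumed average block time on the L2
  RequestWindow = 30'u64 * 24 * 60 * 60    # 30 days in seconds

proc blockTimestamp(provider: Provider, number: UInt256): Future[UInt256] {.async.} =
  ## Timestamp of the block at `number`, or 0 when it cannot be fetched.
  without blk =? await provider.getBlock(BlockTag.init(number)):
    return 0.u256
  return blk.timestamp

proc findStartBlock(provider: Provider, now: UInt256): Future[UInt256] {.async.} =
  ## Start from the estimated block number and walk back one block at a time
  ## until its timestamp is at least 30 days old.
  let head = await provider.getBlockNumber()
  let cutoff = now - RequestWindow.u256
  var candidate = head - (RequestWindow div SecondsPerBlock).u256
  while candidate > 0.u256 and (await provider.blockTimestamp(candidate)) > cutoff:
    candidate = candidate - 1.u256
  return candidate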

For now, we are targeting 30 days as the maximum request duration, which is likely to be gradually increased in the future.

  1. From the remaining SlotIds, further remove those that are no longer in SlotState.Filled using validation.market.slotState(slotId).

The validator does this already in removeSlotsThatHaveEnded, so no need to double-up imo. However, this has been my main concern from the start: that we may end up waiting some time to get responses back for all these slotState requests. Maybe there's a way to optimise these calls, eg batching? We discussed already the option for persisting the events in the RepoStore, but have opted to avoid that extra work for now, and possibly revisit it later on if needed.

marcinczenko commented 4 days ago

let reqs = await market.queryPastEvents(StorageRequested, 5)

Didn't you mean let reqs = await market.queryPastEvents(SlotFilled, 5)?

We will be launching mainnet on our own Linea L2, I believe? The mainnet block times seem to be around 2s for now:

Indeed that is pretty much certain. I still keep thinking of Ethereum for some reason.

I like your idea of monitoring the block times, however my opinion is that for now, we find out what the average will be for the L2 we deploy on, and use that as a "hardcoded" average time. With that time, you can find the approximate block number (head - 30 days in seconds / 2 secs per block), get the block at that block number, and inspect the block timestamp. If the timestamp is < 30 days ago, keep getting a previous block until a block timestamp of >= 30 days before now is found.

30 days - seems good to me.

Initially I thought that your suggestion to perform a flat search would be simpler than doing a binary search, but after thinking a bit I am not that convinced anymore. The average block time on Linea (https://lineascan.build/chart/blocktime) seems to be quite stable at the moment (fluctuations are below 5% on the daily average), yet how much we can depend on that I do not know. For 30 days, assuming 2s between blocks, we are talking about 2592000 / 2 = 1296000 blocks. If - because of fluctuations or some other reason - the estimate is off by 10%, that means 1296000 * 0.1 = 129600 blocks to traverse. Even at 5% that is still potentially a lot of blocks to query, and we would immediately need some sort of optimisation. The binary search I proposed is a standard way to deal with this problem, and it reduces the number of queries quite a bit: being O(log N), we only need about 20 comparisons for 1000000 items. The implementation is also pretty standard and super easy, and I have already implemented a version of queryPastEvents using binary search (I played with it yesterday evening before you responded). Would it be reasonable to just start with it?
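For reference, a minimal version of that binary search (reusing the hypothetical blockTimestamp helper sketched earlier; the estimate from the average block time can additionally be used to seed low/high and shave off an iteration or two):

proc blockNumberAtTimestamp(
    provider: Provider, target: UInt256): Future[UInt256] {.async.} =
  ## Binary search for the earliest block whose timestamp is >= target.
  ## O(log N): roughly 20 getBlock calls for ~1,000,000 candidate blocks.
  var low = 0.u256
  var high = await provider.getBlockNumber()
  while low < high:
    let mid = (low + high) div 2.u256
    if (await provider.blockTimestamp(mid)) < target:
      low = mid + 1.u256
    else:
      high = mid
  return low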

However, this has been my main concern from the start: that we may end up waiting some time to get responses back for all these slotState requests. Maybe there's a way to optimise these calls, eg batching?

Yes, if the number of observed slots grows large, we would indeed start to feel it. Batching may indeed help, and maybe there are other options as well. Not sure if we should immediately worry about it though? If it were up to me only, I would probably start simple, without any optimisation. It feels like a serious thing to me, but maybe it can wait until we get some first users?

We discussed already the option for persisting the events in the RepoStore, but have opted to avoid that extra work for now, and possibly revisit it later on if needed.

In my opinion this is a logical consequence of reconstructing the state. But I also agree (plus the same reasons as stated in the previous point) that it can wait a bit. I think it should happen before we are tempted to do any batching or similar optimisation. With persistence in place, we should be able to respond to high demand even after periods of inactivity. We just need to pay attention to the ordering of the asynchronous steps: we have to make sure we subscribe to the new events before requesting the log. Otherwise, if something blocks and takes more time than the time between two blocks, we may miss some events - that is a really extreme case I think, I just want to keep it in mind.

The validator does this already in removeSlotsThatHaveEnded, so no need to double-up imo.

I think it should be enough. Just remember that this last stage of filtering the log will not happen before the end of the current period (because of await validation.waitUntilNextPeriod()). It may be beneficial to use that time to already deal with the historical log. Just a thought...

emizzle commented 3 days ago

linea, binary search

From what I understand, we will have our own side chain, so my guess is block times will be mostly stable and we'll know when it is expected to change. Of course, that is a guess :)

The binary search sounds like a good idea. How will you know how far back (in blocks) to go on the first search?

Batching. It feels like a serious thing to me, but maybe it can wait until we get some first users?

I don't feel like batching would be a lot of effort, but yeah maybe it's best to get this working first without optimisations. However, I'd also like to hear what others' opinions are. My thoughts are that the request latency is the biggest issue. Even if validators run their own execution client, at 50ms total for a round trip (a single slotState RPC request/response), that's 50 seconds for every 1000 filled slots, and it can add up quickly. If we can reduce that by a factor of 10 (as an example batch size), then it seems much easier to work with.

The batch could be a simple view in the contract that returns the state of 10 slots in one call. The validator would get all SlotFilled events, then for every 10, make a slotStateBatched request.
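Until such a batched view exists in the contract, a client-side stopgap (not the contract-level batching described above) could be to issue the existing slotState calls concurrently in fixed-size chunks, e.g. with chronos allFutures, so that their round-trip latencies overlap. A sketch with illustrative names:

import std/sequtils
import pkg/chronos

proc slotStates(
    market: Market, slotIds: seq[SlotId],
    batchSize = 10): Future[seq[SlotState]] {.async.} =
  ## Query slot states in chunks, issuing the calls in each chunk concurrently.
  var states: seq[SlotState]
  var i = 0
  while i < slotIds.len:
    let chunk = slotIds[i ..< min(i + batchSize, slotIds.len)]
    let futures = chunk.mapIt(market.slotState(it))
    await allFutures(futures)
    for fut in futures:
      states.add(await fut)  # futures already completed; await just reads the values
    i += batchSize
  return states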

remember that this last stage of filtering of the log will happen not earlier than at the end of the current period

Good point. Maybe it would be better to do it beforehand then. removeSlotsThatHaveEnded could also use batching to speed up ;)

marcinczenko commented 3 days ago

The binary search sounds like a good idea. How will you know how far back (in blocks) to go on the first search?

I am still doing an initial estimate based on the average time between blocks and then adjusting low and high accordingly. Binary search quickly goes to the middle point anyway, so this optimisation seems to have very little impact on the number of steps (mostly no impact at all, sometimes it saves maybe one jump), so it looks like we do not have to worry about it - perhaps it is overall cheaper not to do any premature optimisation.

I am doing something like this to initialise low and high:

# the estimate narrows one side of the binary search range: if the estimated
# block is older than the target epoch time it becomes the lower bound,
# otherwise the upper bound
var low = 0
var high = latestBlockNumber
if estimatedBlockTimestamp < epochTimeUInt256:
  low = estimatedBlockNumber
else:
  high = estimatedBlockNumber

I don't feel like batching would be a lot of effort, but yeah maybe it's best to get this working first without optimisations. However, I'd also like to hear what others' opinions are. My thoughts are that the request latency is the biggest issue.

I feel the same. When saying that it is "a serious thing" I meant that we do need to keep it in mind. So yes, I agree we need to have that on the horizon. I would probably first go for state preservation and then this, but I am open to a different order as well (at least let's keep both things as separate issues/PRs, so that they are easier to get merged).

The batch could be a simple view in the contract that returns the state of 10 slots in one call. The validator would get all SlotFilled events, then for every 10, make a slotStateBatched request.

I was tempted to let the validator ask for all slots it wants in one batch, but that indeed could grow too big. So yes, this sounds like something to try (although the max number in one batch can probably be bigger).

marcinczenko commented 3 days ago

Here is a small chain-independent demo that I used to play with the search without depending on a blockchain, before porting it to our code: https://play.nim-lang.org/#pasty=ECupqAtx