Adding pruning="daily" - Githubissues

mdyring commented 2 years ago

Summary

The current pruning support based on block heights does not map well to real world use.

A common reporting pattern is "end of day" / "close of business" balances. Currently the only way to support this is pruning="nothing" which is very space inefficient.

It would be useful to have mechanism that would keep the last block for a given day based on the block timestamp.

Problem Definition

Anyone who does any serious accounting of crypto assets needs "end of day" or "end of year" balances.

To provide end of day balances one needs to either keep all historical application state (pruning="nothing"). Finding public archive nodes that keeps all blocks are getting harder by the day, as costs to operate these nodes increase.

Archive nodes are also extremely space inefficient, cosmoshub-4 is roughly 3.5TB. These large data sets stresses the IO subsystem as LevelDB needs to reorganise/reindex growing amounts of data.

Proposal

My suggested way to do implement this would be:

Keep all blocks created during the day
Once the date rolls over at midnight, prune all blocks for the (now previous) day except the last one.

This enables random queries of state during the day and ensures the pruning only happens once a day as well.

A mechanism to discover these "end of day" blocks by date would be helpful, so clients does not need to scan all blocks.

alexanderbez commented 2 years ago

A common reporting pattern is "end of day" / "close of business" balances. Currently the only way to support this is pruning="nothing" which is very space inefficient.

Blockchains don't have "close of business" haha. I presume you mean for your own personal operations so that you can live a relatively sane life? In any case, "pruning=nothing" isn't even a close approximation to achieving this. Instead, you estimate the rough number of blocks in a day and set the interval to that value so all the blocks are pruned at that height, this has the downside that you'll incur a larger memory footprint holdings all those heights in memory along with the actual act of pruning all of those heights at once, which might be very expensive and cause you to miss a block or even a few?

I don't see much value add in adding such complexity to the existing pruning mechanism personally.

mdyring commented 2 years ago

Blockchains don't have "close of business" haha. I presume you mean for your own personal operations so that you can live a relatively sane life?

Yeah, all those pesky meat-space issues. :-)

It is not just about sanity though: Being able to provide accurate Year End balances is important for tax purposes, but it also just seems like a generally useful thing to be able to show changes to account balances over time.

In any case, "pruning=nothing" isn't even a close approximation to achieving this. Instead, you estimate the rough number of blocks in a day and set the interval to that value so all the blocks are pruned at that height, this has the downside that you'll incur a larger memory footprint holdings all those heights in memory along with the actual act of pruning all of those heights at once, which might be very expensive and cause you to miss a block or even a few?

Example

Let's imagine I need to know the Year End balances of account cosmos1mtauzk3q40zt3weujcqwu009vw5t6m5fnv4xxr at midnight, 31st of December 2021.

The account was randomly selected for illustrative purposes and assume queries against https://rpc.cosmos.directory/cosmoshub.

First, the block height just before midnight needs to be identified. This can be accomplished by doing a binary search of block heights/timestamps by hand or rolling some code to do it.

It turns out height 8902500 on cosmoshub-4 is pretty close (2021-12-31T23:49:37Z). Actually, it is close as we can get to midnight as it seems no nodes behind cosmos.directory are configured with prune="nothing".

According to QuickSync this setting results in 3.5TB of state, making finding such a node the near equivalent of finding a unicorn.

Anyway, now that we closest block height it should be easy to query the account balances.

But it is not so:

~ gaiad q bank balances cosmos1taz02v2mk88de95jpumz8pyqknn525c23sz0e5 --node https://rpc.cosmos.directory:443/cosmoshub --height 8902500 --output json | jq
{
  "balances": [],
  "pagination": {
    "next_key": null,
    "total": "0"
  }
}

I can keep repeating the above command 50 times without success, cycling through the nodes behind cosmos.directory.

After numerous failures I found that just a single node, hosted by the great team at Chorus One, had kept app state for this block:

~ gaiad q bank balances cosmos1mtauzk3q40zt3weujcqwu009vw5t6m5fnv4xxr --node https://cosmos.chorus.one:443 --height 8902500 --output json
{
  "balances": [
    {
      "denom": "uatom",
      "amount": "3941990992"
    }
  ],
  "pagination": {
    "next_key": null,
    "total": "0"
  }
}

Summary of the bad news

The above illustrates that no Cosmos Hub nodes are running with prune="nothing" for obvious reasons (cost and lack of incentives).

It seems the best case is "keep every 100 blocks" and I were only able to identify one single node configured like this.

So we obviously need a better pruning strategy that balances storage costs and real world (meatspace) utility:

Keeping every block is not very useful and very expensive
Keeping every 100 blocks is by all appearances also too costly (only a single node does it)
Keeping a single block every day minimises storage costs (365 copies of state per year) and maps naturally to everyday life and accounting practices

Possible solution

Let's ignore the search needed to find the "end of day" block (as illustrated above) for now.

The actual pruning logic for pruning="daily" should be very simple:

When creating a new block, compare the new block date with the previous kept block.

If they are identical (same day), prune the previous kept block and just keep the new block.
If they are not identical (rolled over to new day), leave the previous kept block and also keep the new block.

This ensures that the most recent block is always available for the present day, and the last block of all previous dates are available as well.

tac0turtle commented 2 years ago

Block time is arbitrary, it is not something predictable by us, or if it was the complexity to get it right may elude to different issues. I'm not sure this is possible or the complexity to make it work would be greater than a node operator setting a custom prune amount.

I would prefer to redesign our storage layer to prune at x heights and place every x height in plain text in storage. Keeping detailed records like this in the merkle tree seem unnecessary to me, or would need a proof of the data as well?

mdyring commented 2 years ago

Block time is arbitrary, it is not something predictable by us, or if it was the complexity to get it right may elude to different issues. I'm not sure this is possible or the complexity to make it work would be greater than a node operator setting a custom prune amount.

I would prefer to redesign our storage layer to prune at x heights and place every x height in plain text in storage. Keeping detailed records like this in the merkle tree seem unnecessary to me, or would need a proof of the data as well?

The solution suggested above does not need predictable block times. It prunes after block times are known and is, AFAICS, very simple to implement:

After finalising height X, it just needs a way to determine the block time of height X-1 and then prune block X-1 if it falls on the same date as the block time of X.

alexanderbez commented 2 years ago

@mdyring I'm having trouble reconciling the issue here. Seems to be like two completely orthogonal things...? The initial post describes a new pruning strategy (which I don't think is necessary). The followup post seems to be describing trouble finding archive nodes? I'm lost.

In any case, when Cosmos first launched, the idea was that operators would run archive nodes themselves (or pay infra service providers) when they need historical data (e.g. explorers or clients).

After a few years, this is obviously no longer the case for various reasons (IAVL, cost of storage, disk size, etc...). So the idea now instead is rely on indexers! This gives you numerous benefits:

Queries are now completely customizable and do not require the SDK to be bloated with silly query handlers (e.g. balances by denom)
No need to run archive nodes, you can run a basic pruning node (assuming you have all data indexed

For your example, you'd ask the indexer or indexer service, "What is the balance of account A at time/height t/h?" Or "what was the last block in 2021?"

tac0turtle commented 1 year ago

this is configurable by the node operator. It wouldnt be hard to write a script that calculates blocks per day and sets pruning accordingly.

mdyring commented 1 year ago

this is configurable by the node operator. It wouldnt be hard to write a script that calculates blocks per day and sets pruning accordingly.

@tac0turtle I think this is an oversimplification: The goal would be to store the last block before day roll over.

Just storing every X block is already possible, as you correctly state, but that is useless for auditing/accounting purposes which operates on calendar dates.

@alexanderbez suggested using a 3rd party service (indexer). Unfortunately I am not aware of any such service providing point-in-time balances (or other state at a specific point in time).

So overall, I still believe this would be a useful addition.

tac0turtle commented 1 year ago

in store v2 this can be done in a simpler form, via store snapshots. Kind of how we do state sync snapshots.

Also teams like numia.xyz have data of a few chains and are constantly adding more

cosmos / cosmos-sdk

Adding pruning="daily" #13415

Summary

Problem Definition

Proposal

Example

Summary of the bad news

Possible solution