codex-storage / nim-codex

Decentralized Durability Engine
https://codex.storage
Apache License 2.0

Implement storage limits for local repository #159

Open dryajov opened 2 years ago

dryajov commented 2 years ago

Currently, the local repository doesn't implement any storage limits, so it's possible to fill up the entirety of the user's drive.

We need to add:

Bulat-Ziganshin commented 2 years ago

While working on the BlockStore API, I found that the current code doesn't use delBlock at all, meaning that the local store can only grow :) So I think this is also an important feature for any practical usage, no less than the SQLite backend.

dryajov commented 2 years ago

Yes, this is definitely an important feature for a usable product, so this should be relatively high on our priority list.

Bulat-Ziganshin commented 2 years ago

Yes, all issues that are currently in our priority list are assigned to me or @michaelsbradleyjr, so we will take care of it.

If this simple approach gets out of control, we can create a milestone for them.

michaelsbradleyjr commented 2 years ago

I have started researching some possibilities.

For SQLiteStore introduced in #160 we could make use of SQLite's DBSTAT Virtual Table.

The SQLITE_ENABLE_DBSTAT_VTAB compile-time option seems to already be enabled in our builds via nim-sqlite3-abi, e.g. this query works for a database created by an instance of type SQLiteStore

SELECT pgsize FROM dbstat WHERE name='Store' AND aggregate=TRUE;

The result is "the total amount of disk space used" in bytes. Note that "total" doesn't include "Freelist pages, pointer-map pages, [or] the lock page", so numbers reported by e.g. ls may include some overhead.

We could periodically query the dbstat info and if the database is over quota, we could sum the data size for the oldest N rows (100 or 1000 or whatever) with respect to the timestamp column that's already part of the Store table created by nim-datastore. Repeat M times until we've reached the amount of space we want to clear within some delta, then delete the oldest N * M rows.
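The check-and-evict loop described above can be sketched in Python against a simplified stand-in for the Store table. The schema, quota, and batch size here are hypothetical, and `SUM(LENGTH(data))` stands in for the dbstat figure, since the DBSTAT virtual table isn't always compiled in:

```python
import sqlite3

QUOTA_BYTES = 1_000  # hypothetical quota for illustration
BATCH = 3            # delete the oldest N rows per pass

def used_bytes(db: sqlite3.Connection) -> int:
    # Stand-in for the dbstat figure: sum of payload sizes.
    (total,) = db.execute(
        "SELECT COALESCE(SUM(LENGTH(data)), 0) FROM Store").fetchone()
    return total

def evict_until_under_quota(db: sqlite3.Connection) -> None:
    # Repeat: drop the oldest BATCH rows by timestamp until under quota.
    while used_bytes(db) > QUOTA_BYTES:
        db.execute(
            "DELETE FROM Store WHERE id IN "
            "(SELECT id FROM Store ORDER BY timestamp ASC LIMIT ?)",
            (BATCH,))
    db.commit()

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE Store (id TEXT PRIMARY KEY, data BLOB, timestamp INTEGER)")
for i in range(10):
    db.execute("INSERT INTO Store VALUES (?, ?, ?)", (f"block{i}", b"x" * 200, i))
evict_until_under_quota(db)  # 2000 bytes stored; two passes bring it to 800
```

Deleting in batches of N, repeated M times, bounds the work per pass; the real queries would go through the fields and helpers exported by nim-datastore.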

That would require some custom queries, but nim-datastore exports the necessary fields and helpers so that it's entirely doable.

We would need to do some experimentation in relation to VACUUM and whether we want/need to set auto_vacuum=FULL.

I don't have an idea yet for the best approach for FSStore, but if we end up settling on SQLiteStore then maybe we won't need to worry about it.

I'm not aware of any general cross-platform solution for a percentage-based quota. For the platforms we primarily target (Linux, macOS, Windows) we can probably shell out and use cli utilities in PATH, and log a warning if the expected utility/ies were not present, errored, or otherwise gave a result that Codex is unable to parse.
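For illustration, the computation itself is simple once a total-filesystem-size primitive exists; Python's stdlib `shutil.disk_usage` is shown here as a stand-in (the function name is hypothetical, and a Nim implementation would need its own platform calls or the CLI fallback described above):

```python
import shutil

def percentage_quota_bytes(path: str, percentage: float) -> int:
    # Translate a percentage quota into a byte limit against the total
    # size of the filesystem holding `path`.
    total = shutil.disk_usage(path).total
    return int(total * percentage / 100)

limit = percentage_quota_bytes(".", 10.0)  # 10% of the current filesystem
```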

dryajov commented 2 years ago

Doesn't SQLite provide some native facility to limit the DB size?

Bulat-Ziganshin commented 2 years ago

Doesn't SQLite provide some native facility to limit the DB size?

Even if it does, we need a solution that deletes LRU blocks, rather than a hard limit that just prevents adding new data to the DB.

michaelsbradleyjr commented 2 years ago

Taking another crack at this from an API perspective.

This is a sketch, not getting into details of how to deal with cli params, whether the API should be method or proc, or other specifics.

What if for type BlockStore we have

method quota(self: BlockStore, percentage: Percentage): Future[?!int]

method quota(self: BlockStore, bytes: int): Future[?!int]

Where the return value is the number of bytes over/under quota.

In the case of percentage: Percentage, the param value indicates the max percentage of storage space that the blockstore should be able to consume relative to the current total size of the filesystem on which it resides.

In the case of bytes: int, the param value indicates the max size of the blockstore.

How the return value is calculated would depend on the implementation of ref object of BlockStore and possibly the target OS.

And we could also have

method purge(self: BlockStore, bytes: Positive, delta: Percentage): Future[?!Natural]

Where the return value is the number of bytes actually purged.

The bytes: Positive param value indicates how many bytes should be purged.

The delta: Percentage param value indicates a percentage range +/- for the return value that is considered acceptable/expected relative to bytes: Positive. If the value to be returned is outside the range, a warning could be logged by method purge prior to return.

How the purge operation is performed would depend on the implementation of ref object of BlockStore and possibly the target OS.

It would be expected that purge implementations take into consideration LRU metadata for blocks derived from e.g. the filesystem or extra data in the database.

For SQLiteStore that would likely involve using a Datastore implementation (maybe Codex-specific) that derives from type SQLiteDatastore and has an extra column in the Store table. That's certainly within the realm of possibility, and I have some ideas for that (i.e. how that column would be periodically updated), but the specifics can be explored later here or in another context.

markspanbroek commented 2 years ago

Something to keep in mind: there are also blocks that we should keep around because they're part of a contract that we're engaged in. Not sure whether these should go into a separate store, or that the store itself should treat them differently.

Bulat-Ziganshin commented 2 years ago

Yeah, contracts and these restrictions contradict each other, so we may choose between:

Also, we essentially have two parts in blockstore, and their sizes probably should be regulated independently:

markspanbroek commented 2 years ago

The Sales module only sells storage that it's been told is available. So we only get into contracts for disk space that we explicitly marked as available.

So you can let contracts know about the restrictions.

Bulat-Ziganshin commented 2 years ago

But for starters, let's develop the system ignoring the contracts problem. I think we can use any of three approaches:

As long as we have a "last access time" for each block, we can run DELETE queries requesting to delete the N blocks with the oldest timestamps. I think this solves the simplified (contract-ignoring) problem for the SQLite backend.

So, the algorithm is:

Since SQLite is an inherently single-threaded system (AFAIK), we can still incur too large a delay on a single mass-DELETE operation. It may be preferable to delete records in smaller batches.

Actually, to distribute the load evenly, we can delete only a few records each time, or even just a single one in the extreme. So, an alternative algorithm:
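The incremental variant might look like this in Python (schema and quota are hypothetical; eviction happens one row at a time inside the put operation rather than as one mass-DELETE):

```python
import sqlite3

QUOTA_BYTES = 600  # hypothetical

def put_block(db: sqlite3.Connection, block_id: str, data: bytes, now: int) -> None:
    # Evict at most one oldest record per iteration, spreading the
    # deletion load evenly instead of issuing one large mass-DELETE.
    while True:
        (used,) = db.execute(
            "SELECT COALESCE(SUM(LENGTH(data)), 0) FROM Store").fetchone()
        if used + len(data) <= QUOTA_BYTES:
            break
        deleted = db.execute(
            "DELETE FROM Store WHERE id = "
            "(SELECT id FROM Store ORDER BY timestamp ASC LIMIT 1)").rowcount
        if deleted == 0:
            raise RuntimeError("a single block exceeds the whole quota")
    db.execute("INSERT OR REPLACE INTO Store VALUES (?, ?, ?)",
               (block_id, data, now))
    db.commit()

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE Store (id TEXT PRIMARY KEY, data BLOB, timestamp INTEGER)")
for t in range(8):
    put_block(db, f"block{t}", b"x" * 100, t)
# block0 and block1 (oldest timestamps) were evicted to stay within quota
```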

Bulat-Ziganshin commented 2 years ago

So you can let contracts know about the restrictions.

@markspanbroek I'm thinking about adding to BlockStore a request availableContractSpace: int64 or isAvailableContractSpace(int64), so that when you get an opportunity to fill a slot, you consult the BlockStore to check whether we can participate. What do you think about it?

Sidenote: we could have such checks with other subsystems too (CPU, network, and the disk bandwidth required for PoRs).

markspanbroek commented 2 years ago

I would rather have a way to reserve space in the BlockStore. I expect a user would want to explicitly dedicate an amount of storage for sale. There's a REST API endpoint for doing that in the codex node already. When the user indicates that they want to sell e.g. 100 GB, the store should always keep this space available for contracts to use.

dryajov commented 2 years ago

For now we simply need something that will refuse to write past a specific limit, without doing any automatic deletion or cleanup.

What I would like to see as a first step is:

Bulat-Ziganshin commented 2 years ago

@dryajov this approach will stop caching of new data once we've reached the limit. So, yes - it will work, but data downloaded from the network will no longer be cached. In particular, #171 (prefetching) will not work.

dryajov commented 2 years ago

@dryajov this approach will stop caching of new data once we've reached the limit. So, yes - it will work, but data downloaded from the network will no longer be cached. In particular, #171 (prefetching) will not work.

That's correct. I want the minimum possible set of functionality right now; we can evolve this iteratively. In particular, it should be fine to stop storing new blocks once we reach the limit.

Bulat-Ziganshin commented 2 years ago

In particular this means that ECC will become useless - we rely on storing decoded data in the local store.

dryajov commented 2 years ago

In particular this means that ECC will become useless - we rely on storing decoded data in the local store.

How so? We're only talking about limiting storage use, nothing else. Sure, some files would not be downloadable in full due to size, but this is already true - you can't download more than the size of the hard drive.

I would like to avoid all the complexity associated with "hot" pruning (client is running) for now and focus on a simple and concrete task. We'll come back to this once we have a better understanding of how the software behaves and what sort of real requirements we are working with.

As a second step, we can do "cold" pruning (client is stopped), which should give us a good understanding of what sort of issues we're going to be dealing with once we move to "hot" pruning.

Bulat-Ziganshin commented 2 years ago

How so?

Recall this discussion about erasureJob:

So, once we stop saving data from the network to the local store, we may get very slow downloads and ECC will definitely stop working. The prefetcher/decoder overall rely on our ability to store an entire file in cache.

dryajov commented 2 years ago

How so?

Recall this discussion about erasureJob:

  • We spawn erasureJob or prefetchBlocks and then download data sequentially by block CIDs via StoreStream
  • So, if all the original data is fine, we will not prefetch data, and just receive it in a slow sequential way via StoreStream
  • But if some original block is lost, we can't restore it, because we rely on sending recovered blocks via the local store

So, once we stop saving data from the network to the local store, we may get very slow downloads and ECC will definitely stop working. The prefetcher/decoder overall rely on our ability to store an entire file in cache.

Yes, that's expected. I never meant this issue to handle more complex cases than simple limits on the total number of bytes the repo can occupy. We can live with some level of degraded functionality once we have reached our storage quota; the alternative is simply stopping the process.

Doing proper hot pruning is not an easy task and overall buys us very little at this point. We should improve it, but I don't see any pressing need to do it immediately.

Bulat-Ziganshin commented 2 years ago

From discussion with Dmitry:

  1. Implement a hard limit that stops writing new data (putBlock) once the limit is reached
  2. Compute current disk usage with a background task executed ~once per second (just sum up the size of the SQLite files)
  3. Return to the marketing team the amount of disk quota not yet used - add an API node.unusedDiskQuota()

@emizzle will it work for you?

emizzle commented 2 years ago

Return to the marketing team the amount of disk quota not yet used - add API node.unusedDiskQuota()

I'm not sure if this node > sales info would be useful. Instead, perhaps the information should flow in the opposite direction, such that the sales module can inform the datastore how much space it is using in established contracts or wants to use in future contracts, i.e. "reserve" the space in the datastore?

As @markspanbroek pointed out, currently a host specifies how much space they want to make available to sell (via the REST API). This "availability" is modified by the sales module: when a contract is started, the availability is removed (even if the availability was greater than the amount consumed in the contract -- something we should probably optimise later), and when the contract has finished, the availability is added back. In either case, once availability is added by the host (via the REST API), perhaps the sales module should inform the datastore to "reserve" this space so it is not consumed otherwise? What do you think @Bulat-Ziganshin?

The flow could be something like this:

  1. Host adds availability via the REST API
  2. Sales module asks the datastore to reserve the space
  3. Datastore accepts the reservation if enough space is available, or rejects it if not
  4. REST API response back to the host includes a reservation id / bool, or an error
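The flow above can be modeled with a toy in-memory sketch (all names here are hypothetical; the real datastore API is still to be designed):

```python
import itertools
from typing import Optional

class Datastore:
    """Toy model of the proposed reservation flow; all names are hypothetical."""

    def __init__(self, quota_bytes: int):
        self.quota_bytes = quota_bytes
        self.used_bytes = 0
        self.reservations: dict[int, int] = {}
        self._next_id = itertools.count(1)

    def reserve(self, nbytes: int) -> Optional[int]:
        # Step 3: accept only if enough unreserved space remains.
        free = self.quota_bytes - self.used_bytes - sum(self.reservations.values())
        if nbytes > free:
            return None  # rejected: not enough space
        rid = next(self._next_id)
        self.reservations[rid] = nbytes
        return rid

def add_availability(store: Datastore, nbytes: int) -> dict:
    # Steps 1-4: the REST handler asks the datastore to reserve space,
    # then replies with a reservation id or an error.
    rid = store.reserve(nbytes)
    return {"reservation": rid} if rid else {"error": "insufficient space"}

store = Datastore(quota_bytes=100)
add_availability(store, 60)  # accepted, reservation id 1
add_availability(store, 60)  # rejected: only 40 bytes remain unreserved
```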

Bulat-Ziganshin commented 2 years ago

First, these new REST APIs should probably be documented. It's up to you, just a side thought.

Check Dmitry's post before mine: https://github.com/status-im/nim-codex/issues/159#issuecomment-1205915683

He proposed implementing a very simple approach - we configure the maximum disk usage of the BlockStore, and once the quota is reached, we just stop writing to the store. The obvious idea behind that is just to allow a node operator to set a hard limit on disk usage by Codex. As he said, it's just a quick-and-dirty solution, and it would be better to implement it for the October release.

Your REST API is already more demanding: 1) it requires the ability to compute how much space we already occupy - well, this feature is already required for Dmitry's approach, so we need to implement it anyway; 2) it requires the ability to reserve space for contracts. That can't be implemented now, because a) we don't have specific calls to distinguish between writing a contracted block and caching an external block, and b) we can't distinguish between contracted and cached blocks inside the BlockStore.

I tried to discuss the extended variant of the feature here with Mark - before Dmitry said that we need to implement just the basic one. My thought - let's implement the basic one for starters; its implementation is in any case the first half of the implementation of the extended feature, plus it has a better chance of being implemented before the October release.

emizzle commented 2 years ago

Understood. Keeping it simple for now for phase 1 with iterations later sounds like a very reasonable approach.

Perhaps this leaves us with two options:

  1. Your initial suggestion of creating a datastore API call for unusedDiskQuota(), which could be called when a host is attempting to reserve available space via the REST API call.
  2. Create a datastore API call that asks to reserve space; however, at this point it would simply return a bool for whether or not there is currently sufficient space.

In my opinion, option 1 is the simplest for now, however option 2 is most similar to how the flow should be later on.

Bulat-Ziganshin commented 2 years ago

@emizzle EDIT: Yeah, I finally got it. You are right - let's do it the way you proposed. Also, I would like to modify the putBlock API now, so we can distinguish between calls for external and paid blocks - in order not to bother you later with this change.

emizzle commented 2 years ago

it would be better to combine all changes in APIs called by Contracts together so that you can make all modifications on your side in a single PR

Sounds good. I actually don't think any contract changes are required; it would only be a call to the "stupid" reserveSpace API call in the datastore from the REST API call in nim-codex. But yes, I'm happy to put all the needed changes (won't be much) in a single PR once that API call has been implemented (or at least on a branch that I can branch off of).

Bulat-Ziganshin commented 2 years ago

I tried to find how contracts use putBlock and didn't find anything - in the entire program, putBlock is called in 3 places: erasure.nim, the block exchange for blocks received from the network, and node.store() called by the upload REST API.

emizzle commented 2 years ago

That's correct. Contract interaction with the blockstore is currently limited to the onStore callback: https://github.com/status-im/nim-codex/blob/e0726cbfb9b22dddf96c09f7523a3c714280fb37/codex/node.nim#L324-L345

fetchBatched eventually calls getBlock, which requests the blocks from the block exchange. Once resolved, these blocks will be added to the local store via putBlock: https://github.com/status-im/nim-codex/blob/e0726cbfb9b22dddf96c09f7523a3c714280fb37/codex/blockexchange/engine/engine.nim#L247-L257

dryajov commented 2 years ago

perhaps the sales module should inform the datastore to "reserve" this space to not be consumed?

I don't think this is the correct approach. The functionality I have in mind is as follows - the user sets a quota that the node is allowed to consume, the node will return how much of that quota is still available for sale.

Also, we should not conflate caching blocks with persisting blocks, they might use the same underlying machinery (i.e. repo abstraction), but they should be configurable separately.

The difference between the two is:

What I'm proposing as a stopgap solution, is to implement the basic caching quota functionality. What the repo should be able to do as a first step is:

  1. report total amount of bytes used
  2. error out if we try to add blocks past the quota

To support (1) and avoid lengthy scans of the repo to calculate the total used bytes, we should keep track of how many bytes have been written to and/or evicted from it. This can be recorded under /total/cachebytes and /total/persistentbytes (or similar).


Full functionality extensions

For the most part, both the caching and persistent quotas can be supported by the same underlying implementation. My current thinking is that we keep some metadata per block to support caching and persisting blocks. We need to keep track of the last time the block was accessed so we can evict it from the caching store, and we need a flag that indicates whether the block should be counted as part of the caching or persistent quota.

The algorithm for storing blocks in the repo looks something like the following:

  1. When storing a block:
     1.1. If the block should be persisted, we set its persistent flag to true.
     1.2. We set its last accessed field to the current time.
     1.3. If the block is persistent, we count it against /total/persistentbytes; otherwise we count it against /total/cachebytes.
  2. When we've reached the caching quota limit:
     2.1. If a block's persistent flag is not set, we evict the blocks with the oldest last accessed timestamps.
     2.2. If a block's persistent flag is set, we do nothing.
     2.3. If the caching quota has been reached and no blocks can be evicted (all blocks are persistent), we should not accept any more blocks - i.e. return an error.
  3. Once all storage contracts that reference a particular block have finished:
     3.1. Set its persistent flag to false.
     3.2. Subtract the block's bytes from /total/persistentbytes and add them to /total/cachebytes.
     3.3. If the caching quota has been reached, kick in the cleanup process as per (2).
  4. When we've reached the persistent quota limit, we should not accept any more blocks - i.e. return an error.
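The whole scheme can be sketched as an in-memory model (quota values are hypothetical; a real implementation would persist the two running totals under keys like /total/cachebytes and /total/persistentbytes rather than holding them in memory):

```python
import time
from dataclasses import dataclass

@dataclass
class Meta:
    size: int
    persistent: bool
    last_accessed: float

class Repo:
    """In-memory sketch of the algorithm above; names are hypothetical."""

    def __init__(self, cache_quota: int, persist_quota: int):
        self.cache_quota = cache_quota
        self.persist_quota = persist_quota
        self.cache_bytes = 0     # /total/cachebytes
        self.persist_bytes = 0   # /total/persistentbytes
        self.blocks: dict[str, Meta] = {}

    def put(self, cid: str, size: int, persistent: bool) -> bool:
        if persistent:
            # 4. persistent quota reached -> refuse the block
            if self.persist_bytes + size > self.persist_quota:
                return False
            self.persist_bytes += size
        else:
            # 2. evict oldest non-persistent blocks to make room
            while self.cache_bytes + size > self.cache_quota:
                if not self._evict_oldest():
                    return False  # 2.3: nothing evictable -> refuse
            self.cache_bytes += size
        # 1. record the flag and the last-accessed time
        self.blocks[cid] = Meta(size, persistent, time.monotonic())
        return True

    def _evict_oldest(self) -> bool:
        cached = [(m.last_accessed, cid) for cid, m in self.blocks.items()
                  if not m.persistent]
        if not cached:
            return False
        _, cid = min(cached)
        self.cache_bytes -= self.blocks.pop(cid).size
        return True

    def contracts_finished(self, cid: str) -> None:
        # 3. flip the flag and move the bytes between the two totals
        meta = self.blocks[cid]
        meta.persistent = False
        self.persist_bytes -= meta.size
        self.cache_bytes += meta.size
        while self.cache_bytes > self.cache_quota:
            if not self._evict_oldest():
                break

repo = Repo(cache_quota=200, persist_quota=100)
repo.put("contract-block", 100, persistent=True)
```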


emizzle commented 2 years ago

The functionality I have in mind is as follows - the user sets a quota that the node is allowed to consume, the node will return how much of that quota is still available for sale.

Makes sense. We can update the rest api endpoint to check if there's enough persistent quota available in the datastore.

The maximum amount allowed to be advertised for sale is persistent quota - used persistent quota

@dryajov, will there be a datastore api that allows us to check the amount of persistent storage remaining?

c-blake commented 2 years ago

For the FSStore DataStore, it is pretty easy to reserve space by creating a RESERVE file of 100 GiB (or whatever). Then you shrink that file as you add real chunks and grow it as you purge, and we can ensure we can always honor the contract. The reserve file's size makes implementing quota-used just a getFileSize.

This seems like a simple, serviceable & OS-portable quota system. Most of us - and even many non-expert users - have a lot of experience filling filesystems to 100% and then recovering usable space by deleting/truncating files, successfully even if there are competing writers getting intermittent ENOSPCs. So, I think this, too, could work, even with users pushing things to the brink and returning from said brink.

{ Some, but maybe not all, subtleties: for our code we can block writes, but the FS can fill up after we shrink RESERVE and before we use the space, and selling against a 99+% full FS with alien writers is insoluble. To reserve space, we need either posix_fallocate or to write the file with data to pre-allocate it, rather than just making a "hole". Data written to a de-duplicating FS must be random; writing at all must be optional, to not unduly wear out flash memory in testing. There might also need to be a little extra data freed up from RESERVE for directory files. }
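The RESERVE-file bookkeeping is easy to prototype (a Python sketch; sizes are scaled down for illustration, and os.truncate is used for both shrinking and growing, which on grow leaves a sparse hole - a real implementation would rewrite real data or use posix_fallocate, per the subtleties above):

```python
import os
import tempfile

RESERVE_BYTES = 1_000_000  # scaled down from e.g. 100 GiB for illustration

repo_dir = tempfile.mkdtemp()
reserve = os.path.join(repo_dir, "RESERVE")

# Pre-allocate with real (random) data so the filesystem commits the space
# and a de-duplicating FS can't collapse it.
with open(reserve, "wb") as f:
    f.write(os.urandom(RESERVE_BYTES))

def store_chunk(path: str, data: bytes) -> None:
    # Shrink RESERVE by the chunk size, then write the chunk.
    os.truncate(reserve, os.path.getsize(reserve) - len(data))
    with open(path, "wb") as f:
        f.write(data)

def purge_chunk(path: str) -> None:
    # Grow RESERVE back by the purged chunk's size. (Truncate-to-grow
    # leaves a sparse hole; a real version would write data back.)
    size = os.path.getsize(path)
    os.remove(path)
    os.truncate(reserve, os.path.getsize(reserve) + size)

store_chunk(os.path.join(repo_dir, "chunk1"), b"x" * 4096)
used = RESERVE_BYTES - os.path.getsize(reserve)  # quota-used via getFileSize
```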

For SQLiteStore, big questions arise, some of which @michaelsbradleyjr has started on: "Does VACUUM work if the host FS is 100% full, when compaction matters most - and if not, are you just stuck?", "If it works, what is the cost?", "Does DELETE,INSERT in the same table (or any other reservation idea) work as hoped with BLOB types of varying chunk sizes?" Maybe more.

(A user with admin-ish rights on Unix can make an "FS within a file" to at least keep codex from crashing unluckily partitioned systems with disk-full errors. Maybe worth mentioning to early adopters as a way to not have to trust whatever we do here. Apologies if this is old advice. It also has application in testing how well things work as you fill the host FS.)

dryajov commented 1 year ago

Initial implementation, related to limits has been done in #319; subsequent work for repo maintenance is being carried out in #347.

benbierens commented 1 year ago

I was thinking that, for the purpose of deciding when to delete blocks, there need be no difference between cached and contract blocks: cached blocks are kept for a short time in accordance with the caching settings, and contract blocks are kept for a longer, but still limited, time in accordance with the contract. I was thinking the expiration datetime for a contract block should be the duration of the contract + some grace period (hours? days? configurable).

I am unsure about whether cached blocks and contract blocks can be treated identically when it comes to quotas. Do we want one max quota for the entire node, or one quota for caching and another for contract storage?

Additionally, I like the idea of marking a block as non-persistent when the contract has expired. Currently, we delete blocks as soon as their expiration date has been reached. Perhaps we could consider a strategy of "don't delete unless you have to", where we fill up the quota on purpose and start cleaning the oldest, least interesting, no-longer-persistent blocks only when we have no other way to respect the quota. Downsides: write performance goes down, because we may need to quickly make some room, and it becomes trickier to answer the question of how much space is available.

dryajov commented 1 year ago

Some of the questions raised in this issue have already been addressed by the repostore implementation.

Cached blocks are kept for a short time in accordance with the caching settings. Contract blocks are kept for a longer but still limited time in accordance with the contract.

This is how it works already: the repostore treats all blocks equally and relies only on expiration to evict from the store.

I was thinking the expiration datetime for a contract block should be the duration of the contract + some grace period (hours?, days?, configurable).

This might be a nice addition, a grace period certainly makes sense in some circumstances...

I am unsure about whether the cached blocks and contract blocks can be treated identically when it comes to quotas.

The repostore has a notion of reserved storage and used storage. The marketplace can "reserve" the required amount before announcing/bidding on requests and release it right before writing blocks. This is a super simple but powerful mechanism that should cover all foreseeable use cases.

The flow is super simple: there is a reserve call, which will reserve a number of bytes counted against the used quota bytes - it should be called right before bidding on storage requests; and there is a release call, which should be called right before storing a block (note that I said block, not blocks - this is to prevent concurrency issues, but if that becomes problematic we can always add some synchronization).
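The reserve/release accounting described here can be modeled in a few lines (a toy Python sketch; the actual repostore API in nim-codex differs in detail):

```python
class RepoStore:
    """Toy model of the reserved/used quota accounting described above."""

    def __init__(self, quota_bytes: int):
        self.quota_bytes = quota_bytes
        self.used = 0       # bytes actually written
        self.reserved = 0   # bytes promised to contracts, not yet written

    def reserve(self, nbytes: int) -> bool:
        # Call right before bidding on a storage request.
        if self.used + self.reserved + nbytes > self.quota_bytes:
            return False
        self.reserved += nbytes
        return True

    def release(self, nbytes: int) -> None:
        # Call right before storing a single block: move that block's
        # bytes from "reserved" to "used" one block at a time.
        self.reserved -= nbytes
        self.used += nbytes

repo = RepoStore(quota_bytes=1000)
repo.reserve(600)   # bid on a 600-byte request
repo.release(100)   # about to store one 100-byte block
# repo.used == 100, repo.reserved == 500
```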