dryajov opened this issue 2 years ago
While working on the BlockStore API, I found that the current code doesn't use delBlock at all, meaning that the local store can only grow :) So I think this is also an important feature for any practical usage, no less than the SQLite backend.
Yes, this is definitely an important feature for a usable product, so this should be relatively high on our priority list.
Yes, all issues currently on our priority list are assigned to me or @michaelsbradleyjr, so we will take care of it.
If this simple approach gets out of control, we can create a milestone for them.
I have started researching some possibilities.
For `SQLiteStore` introduced in #160 we could make use of SQLite's DBSTAT Virtual Table. The `SQLITE_ENABLE_DBSTAT_VTAB` compile-time option seems to already be enabled in our builds via nim-sqlite3-abi, e.g. this query works for a database created by an instance of type `SQLiteStore`:

```sql
SELECT pgsize FROM dbstat WHERE name='Store' AND aggregate=TRUE;
```

The result is "the total amount of disk space used" in bytes. Note that "total" doesn't include "Freelist pages, pointer-map pages, [or] the lock page", so numbers reported by e.g. `ls` may include some overhead.
We could periodically query the `dbstat` info and, if the database is over quota, sum the data size for the oldest N rows (100 or 1000 or whatever) with respect to the timestamp column that's already part of the `Store` table created by nim-datastore. Repeat M times until we've reached the amount of space we want to clear within some delta, then delete the oldest N * M rows.
That would require some custom queries, but nim-datastore exports the necessary fields and helpers so that it's entirely doable.
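As an illustration, here is a hedged sketch of that eviction loop in Python against a simplified `Store(timestamp, data)` table. The helper name `evict_oldest` and the size estimate via `LENGTH(data)` are assumptions for the sketch; the real nim-datastore schema and a dbstat-based size query would differ.

```python
import sqlite3

def evict_oldest(conn: sqlite3.Connection, quota_bytes: int, batch: int = 100) -> int:
    """Delete the oldest rows in batches until the stored data fits the quota.

    Returns the number of rows requested for deletion. The blob-length sum is
    a stand-in for a dbstat-based size query."""
    deleted = 0
    while True:
        (used,) = conn.execute(
            "SELECT COALESCE(SUM(LENGTH(data)), 0) FROM Store").fetchone()
        if used <= quota_bytes:
            return deleted
        # Drop the `batch` rows with the oldest timestamps.
        conn.execute(
            "DELETE FROM Store WHERE rowid IN "
            "(SELECT rowid FROM Store ORDER BY timestamp ASC LIMIT ?)",
            (batch,))
        deleted += batch

# usage: 50 rows of 10 bytes each (500 bytes), quota of 200 bytes
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Store (timestamp INTEGER, data BLOB)")
conn.executemany("INSERT INTO Store VALUES (?, ?)",
                 [(t, b"x" * 10) for t in range(50)])
n = evict_oldest(conn, quota_bytes=200, batch=10)
```

Deleting by `rowid IN (SELECT ... ORDER BY timestamp LIMIT ?)` keeps each DELETE bounded, which matters once batching is considered.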
We would need to do some experimentation in relation to `VACUUM` and whether we want/need to set `auto_vacuum=FULL`.
I don't have an idea yet for the best approach for `FSStore`, but if we end up settling on `SQLiteStore` then maybe we won't need to worry about it.
I'm not aware of any general cross-platform solution for a percentage-based quota. For the platforms we primarily target (Linux, macOS, Windows) we can probably shell out and use CLI utilities in `PATH`, and log a warning if the expected utility/utilities are not present, error out, or otherwise give a result that Codex is unable to parse.
Doesn't SQLite provide some native facility to limit the DB size?
> Doesn't SQLite provide some native facility to limit the DB size?

Even if it does, we need a solution that deletes LRU blocks, rather than a hard limit that would just prevent adding new data to the DB.
Taking another crack at this from an API perspective. This is a sketch, not getting into details of how to deal with CLI params, whether the API should be `method` or `proc`, or other specifics.
What if for type `BlockStore` we have

```nim
method quota(self: BlockStore, percentage: Percentage): Future[?!int]
method quota(self: BlockStore, bytes: int): Future[?!int]
```
Where the return value is the number of bytes over/under quota.
In the case of `percentage: Percentage`, the param value indicates the max percentage of storage space that the blockstore should be able to consume relative to the current total size of the filesystem on which it resides. In the case of `bytes: int`, the param value indicates the max size of the blockstore.

How the return value is calculated would depend on the implementation of `ref object of BlockStore` and possibly the target OS.
And we could also have

```nim
method purge(self: BlockStore, bytes: Positive, delta: Percentage): Future[?!Natural]
```
Where the return value is the number of bytes actually purged.
The `bytes: Positive` param value indicates how many bytes should be purged. The `delta: Percentage` param value indicates a percentage range +/- for the return value that is considered acceptable/expected relative to `bytes: Positive`. If the value to be returned is outside the range, a warning could be logged by `method purge` prior to return.

How the `purge` operation is performed would depend on the implementation of `ref object of BlockStore` and possibly the target OS.
It would be expected that `purge` implementations take into consideration LRU metadata for blocks, derived from e.g. the filesystem or extra data in the database.
For `SQLiteStore` that would likely involve using a `Datastore` implementation (maybe Codex-specific) that derives from type `SQLiteDatastore` and has an extra column in the `Store` table. That's certainly within the realm of possibility, and I have some ideas for that (i.e. how that column would be periodically updated), but the specifics can be explored later here or in another context.
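To make the proposed semantics concrete, here is a hedged toy model in Python rather than Nim. The class and method names (`InMemoryBlockStore`, `put`, `used`) are invented for illustration and are not the nim-codex API: `quota` reports bytes over/under the limit, and `purge` evicts least-recently-accessed blocks, warning when the freed amount falls outside the delta tolerance.

```python
import logging
import time

log = logging.getLogger("blockstore")

class InMemoryBlockStore:
    """Toy stand-in for a BlockStore implementation; all names here are
    illustrative, not the actual nim-codex API."""

    def __init__(self, quota_bytes: int):
        self.quota_bytes = quota_bytes
        self.blocks = {}  # cid -> (data, last_accessed)

    def put(self, cid: str, data: bytes) -> None:
        self.blocks[cid] = (data, time.monotonic())

    def used(self) -> int:
        return sum(len(data) for data, _ in self.blocks.values())

    def quota(self) -> int:
        """Bytes over (positive) or under (negative) quota."""
        return self.used() - self.quota_bytes

    def purge(self, nbytes: int, delta_pct: float = 10.0) -> int:
        """Evict least-recently-accessed blocks until ~nbytes are freed;
        warn if the freed amount is outside +/- delta_pct of the request."""
        freed = 0
        for cid in sorted(self.blocks, key=lambda c: self.blocks[c][1]):
            if freed >= nbytes:
                break
            data, _ = self.blocks.pop(cid)
            freed += len(data)
        if abs(freed - nbytes) > nbytes * delta_pct / 100:
            log.warning("purged %d bytes, requested %d", freed, nbytes)
        return freed

# usage: 100-byte quota, three 50-byte blocks -> 50 bytes over quota
store = InMemoryBlockStore(100)
for cid in ("a", "b", "c"):
    store.put(cid, b"#" * 50)
over = store.quota()  # positive: over quota
store.purge(50)       # evicts the least recently accessed block ("a")
```

The delta check mirrors the signature above: block sizes rarely divide the requested byte count evenly, so some slack around the target is expected.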
Something to keep in mind: there are also blocks that we should keep around because they're part of a contract that we're engaged in. Not sure whether these should go into a separate store, or whether the store itself should treat them differently.
Yeah, contracts and these restrictions are in conflict, so we may need to choose between:

Also, we essentially have two parts in the blockstore, and their sizes should probably be regulated independently:

The Sales module only sells storage that it's been told is available. So we only get into contracts for disk space that we explicitly marked as available.
So you can let contracts know about the restrictions.
But for starters, let's develop the system ignoring the contracts problem. I think we can use any of three approaches:

Since we have a "last access time" for each block, we can run DELETE queries deleting the N blocks with the oldest timestamps. I think this solves the simplified (contract-ignoring) problem for the SQLite backend.
So, the algorithm is:
- add a `setLimits(BlockStore; lowLimit, highLimit: int64)` API and corresponding config parameters.

Since SQLite is inherently a single-threaded system (AFAIK), we can still incur too large a delay on a single mass-DELETE operation. It may be preferable to delete records in smaller batches.
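A sketch of the high/low watermark idea, assuming hypothetical `lowLimit`/`highLimit` semantics: nothing happens until usage crosses the high watermark, then the oldest records are deleted in small batches until usage drops below the low watermark, spreading the DELETE cost over several short operations.

```python
import sqlite3

LOW_LIMIT, HIGH_LIMIT = 300, 500  # hypothetical watermarks, in bytes

def maybe_evict(conn: sqlite3.Connection, batch: int = 5) -> None:
    """Once usage crosses HIGH_LIMIT, delete the oldest rows in small
    batches until usage drops below LOW_LIMIT."""
    def used() -> int:
        return conn.execute(
            "SELECT COALESCE(SUM(LENGTH(data)), 0) FROM Store").fetchone()[0]

    if used() <= HIGH_LIMIT:
        return  # below the high watermark: nothing to do
    while used() > LOW_LIMIT:
        conn.execute(
            "DELETE FROM Store WHERE rowid IN "
            "(SELECT rowid FROM Store ORDER BY timestamp LIMIT ?)", (batch,))

# usage: 30 rows of 20 bytes each (600 bytes) trips the 500-byte watermark
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Store (timestamp INTEGER, data BLOB)")
conn.executemany("INSERT INTO Store VALUES (?, ?)",
                 [(t, b"x" * 20) for t in range(30)])
maybe_evict(conn)
```

The gap between the two limits provides hysteresis: eviction runs rarely but reclaims a meaningful amount each time, instead of thrashing at a single threshold.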
Actually, to distribute the load evenly, we can delete only a few records each time, or even just a single one in the extreme case. So, an alternative algorithm:

`SELECT COUNT` is expensive, but we may make these requests regularly to make sure we aren't off.

> So you can let contracts know about the restrictions.
@markspanbroek I'm thinking about adding to BlockStore a request `availableContractSpace: int64` or `isAvailableContractSpace(int64)`, so when you get an opportunity to fill a slot, you consult the BlockStore to check whether we can participate. What do you think about it?

Side note: we could have such checks with other subsystems too (CPU, network, and disk bandwidth required for PoRs).
I would rather have a way to reserve space in the BlockStore. I expect a user would want to dedicate an amount of storage for sale explicitly. There's a REST API endpoint for doing that in the codex node already. When the user indicates that they want to sell e.g. 100 GB, the store should always keep this space available for contracts to use.
For now we simply need something that will refuse to write past a specific limit, without doing any automatic deletion or cleanup.
What I would like to see as a first step is:
@dryajov this approach will stop caching of new data once we've reached the limit. So, yes - it will work, but data downloaded from the network will no longer be cached. In particular, #171 (prefetching) will not work.
> @dryajov this approach will stop caching of new data once we've reached the limit. So, yes - it will work, but data downloaded from the network will no longer be cached. In particular, #171 (prefetching) will not work.

That's correct; I want the minimum possible set of functionality right now, and we can evolve this iteratively. In particular, it should be fine to stop storing new blocks once we reach the limit.
In particular, this means that ECC will become useless - we rely on storing decoded data in the local store.
> In particular, this means that ECC will become useless - we rely on storing decoded data in the local store.

How so? We're only talking about limiting storage use, nothing else. Sure, some files would not be downloadable in full due to size, but this is already true - you can't download more than the size of the hard drive.
I would like to avoid all the complexity associated with "hot" pruning (client is running) for now and focus on a simple and concrete task. We'll come back to this once we have a better understanding of how the software behaves and what sort of real requirements we are working with.
As a second step, we can do "cold" pruning (client is stopped), which should give us a good understanding of what sort of issues we're going to be dealing with once we move to "hot" pruning.
> How so?

Recall this discussion about `erasureJob`:

- We spawn erasureJob or prefetchBlocks and then go to download data sequentially by block CIDs via StoreStream
- So, if all the original data is fine, we will not prefetch data, and will just receive it in a slow sequential way via StoreStream
- But if some original block is lost, we can't restore it, because we rely on delivering recovered blocks via the local store

So, once we stop saving data from the network to the local store, we may get very slow downloads, and ECC will definitely stop working. The prefetcher/decoder rely on our ability to store the entire file in the cache.
Yes, that's expected; I never meant this issue to handle more complex cases than simple limits on the total number of bytes the repo can occupy. We can live with some level of degraded functionality once we have reached our storage quota, the alternative being simply stopping the process.

Doing proper hot pruning is not an easy task and overall buys us very little at this point. We should improve it, but I don't see any pressing need to do it immediately.
From discussion with Dmitry: return to the marketing team the amount of disk quota not yet used - add API `node.unusedDiskQuota()`. @emizzle will it work for you?
> Return to the marketing team the amount of disk quota not yet used - add API `node.unusedDiskQuota()`
I'm not sure if this node > sales info would be useful. Instead, perhaps the information should flow in the opposite direction, such that the sales module can inform the datastore how much space it is using in established contracts or wants to use in future contracts, i.e. "reserve" the space in the datastore?
As @markspanbroek pointed out, currently a host specifies how much space they want to make available to sell (via the REST API). This "availability" is modified by the sales module, such that when a contract is started, the availability is removed (even if the availability was greater than the amount consumed in the contract - something we should probably optimise later), and when the contract has finished, the availability is added back. In either case, once availability is added by the host (via the REST API), perhaps the sales module should inform the datastore to "reserve" this space so it isn't consumed? What do you think @Bulat-Ziganshin?
The flow could be something like this:
First, these new REST APIs should probably be documented. It's up to you, just a side thought.
Check Dmitry's post before mine: https://github.com/status-im/nim-codex/issues/159#issuecomment-1205915683
He proposed implementing a very simple approach - we configure the maximum disk usage by BlockStore, and once the quota is reached, we just stop writing to the store. The obvious idea behind that is just to allow a node operator to set a hard limit on disk usage by Codex. As he said, it's just a quick-and-dirty solution, and it would be better to implement it before the October release.
Your REST API is already more demanding: 1) it requires an ability to compute how much space we already occupy - well, this feature is already required for Dmitry's approach, so we need to implement it anyway; 2) it requires an ability to reserve space for contracts. It can't be implemented now because a) we don't have specific calls to distinguish between writing a contracted block and caching an external block, and b) we can't distinguish between contracted and cached blocks inside BlockStore.
I tried to discuss the extended variant of the feature here with Mark - before Dmitry said that we need to implement just the basic one. My thought - let's implement the basic one for starters; its implementation is in any case the first half of the implementation of the extended feature, plus it has a better chance of being implemented before the October release.
Understood. Keeping it simple for now for phase 1 with iterations later sounds like a very reasonable approach.
Perhaps this leaves us with two options:

- `unusedDiskQuota()`, which could be called when a host is attempting to reserve available space via the REST API call.

In my opinion, option 1 is the simplest for now; however, option 2 is most similar to how the flow should work later on.
@emizzle EDIT: Yeah, I finally got it. You are right - let's do it the way you proposed. Also, I would like to modify the putBlock API now, so we can distinguish between calls for external and paid blocks - in order not to bother you later with this change.
It would be better to combine all changes in the APIs called by Contracts together, so that you can make all modifications on your side in a single PR.
Sounds good. I actually don't think any contract changes are required; it would only be a call to the "stupid" `reserveSpace` API call in datastore from the REST API call in nim-codex. But yes, I'm happy to put all the needed changes (won't be much) in a single PR once that API call has been implemented (or at least on a branch that I can branch off of).
I tried to find how contracts use putBlock and didn't find anything - in the entire program, putBlock is called in 3 places: erasure.nim, the block exchange for blocks received from the network, and node.store() called by the upload REST API.
That's correct. Contract interaction with the blockstore is currently limited to the `onStore` callback: https://github.com/status-im/nim-codex/blob/e0726cbfb9b22dddf96c09f7523a3c714280fb37/codex/node.nim#L324-L345

`fetchBatched` eventually calls `getBlock`, which requests the blocks from the block exchange. Once resolved, these blocks will be added to the local store via `putBlock`: https://github.com/status-im/nim-codex/blob/e0726cbfb9b22dddf96c09f7523a3c714280fb37/codex/blockexchange/engine/engine.nim#L247-L257
> perhaps the sales module should inform the datastore to "reserve" this space to not be consumed?
I don't think this is the correct approach. The functionality I have in mind is as follows: the user sets a quota that the node is allowed to consume, and the node will return how much of that quota is still available for sale.
Also, we should not conflate caching blocks with persisting blocks; they might use the same underlying machinery (i.e. the repo abstraction), but they should be configurable separately.
The difference between the two is:

- `caching quota` will be used when downloading/uploading data, while `persistent quota` will be advertised for sale
- `caching quota + persistent quota`
- `persistent quota - used persistent quota`
- `caching quota`
- `caching quota`
What I'm proposing as a stopgap solution is to implement the basic `caching quota` functionality. What the repo should be able to do as a first step is:
To support (1) and avoid lengthy scans of the repo to calculate the total used bytes, we should keep track of how many bytes have been written and/or evicted to/from it. This can be recorded under `/total/cachebytes` and `/total/persistentbytes` (or similar).
For the most part, both the caching and persistent quotas can be supported by the same underlying implementation. My current thinking is that we keep some metadata per block to support caching and persisting blocks. We need to keep track of the last time the block was accessed so we can evict it from the caching store, and we need a flag that indicates whether the block should be counted as part of the `caching` or `persistent` quota.
The algorithm for storing blocks in the repo looks something like the following:

1. When a block is stored:
   1.1 If the block is persistent, we set its `persistent` flag to true
   1.2 We set its `last accessed` field to the current time
   1.3 If the block is `persistent` we count it against `/total/persistentbytes`, otherwise we count it against `/total/cachebytes`
2. When the `caching quota` limit is reached:
   2.1 If the block's `persistent` flag is not set, we'll evict the blocks with the oldest `last accessed` timestamp
   2.2 If the block's `persistent` flag is set, we do nothing
   2.3 If the `caching quota` has been reached and no blocks can be evicted (all blocks are persistent), we should not accept any more blocks - i.e. return an error
4. When a block stops being persistent:
   4.1 We set its `persistent` flag to false
   4.2 Subtract the total bytes from `/total/persistentbytes` and add them to `/total/cachebytes`
   4.3 If the `caching quota` has been reached, kick in the cleanup process as per (3)
5. If the repo has reached its `persistent quota` limit, it should not accept any more blocks - i.e. return an error

Edits:

- `persistent quota` limit behavior

> The functionality I have in mind is as follows - the user sets a quota that the node is allowed to consume, the node will return how much of that quota is still available for sale.
Makes sense. We can update the REST API endpoint to check if there's enough persistent quota available in the datastore.
> The maximum amount allowed to be advertised for sale is `persistent quota - used persistent quota`

@dryajov, will there be a datastore API that allows us to check the amount of persistent storage remaining?
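For illustration, the caching/persistent accounting algorithm discussed above could look roughly like this toy Python model. All names, including the `/total/...` counters as plain attributes and the `put`/`unpersist` methods, are stand-ins for the sketch, not the actual repo implementation.

```python
import time

class Repo:
    """Toy model of the caching/persistent accounting described in this
    thread; attribute names stand in for the /total/cachebytes and
    /total/persistentbytes counters."""

    def __init__(self, caching_quota: int, persistent_quota: int):
        self.caching_quota = caching_quota
        self.persistent_quota = persistent_quota
        self.total_cachebytes = 0       # stand-in for /total/cachebytes
        self.total_persistentbytes = 0  # stand-in for /total/persistentbytes
        self.blocks = {}                # cid -> {size, persistent, last_accessed}

    def put(self, cid: str, size: int, persistent: bool = False) -> None:
        if persistent:
            if self.total_persistentbytes + size > self.persistent_quota:
                raise IOError("persistent quota reached")  # refuse the block
            self.total_persistentbytes += size
        else:
            self._make_room(size)                          # evict before storing
            self.total_cachebytes += size
        self.blocks[cid] = dict(size=size, persistent=persistent,
                                last_accessed=time.monotonic())

    def _make_room(self, size: int) -> None:
        # Candidates: non-persistent blocks, oldest last_accessed first.
        evictable = sorted(
            (item for item in self.blocks.items() if not item[1]["persistent"]),
            key=lambda item: item[1]["last_accessed"])
        while self.total_cachebytes + size > self.caching_quota:
            if not evictable:
                raise IOError("caching quota reached")     # nothing evictable
            cid, meta = evictable.pop(0)
            del self.blocks[cid]
            self.total_cachebytes -= meta["size"]

    def unpersist(self, cid: str) -> None:
        meta = self.blocks[cid]
        meta["persistent"] = False
        self.total_persistentbytes -= meta["size"]         # move the bytes over
        self.total_cachebytes += meta["size"]
        self._make_room(0)                                 # cleanup if over quota

# usage: caching and persistent quotas of 100 bytes each
repo = Repo(caching_quota=100, persistent_quota=100)
repo.put("a", 60)
repo.put("b", 60)                   # evicts "a" to stay within the caching quota
repo.put("p", 80, persistent=True)
repo.unpersist("p")                 # "p" now counts as cached; "b" gets evicted
```

Note how `unpersist` can itself trigger eviction: the freed-from-contract bytes now press on the caching quota, matching the "kick in the cleanup process" step.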
For the FSStore DataStore it is pretty easy to reserve space by creating a RESERVE file of 100 GiB (or whatever). Then you shrink that file as you add real chunks and grow it as you purge, and we can ensure we can always honor the contract. The reserve file size makes implementing `quota-used` just `getFileSize`.
This seems like a simple, serviceable, OS-portable quota system. Most of us, and even many non-expert users, have a lot of experience filling filesystems to 100% and then recovering usable space by deleting/truncating files - successfully, at least when there are no competing writers getting intermittent ENOSPCs. So I think this, too, could work, even with users pushing things to the brink and returning from said brink.
{ Some, but maybe not all, subtleties: for our code we can block writes, but the FS can fill after we shrink RESERVE and before we use the space; selling against a 99+% full FS with alien writers is insoluble. To reserve space, we need to either `posix_fallocate` or write the file with data to pre-allocate it, rather than just make a "hole". Data written to a de-duplicating FS must be random; writing at all should be optional, to not unduly wear out flash memory in testing. There might also need to be a little extra data freed up from RESERVE for directory files. }
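A minimal sketch of the RESERVE-file idea, with illustrative helper names (`init_reserve`, `shrink_reserve`, `quota_used` are not existing APIs): the file is pre-allocated with real data, shrunk as chunks are written, and the used quota falls out of a single file-size lookup.

```python
import os
import tempfile

CHUNK = 1024  # illustrative chunk size in bytes

def init_reserve(path: str, nbytes: int) -> None:
    """Pre-allocate the RESERVE file with random data (not a sparse hole),
    so the space is actually claimed even on a de-duplicating FS."""
    with open(path, "wb") as f:
        f.write(os.urandom(nbytes))

def shrink_reserve(path: str, nbytes: int) -> None:
    """Give back nbytes of reserved space just before writing a real chunk."""
    size = os.path.getsize(path)
    if size < nbytes:
        raise IOError("reservation exhausted")
    os.truncate(path, size - nbytes)

def quota_used(path: str, total: int) -> int:
    """Used quota is just the total reservation minus the RESERVE file size."""
    return total - os.path.getsize(path)

# usage: reserve room for 10 chunks, then consume 3 of them
reserve = os.path.join(tempfile.mkdtemp(), "RESERVE")
init_reserve(reserve, 10 * CHUNK)
for _ in range(3):
    shrink_reserve(reserve, CHUNK)
```

Growing the file back on purge is the mirror operation (append data up to the reclaimed size); `posix_fallocate` would be the cheaper way to claim the space where available.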
For SQLiteStore, big questions arise, some of which @michaelsbradleyjr has started on: "Does `VACUUM` work if the host FS is 100% full, when compaction matters most - and if not, are you just stuck?", "If it works, what is the cost?", "Does DELETE,INSERT in the same table (or any other reservation idea) work as hoped with BLOB types of varying chunk sizes?" Maybe more.
(A user with admin-ish rights on Unix can make an "FS within a file" to at least keep `codex` from crashing unluckily partitioned systems with disk-full errors. Maybe worth mentioning to early adopters as a way to not trust whatever we do here. Apologies if this is old advice. It also has application in testing how well things work as you fill the host FS.)
Initial implementation, related to limits has been done in #319; subsequent work for repo maintenance is being carried out in #347.
I was thinking that for the purpose of deciding when to delete blocks, there could be no difference between cached and contract blocks: cached blocks are kept for a short time in accordance with the caching settings; contract blocks are kept for a longer, but still limited, time in accordance with the contract. I was thinking the expiration datetime for a contract block should be the duration of the contract + some grace period (hours? days? configurable).
I am unsure about whether the cached blocks and contract blocks can be treated identically when it comes to quotas. Do we want a max quota for the entire node, or one quota for caching and another for contract storage? Additionally, I like the idea of marking a block as non-persistent when the contract has expired. Currently, we delete blocks as soon as their expiration date has been reached. Perhaps we could consider a "don't delete unless you have to" strategy, where we fill up the quota on purpose and start cleaning the oldest, least interesting, no-longer-persistent blocks only when we have no other way to respect the quota? Downsides: write performance goes down, because we may need to quickly make some room, and it becomes trickier to answer the question of how much space is available.
Some of the questions raised in this issue have already been addressed by the `repostore` implementation.
> Cached blocks are kept for a short time in accordance with the caching settings. Contract blocks are kept for a longer but still limited time in accordance with the contract.
This is how it works already; the `repostore` treats all blocks equally and only relies on expiration to evict from the store.
> I was thinking the expiration datetime for a contract block should be the duration of the contract + some grace period (hours?, days?, configurable).
This might be a nice addition, a grace period certainly makes sense in some circumstances...
> I am unsure about whether the cached blocks and contract blocks can be treated identically when it comes to quotas.
The `repostore` has a notion of reserved storage and used storage. The marketplace can "reserve" the required amount before announcing/bidding on requests and release it right before writing blocks. This is a super simple but powerful mechanism that should cover all foreseeable use cases.
The flow is super simple: there is a `reserve` call, which will reserve a number of bytes counted against the used quota bytes - it should be called right before bidding on storage requests; and there is a `release` call, which should be called right before storing a block (note that I said block, not blocks - this is to prevent concurrency issues, but if that becomes problematic we can always add some synchronization to it).
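The reserve/release flow can be modeled in a few lines (a hedged Python sketch with invented names, not the actual repostore API):

```python
class QuotaAccount:
    """Toy model of reserve/release quota accounting; the method names are
    invented for illustration, not the actual repostore API."""

    def __init__(self, quota: int):
        self.quota = quota
        self.used = 0      # bytes occupied by stored blocks
        self.reserved = 0  # bytes promised to pending storage requests

    def reserve(self, nbytes: int) -> bool:
        """Call before bidding on a storage request."""
        if self.used + self.reserved + nbytes > self.quota:
            return False  # not enough free quota to bid
        self.reserved += nbytes
        return True

    def release(self, nbytes: int) -> None:
        """Call right before storing a block: move bytes from reserved to used."""
        assert nbytes <= self.reserved, "releasing more than was reserved"
        self.reserved -= nbytes
        self.used += nbytes

    def available(self) -> int:
        return self.quota - self.used - self.reserved

# usage: 1000-byte quota
acc = QuotaAccount(1000)
ok1 = acc.reserve(600)  # accepted
ok2 = acc.reserve(600)  # rejected: would exceed the quota
acc.release(100)        # a 100-byte block is about to be written
```

Releasing one block's worth at a time, as described above, keeps each transition small and sidesteps most concurrency concerns.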
Currently, the local repository doesn't implement any storage limits, and it's possible to fill up the entirety of the user's drive.
We need to add:

- % of bytes available to codex